Recently, demand for high-speed compression has led to new algorithms based on pure Lempel-Ziv compression without entropy coding. Examples are Snappy and LZ4, both of which use byte-based LZ77 algorithms. This leaves significant redundancy on the table that can be exploited for steganography while remaining completely transparent to users of the cover file.

Snappy and LZ4 are two high-speed compression algorithms based on LZ77. They are used in applications where decompression speed is important, and operate within an order of magnitude of uncompressed data speeds (i.e. memory bandwidth). Both encode a stream of bytes as tokens of a variable number of bytes, each encoding a backreference and/or a literal; no explicit entropy coding is used. The backreference often carries redundancy, because the same match occurs at multiple positions in the sliding search window. This redundancy can be used to store a hidden message.
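The redundancy comes from repeated occurrences of the same byte string inside the window: every extra occurrence is an equivalent encoding of the same backreference. A minimal sketch (the helper name is illustrative, not part of either codec's API):

```python
def candidate_matches(window, match):
    """Return every position in the search window where the matched
    bytes also occur; each extra position is an equivalent encoding."""
    positions, start = [], 0
    while True:
        i = window.find(match, start)
        if i == -1:
            return positions
        positions.append(i)
        start = i + 1

# The bytes b"ab" occur three times in this toy window, so a
# backreference to them has three interchangeable offsets.
print(candidate_matches(b"abxabyab", b"ab"))  # [0, 3, 6]
```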

Similar approaches can be, and indeed have been, used with more widely deployed LZ compressors such as DEFLATE/gzip or LZMA. However, their entropy-coding stages (typically Huffman or arithmetic coding) mean that substituting matches changes the file size and decompression time, in proportion to the extra data stored if the original encoder is optimal.

Since neither Snappy nor LZ4 includes entropy coding, changing to a different match within a similar window results in no change to file size and negligible change to decompression speed. Decompression speed may change due to cache effects, but because the search windows used are small enough to fit in the processor's L1 cache (16 KiB, cf. 32 KiB or 64 KiB L1 data caches in Intel and AMD CPU cores, and 16-64 KiB in most mid- to high-end ARM cores), the effect is small.

Algorithm

The algorithm scans a compressed file, decoding it while maintaining a search window like the one used by the compressor. For each backreference token, a list of possible matches is generated. If the list contains more than one entry, i.e. alternatives to the original match exist, one is chosen according to the next bits of the message and the token is updated. When the number of matches is not a power of two, partial bits could in principle be stored; here the count was rounded down to the nearest power of two, but the capacity that partial bits would add was also measured.
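For example, a token with five equivalent matches stores floor(log2 5) = 2 whole bits, while counting the full log2 5 ≈ 2.32 bits gives the partial-bit capacity. A small sketch of both measures:

```python
import math

def whole_bits(n_matches):
    # Bits stored when the choice count is rounded down to a power of two.
    return int(math.log2(n_matches)) if n_matches > 1 else 0

def partial_bits(n_matches):
    # Information-theoretic capacity if fractional bits could be packed.
    return math.log2(n_matches) if n_matches > 1 else 0.0

# Five equivalent matches: 2 whole bits, ~2.32 bits with partial bits.
print(whole_bits(5), round(partial_bits(5), 2))  # 2 2.32
```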

Pseudo-code for writing a message:

def write_message(cover, message):
  for each token in cover:
    if token is match:
      match_list = list_possible_matches(token)
      bits = floor(log2(len(match_list)))
      new_match = match_list[get_message_bits(message, bits)]
      update_cover(cover, token, new_match)
    update_search_window(token)

Pseudo-code for reading a message:

def read_message(cover):
  message = ""
  for each token in cover:
    if token is match:
      match_list = list_possible_matches(token)
      bits = floor(log2(len(match_list)))
      message += to_bits(index(token, match_list), bits)
    update_search_window(token)
  return message
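The two routines can be exercised with a runnable toy that replaces the codec-specific helpers with a fixed candidate list per token (the names and data layout are illustrative, and the message is assumed long enough to fill every token):

```python
import math

def embed(tokens, bits):
    """tokens: one candidate-match list per backreference.
    bits: message as a string of '0'/'1'. Returns the chosen match
    per token and the number of message bits consumed."""
    pos, choices = 0, []
    for cands in tokens:
        k = int(math.log2(len(cands)))  # whole bits this token holds
        idx = int(bits[pos:pos + k], 2) if k else 0
        choices.append(cands[idx])
        pos += k
    return choices, pos

def extract(tokens, choices):
    """Recover the message by re-deriving each candidate list and
    reading back the index of the match actually chosen."""
    out = ""
    for cands, chosen in zip(tokens, choices):
        k = int(math.log2(len(cands)))
        if k:
            out += format(cands.index(chosen), f"0{k}b")
    return out

tokens = [[10, 20, 30, 40], [5, 6], [7]]  # capacities: 2 + 1 + 0 bits
choices, used = embed(tokens, "101")
print(choices, used)             # [30, 6, 7] 3
print(extract(tokens, choices))  # 101
```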

Results

Different file types were compressed with LZ4 and Snappy, and the storage capacity was measured. The files used were enwik8, the first 10**8 bytes of a particular version of Wikipedia text/XML; vmlinux from a recent Linux distribution, a c. 19 megabyte binary file; and lenna.ppm, an ASCII Portable PixMap file containing a popular test image. All three represent files that benefit from simple byte-oriented LZ77 compression.

The uncompressed and compressed sizes, together with the numbers of bytes stored and storable, are shown in Tables 1 and 2. Storable bytes is the capacity when partial bits per match are counted. Table 3 shows the compressed size and the stored data as percentages of original size and cover size, respectively. The highly redundant lenna.ppm file can also store the largest payload. The binary vmlinux file compresses worst and stores the smallest payload, under 5%.

Snappy has a better compression ratio with enwik8 and lenna.ppm owing to its shorter representation of close offsets of less than 2**11 bytes. This also explains its smaller steganographic capacity: when a short offset is used, there are fewer choices. Interestingly, the vmlinux file compressed better with LZ4 and still had more offset redundancy. This is because LZ4 can represent alternating literals and matches very efficiently.
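The size pressure from short offsets can be sketched from the copy-element sizes in the Snappy format description (simplified; long matches are emitted as multiple copies, which is ignored here):

```python
def snappy_copy_size(offset, length):
    """Approximate encoded size in bytes of one Snappy copy element."""
    # Copy with 1-byte offset: length 4..11, offset < 2**11 -> 2 bytes.
    if 4 <= length <= 11 and offset < 2048:
        return 2
    # Copy with 2-byte offset: length 1..64, offset < 2**16 -> 3 bytes.
    if length <= 64 and offset < 65536:
        return 3
    # Copy with 4-byte offset -> 5 bytes.
    return 5

# Moving a match from offset 1500 to 3000 would grow the copy by a byte,
# so only alternatives in the same offset class keep the size unchanged.
print(snappy_copy_size(1500, 8), snappy_copy_size(3000, 8))  # 2 3
```

Since size must stay fixed, candidates on the far side of the 2**11 boundary cannot be used, which shrinks the set of usable alternatives exactly where Snappy compresses best.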

Table 1: Number of bytes with LZ4

File        Size        Compressed  Stored      Storable
enwik8      100000000   56064065    3919102     4372158
vmlinux     19327400    7898215     374370      414794
lenna.ppm   2890108     1900343     347542      380698

Table 2: Number of bytes with Snappy

File        Size        Compressed  Stored      Storable
enwik8      100000000   55829854    2586298     2945139
vmlinux     19327400    8200088     259499      292220
lenna.ppm   2890108     1616291     172860      194701

Table 3: Compression and storage percentages

File        LZ4 size    LZ4 storage    Snappy size    Snappy storage
enwik8      56.1%       7.0%           55.8%          4.6%
vmlinux     40.9%       4.7%           42.4%          3.2%
lenna.ppm   65.8%       18.3%          55.9%          10.7%
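As a consistency check, Table 3's LZ4 columns follow directly from the Table 1 figures (the Snappy columns follow analogously from Table 2):

```python
# (size, compressed, stored) per file, copied from Table 1.
lz4 = {
    "enwik8":    (100000000, 56064065, 3919102),
    "vmlinux":   (19327400,  7898215,  374370),
    "lenna.ppm": (2890108,   1900343,  347542),
}

for name, (size, comp, stored) in lz4.items():
    ratio = 100 * comp / size      # compressed size, % of original
    payload = 100 * stored / comp  # stored bytes, % of cover size
    print(f"{name}: {ratio:.1f}% compressed, {payload:.1f}% payload")
# enwik8: 56.1% compressed, 7.0% payload
# vmlinux: 40.9% compressed, 4.7% payload
# lenna.ppm: 65.8% compressed, 18.3% payload
```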

Code

The source code is available on GitHub under the permissive MIT license.