User-Transparent LZ77 Steganography
Recently, demand for high-speed compression has lead to new algorithms based on pure Lempel-Ziv compression without entropy coding. Examples are Snappy and LZ4, both using byte based LZ77 algorithms. This leaves significant redundancy on the table that can be used for steganography while remaining completely transparent to the users of the cover file.
Snappy and LZ4 are two high speed compression algorithms that are based on LZ77. They are used in various applications where decompression speed is important, and operate at speeds within an order of magnitude from uncompressed data speeds (i.e. memory bandwidth). Both encode a stream of bytes using tokens of a variable number of bytes that encode a backreference and/or a literal. No explicit entropy encoding is used. The backreference often has redundancy, because there are multiple instances of the same match in the sliding search window. This can be used to store a hidden message.
Similar approaches can be, indeed have been, used with more widely used LZ compression like DEFLATE/gzip or LZMA, but due to entropy encoding (typically Huffman or arithmetic coding) that results in changes to file size and decompression time, proportional to the extra data stored if the original encoder is optimal.
Since neither Snappy nor LZ4 include entropy coding, changing to a different match within a similar window results in no change to file size and negligible changes to decompression speed. The decompression speed may change due to cache effects, but because the search windows used are small enough to fit in the processor's L1 cache (16KiB, cf. 32KiB or 64KiB L1 data cache in Intel and AMD CPU cores, and 16-64KiB in most mid-high end ARM cores), the effect is small.
Algorithm
The algorithm used scans a compressed file, decoding it and maintaining a search window similar to that used in the compressor. On each backreference token, a list of possible matches is generated. If the number of them is larger than one, i.e. alternatives to the original one are found, one of them is chosen according to message bits and the token is updated. If the number of matches is not a power of two, partial bits could be stored, but here the number was rounded down to the nearest power of two. The capacity potential of partial bits was measured, however.
Pseudo-code for writing a message:
def write_message(cover, message): for each token in cover: if token is match: match_list = list_possible_matches(token) bits = floor(log2(len(match_list))) new_match = match_list[get_message_bits(message, bits)] update_cover(cover, token, new_match) update_search_window(token)
</code>Pseudo-code for reading a message:
def read_message(cover): for each token in cover: if token is match: match_list = list_possible_matches(token) bits = floor(log2(len(match_list))) message += get_message_bits(index(token, match_list), bits)] update_search_window(token) return message
</code>Results
Different file types were compressed with LZ4 and Snappy, and the storage capacity measured. The files used were enwik8, the first 10**8 bytes of a particular version of Wikipedia text/XML; vmlinux from a recent Linux distribution, a c. 19 megabyte binary file; and lenna.ppm, an ASCII Portable PixMap file containing a popular test image. These all represent files that do benefit from simple byte oriented LZ77 compression.
The uncompressed and compressed sizes as well as the number of bytes stored and storable are shown in tables 1 and 2. Storable bytes is the number for when partial bits per match are used. Table 3 shows the compressed size and stored data as percentages or original size and cover size. It can be seen that the high redundancy lenna.ppm file can also store the largest payload. The binary vmlinux file compresses worst and also stores the smallest payload, only 5%.
Snappy has a better compression ratio with enwik8 and lenna.ppm owing to its shorter representation of close offsets of up to 10**11 bytes. This also explains the smaller steganographic storage, because there are fewer choices when a short offset is used. Interestingly, the vmlinux file compressed better with LZ4 and still had more offset redundancy. This is because LZ4 can represent alternating literals and matches very efficiently.
Table 1: Number of bytes with LZ4
File Size Compressed Stored Storable enwik8 100000000 56064065 3919102 4372158 vmlinux 19327400 7898215 374370 414794 lenna.ppm 2890108 1900343 347542 380698
</code>Table 2: Number of bytes with Snappy
File Size Compressed Stored Storable enwik8 100000000 55829854 2586298 2945139 vmlinux 19327400 8200088 259499 292220 lenna.ppm 2890108 1616291 172860 194701
</code>Table 3: Compression and storage percentages
File LZ4 size LZ4 storage Snappy size Snappy storage enwik8 56.1% 7.0% 55.8% 4.6% vmlinux 40.9% 4.7% 42.4% 3.2% lenna.ppm 65.8% 18.3% 55.9% 10.7%
</code>Code
The source code is available on github under the permissive MIT license.