Recently, demand for high-speed compression has led to new algorithms based on pure Lempel-Ziv compression without entropy coding. Examples are Snappy and LZ4, both using byte-based LZ77 algorithms. This leaves significant redundancy on the table that can be used for steganography while remaining completely transparent to users of the cover file.
Snappy and LZ4 are two high-speed compression algorithms based on LZ77. They are used in various applications where decompression speed is important, and operate within an order of magnitude of uncompressed data speeds (i.e. memory bandwidth). Both encode a stream of bytes using tokens of a variable number of bytes, each encoding a backreference and/or a literal. No explicit entropy coding is used. The backreference often has redundancy, because there are multiple instances of the same match in the sliding search window. This can be used to store a hidden message.
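As a toy illustration of that redundancy (plain byte scanning, not the actual Snappy or LZ4 token format): if the bytes about to be compressed are "abc" and the search window is the buffer below, the encoder can pick any of four back-offsets that decode to exactly the same output, which is worth log2(4) = 2 hidden bits.
window = b"abcxabcyabczabc"
match = b"abc"
# every back-distance (counted from the end of the window) at which the
# match bytes also occur; any of them reproduces the same decoded output
offsets = [len(window) - i for i in range(len(window))
           if window[i:i + len(match)] == match]
print(offsets)   # [15, 11, 7, 3] -> four equivalent choices, two hidden bits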
Similar approaches can be, and indeed have been, used with more widely deployed LZ compression such as DEFLATE/gzip or LZMA. However, because those formats add entropy coding (typically Huffman or arithmetic coding), embedding data changes the file size and decompression time, in proportion to the amount of extra data stored if the original encoder is optimal.
Since neither Snappy nor LZ4 includes entropy coding, changing to a different match within the search window results in no change to file size and a negligible change to decompression speed. Decompression speed may change due to cache effects, but because the search windows used are small enough to fit in the processor's L1 cache (16 KiB, cf. 32 KiB or 64 KiB L1 data caches in Intel and AMD CPU cores, and 16-64 KiB in most mid-to-high-end ARM cores), the effect is small.
The algorithm scans a compressed file, decoding it and maintaining a search window similar to the one used by the compressor. For each backreference token, a list of possible matches is generated. If it contains more than one entry, i.e. alternatives to the original match are found, one of them is chosen according to the message bits and the token is updated. If the number of matches is not a power of two, partial bits could be stored, but here the number was rounded down to the nearest power of two. The capacity potential of partial bits was measured, however.
Pseudo-code for writing a message:
def write_message(cover, message):
    for each token in cover:
        if token is match:
            match_list = list_possible_matches(token)
            bits = floor(log2(len(match_list)))
            new_match = match_list[get_message_bits(message, bits)]
            update_cover(cover, token, new_match)
        update_search_window(token)
Pseudo-code for reading a message:
def read_message(cover):
    message = ""
    for each token in cover:
        if token is match:
            match_list = list_possible_matches(token)
            bits = floor(log2(len(match_list)))
            message += get_message_bits(index(token, match_list), bits)
        update_search_window(token)
    return message
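The pseudo-code leaves list_possible_matches abstract. A minimal sketch of the idea, under a few assumptions of mine (the decoded history is kept as a plain byte string, offsets are counted back from the current position, overlapping copies with offset < length are left untouched, and the offset limit defaults to 64 KiB as in LZ4), could look like this:
def list_possible_matches(history, offset, length, max_offset=65535):
    # history: bytes decoded so far; (offset, length) is the original match
    if offset < length:
        # overlapping copy (e.g. a short run); leave it alone in this sketch
        return [offset]
    start = len(history) - offset
    target = history[start:start + length]
    candidates = []
    for o in range(length, min(len(history), max_offset) + 1):
        s = len(history) - o
        if history[s:s + length] == target:
            candidates.append(o)
    return candidates
The original offset is always among the candidates, and floor(log2(len(candidates))) gives the number of whole message bits that token can carry.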
Different file types were compressed with LZ4 and Snappy, and the storage capacity was measured. The files used were enwik8, the first 10**8 bytes of a particular version of Wikipedia text/XML; vmlinux from a recent Linux distribution, a roughly 19-megabyte binary file; and lenna.ppm, an ASCII Portable PixMap file containing a popular test image. These all represent files that benefit from simple byte-oriented LZ77 compression.
The uncompressed and compressed sizes, as well as the numbers of bytes stored and storable, are shown in tables 1 and 2. The storable figure is the capacity when partial bits per match are used. Table 3 shows the compressed size and stored data as percentages of the original size and cover size. It can be seen that the highly redundant lenna.ppm file can also store the largest payload. The binary vmlinux file compresses worst and also stores the smallest payload, only around 5%.
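In other words, if each backreference token offers k candidate matches, "stored" rounds every k down to a power of two, while "storable" keeps the fractional bits; with a made-up list of per-token counts:
from math import floor, log2

match_choice_counts = [1, 2, 3, 8]                               # hypothetical per-token candidate counts
stored_bits = sum(floor(log2(k)) for k in match_choice_counts)   # 0 + 1 + 1 + 3 = 5
storable_bits = sum(log2(k) for k in match_choice_counts)        # 0 + 1 + 1.58 + 3 = 5.58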
Snappy achieves a better compression ratio on enwik8 and lenna.ppm owing to its shorter representation of close offsets of up to 2**11 bytes. This also explains its smaller steganographic capacity: there are fewer choices when a short offset is used. Interestingly, the vmlinux file compressed better with LZ4 and still had more offset redundancy. This is because LZ4 can represent alternating literals and matches very efficiently.
Table 1: Number of bytes with LZ4
File       Size       Compressed  Stored   Storable
enwik8     100000000  56064065    3919102  4372158
vmlinux    19327400   7898215     374370   414794
lenna.ppm  2890108    1900343     347542   380698
Table 2: Number of bytes with Snappy
File       Size       Compressed  Stored   Storable
enwik8     100000000  55829854    2586298  2945139
vmlinux    19327400   8200088     259499   292220
lenna.ppm  2890108    1616291     172860   194701
Table 3: Compression and storage percentages
File       LZ4 size  LZ4 storage  Snappy size  Snappy storage
enwik8     56.1%     7.0%         55.8%        4.6%
vmlinux    40.9%     4.7%         42.4%        3.2%
lenna.ppm  65.8%     18.3%        55.9%        10.7%
The source code is available on GitHub under the permissive MIT license.
Move-to-front is an algorithm that codes recently seen symbols with small numbers. It is used as a part of some data compression algorithms, e.g. after a Burrows-Wheeler transform.
I tried to approximate it in a faster way and found an algorithm that I haven’t seen elsewhere (although it’s probably not new).
In move-to-front, the encoder and decoder maintain a list of symbols, ordered from most to least recently seen. When a symbol is seen, its current index in the list is emitted and the symbol is moved to the top of the list; if it is seen again soon, it gets a small index such as 0 or 1, which can be encoded with few bits, e.g. using delta coding.
The simplest implementation is an array, but in the worst case you have to move all the elements of the array to get the least recently seen element to the top of the list. A linked list is just as bad, because to find a symbol you may have to traverse the whole list. A tree can give you logarithmic performance, but at the expense of complexity and memory use.
Here’s a simple array implementation of the encoder in Python:
def mtf_put(s):
    global mtf_ind, mtf_sym
    i = mtf_ind[s]
    for j in range(i):
        tmp = mtf_sym[i-j] = mtf_sym[i-j-1]
        mtf_ind[tmp] += 1
    mtf_sym[0] = s
    mtf_ind[s] = 0
    return i
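The globals are never initialized above; for a byte alphabet one might set them up in identity order and feed bytes through the encoder like this (my example, not part of the original code):
mtf_sym = list(range(256))   # index -> symbol
mtf_ind = list(range(256))   # symbol -> index

indices = [mtf_put(b) for b in b"abracadabra"]
print(indices)   # repeated symbols such as 'a' quickly get small indices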
If the alphabet is small, e.g. bytes, and most of the indices are small, the array implementation performs quite well in practice. However, I wanted to apply it to a large variable-length alphabet, and would like constant-time performance. That's where approximations come into play.
Most of the time in the array implementation is spent moving symbols other than the one encountered to their correct places. The simplest way to avoid that would be to swap the symbol with the top one, but that obviously won't work: the array will not sort, only the most recently used symbol is guaranteed to be at the top of the list, and the rest can be anywhere.
A better try is to make the list cyclic. Put the newly seen symbol in front of the previous list head and update a head pointer. To avoid unbounded list growth, you can take the last symbol of the list and place it where the current symbol used to be. That introduces some "random" noise, but still lets the list sort partially:
def amtf_put(s):
    global N, mtf_ind, mtf_sym, mtf_pos
    i = mtf_ind[s]
    n = (mtf_pos - i) % N
    mtf_pos = (mtf_pos + 1) % N
    tmp = mtf_sym[i] = mtf_sym[mtf_pos]
    mtf_ind[tmp] = i
    mtf_sym[mtf_pos] = s
    mtf_ind[s] = mtf_pos
    return n
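Only the encoder is shown above; a decoder has to mirror the update exactly. A sketch of a matching decoder for this one-move version (it recomputes the position from the received index and then performs the same swaps as amtf_put):
def amtf_get(n):
    global N, mtf_ind, mtf_sym, mtf_pos
    i = (mtf_pos - n) % N        # position of the symbol that was at index n
    s = mtf_sym[i]
    mtf_pos = (mtf_pos + 1) % N
    tmp = mtf_sym[i] = mtf_sym[mtf_pos]
    mtf_ind[tmp] = i
    mtf_sym[mtf_pos] = s
    mtf_ind[s] = mtf_pos
    return s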
How good is this approximation? Intuitively, every time a symbol is moved to the front, another symbol is also moved somewhere close to the front, so the algorithm is "half as good". With the right symbol distribution that is indeed the case, and the average index value roughly doubles (see below).
One guarantee holds: a symbol that has been seen within the previous X symbols is within the first X positions of the list, so the scheme is at least as good as encoding backreferences.
Is there a way to improve it? A small optimization is to leave the list untouched when you see the most recent symbol again, but that only affects things if runs are very common (like after BWT).
A more effective change is to make another move when updating the array. Instead of moving the last symbol to the old position of the current one, move some other symbol from index M to that position, then move the last symbol to position M. However, this should only be done when the current symbol was ahead of index M; otherwise the part of the list from M to the end just gets cycled with no order:
def amtf_put(s):
    global N, M, mtf_ind, mtf_sym, mtf_pos
    i = mtf_ind[s]
    n = (mtf_pos - i) % N
    if n == 0:
        return 0
    elif n < M:
        mpos = (mtf_pos - M) % N
        mtf_pos = (mtf_pos + 1) % N
        tmp = mtf_sym[i] = mtf_sym[mpos]
        mtf_ind[tmp] = i
        tmp = mtf_sym[mpos] = mtf_sym[mtf_pos]
        mtf_ind[tmp] = mpos
    else:
        mtf_pos = (mtf_pos + 1) % N
        tmp = mtf_sym[i] = mtf_sym[mtf_pos]
        mtf_ind[tmp] = i
    mtf_sym[mtf_pos] = s
    mtf_ind[s] = mtf_pos
    return n
Which position M you should choose depends, again, on the symbol distribution. (TODO: Calculate the optimal M for when the symbol distribution is e.g. truncated exponential.)
The same guarantee as above applies – a symbol seen within X symbols is within the first X indices, because symbols can only move back by one slot.
Trying to encode enwik8 using only bytewise MTF results in a mean index of 15.1. The 1-move approximation has a mean index of 34.1, or 33.1 when keeping the list constant for repeated symbols. The 2-move approximation has a minimal mean of 22.1 with M = 68, and all values of M in the range [63,73] give a mean below 22.2.
Calculating the median index instead, the result is 10 for MTF, 14 for 1-move AMTF, and a minimum of 11 for 2-move AMTF (for M in the range [9,20]).
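For reference, the measurement harness looks roughly like this (a reconstruction with the functions above in scope and a byte alphabet, not the exact script used):
N, M = 256, 68
mtf_sym = list(range(N))   # initial order is arbitrary, but the two arrays must agree
mtf_ind = list(range(N))
mtf_pos = N - 1

def mean_index(path, put):
    total = count = 0
    with open(path, "rb") as f:
        for b in f.read():
            total += put(b)
            count += 1
    return total / count

# e.g. mean_index("enwik8", mtf_put) or mean_index("enwik8", amtf_put)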
Running times with PyPy go from 10-11 seconds for MTF down to 1-2 seconds for either AMTF. CPython is well over an order of magnitude slower on this kind of code: 287-288 seconds for MTF, 54-55 seconds for 1-move AMTF and 69-70 seconds for 2-move AMTF.
I've recently been using Mercurial for revision control. I still don't like it as much as git, because it tries so hard to prevent me from modifying history (modifying local history is not a bad idea) and has so many tools that do almost the same thing.
However, the MQ extension is quite nice. Even though it requires you to learn a new set of commands (where in git you'd just use another local branch as a queue), it does make modifying and moving through patches somewhat easier than git rebase -i.
This is not a tutorial on using MQ; for that, see e.g. here. Instead, I'm only looking at using multiple patch queues, for which very little documentation can be found online.
The main command for interacting with queues (as opposed to patches) is the hg qqueue command. Its help output gives a list of the options you use to manage queues:
 -l --list     list all available queues
    --active   print name of active queue
 -c --create   create new queue
    --rename   rename active queue
    --delete   delete reference to queue
    --purge    delete queue, and remove patch dir
Each time you use hg qq --create $QNAME to create a new patch queue and put some patches in it, a new directory is created in .hg/patches-$QNAME/. Each such directory works just like the .hg/patches/ directory does for the default queue (called 'patches').
You move between queues using hg qq $QNAME, but must first ensure no patches are applied, e.g. using hg qpop --all.
To list queues and see which is active, hg qq without options or arguments suffices. So in practice you don't normally need the -l/--list or --active options. The rest should be self-explanatory.
Since each patch queue is treated independently and you can only apply patches from one at a time, you may find you need to move or copy patches between queues. There is no special command for that, but the generic patch import command hg qimport works quite well.
Say you have a patch called 'A' in the default 'patches' queue and you want to copy it to the current queue. You can import it using hg qimport .hg/patches/A from the project's base directory. The patch gets imported and added on top of the current patch, but not applied. Add the -P/--push option to apply it immediately. Using this command you create an independent copy of the patch in the second queue, so any changes in one will not be reflected in the other.
Alternatively, you can let a single copy of a patch be part of several queues using hg qimport -e ../patches/A. This way both queues point to the same patch file from their series/status files. The path part ../patches/ is actually part of the patch name in the second queue, so this is kind of a hack.
Of course, using the same patch file only really works if the patch applies cleanly in both queues (e.g. if it's the first patch in both). Remember to use hg qdelete -k/--keep if you need to delete it from one queue – I use an alias qdelete = qdelete -k. You will also not want to use hg qqueue --purge in this case!
Unfortunately, the patch queues' use of different directories means that each of them has a separate repo if you use versioning, and patches that are part of two queues are separately versioned. You could probably work around this somehow if you really wanted to.
Note that guards are defined per queue. Whichever method you use to include one patch in two queues, you need to define the guards for each patch in each queue. You will also have to use hg qselect separately in each queue you work with when you e.g. move between branches.
I have a mostly-headless HTPC without keyboard, so it is easier to control it remotely. VNC lets you do anything, but for simple things a command line is faster.
My media player of choice is rhythmbox, so there's an easy way to control it: rhythmbox-control. However, it requires an X session, and ssh -X opens a new one. Here's how to use it to control a rhythmbox instance that is already running on the main display:
[code]alias rhythmbox-remote="env DISPLAY=:0.0 rhythmbox-control"[/code]
For example:
[code]ssh -p 1234 computer.local rhythmbox-remote --next[/code]
I wrote a simple natural string order comparison function in C. I later found out I basically reimplemented (a part of) Martin Pool's strnatcmp, so I renamed mine that as well.
The idea is that in many cases the string "foo10bar" should sort after "foo9bar", rather than before it as happens with strcmp.
I also made "foo02bar" sort after "foo1bar", but made sure "foo0.02bar" still sorts before "foo0.1bar" by treating anything beginning with a dot as a decimal fraction.
Martin Pool's version also handles lower/uppercase and leading space, but I didn't need that, so mine is simpler and probably faster.
Examples:
strcmp("1", "2") = -1, strnatcmp("1", "2") = -1
strcmp("12", "1a") = -47, strnatcmp("12", "1a") = 1
strcmp("02", "1") = -1, strnatcmp("02", "1") = 1
strcmp("1 ", "1") = 32, strnatcmp("1 ", "1") = 32
strcmp("0.02", "0.1") = -1, strnatcmp("0.02", "0.1") = -1
strcmp("0.02a", "0.0212a") = 48, strnatcmp("0.02a", "0.0212") = -2
Download: strnatcmp.c (2 KB)
License: MIT license (Expat)