Glider – Jan Varho

Posts in category Programming:

Next on Pages

  • next.js
  • cloudflare
  • js
  • typescript
  • serverless
  • react

After my foray into serverless-nextjs I happily used it for personal projects for the last few years. However, Next.js moves fast and the project seems to have stalled, so I had to look into alternatives.

I've already used Cloudflare as a CDN and domain host, so I was happy to find Next-on-pages, which is a way to run Next.js serverlessly on Cloudflare Pages.

Setting up a new test project is easy, using the guide and specifically just running:

npm create cloudflare@latest test-app -- --framework=next

The interactive questions take care of everything and result in a nice development experience.

However, porting an existing site/app was a bit more of a hassle.

  1. First I had to upgrade to the latest next.js release, since the one I'd been using with sls-next was too old. Luckily there wasn't anything too bad.

  2. The next step was replacing sls with @cloudflare/next-on-pages, rewriting next.config.js and changing the runtime from serverless to edge. The last was a bit confusing, since the documentation never spelled out how different things are between the app and pages routers:

    The only way I managed to make it work was using:

     export const runtime = 'experimental-edge'
    

    On every page.

  3. After many an invocation of npx @cloudflare/next-on-pages I managed to make everything build. (Note! You need a "build" command in package.json which only runs next build; mine was left over from sls-next and did... other things. There's a sketch of what I mean below this list.) Deploying was easy using the Cloudflare Dashboard.

  4. Once everything was provisionally working I of course wanted to use all the new features from next.js so it was another porting operation from pages/ to app/. Luckily that's reasonably well documented.
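
To make the build note in step 3 concrete, the relevant part of package.json can be as small as this (the dev and pages:build entries are only illustrative, the important part is that "build" runs nothing but next build):

{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "pages:build": "npx @cloudflare/next-on-pages"
  }
}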

All in all my experience is so far positive. Deployments are especially pleasant: every git push is built and deployed to a temporary URL, and a bot posts status reports and links to pull requests.

OTOH: next.js caching is increasingly confusing with all the new features.

Serverless Next.js – React Without Hassle

  • serverless
  • react
  • next.js
  • aws

Recently I've mostly used React for front end development, but also used the serverless framework quite extensively for back end, API and event based flows. So finding Serverless Next.js was nice.

It's a serverless component which makes deploying a Next.js application to AWS CloudFront/S3/Lambda a piece of cake. No servers, no instances, no fixed costs beyond data storage. So I ported a couple of sites I manage outside work (including this one) over and liked it very much.

Getting started is as easy as running create-next-app and creating a two-line serverless.yml file.
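
That two-line serverless.yml is essentially just a component declaration, something like the following (the key name is arbitrary):

myNextApplication:
  component: serverless-next.js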

There were a few minor issues I ran into and had to file some bug reports and pull requests for, but those were corner cases due to my specific requirements. Definitely something I am planning to use going forward.

User-Transparent LZ77 Steganography

  • LZ77
  • steganography
  • compression
  • LZ4
  • algorithm
  • entropy coding

Recently, demand for high-speed compression has led to new algorithms based on pure Lempel-Ziv compression without entropy coding. Examples are Snappy and LZ4, both using byte-based LZ77 algorithms. This leaves significant redundancy on the table that can be used for steganography while remaining completely transparent to the users of the cover file.

Snappy and LZ4 are two high speed compression algorithms that are based on LZ77. They are used in various applications where decompression speed is important, and operate at speeds within an order of magnitude from uncompressed data speeds (i.e. memory bandwidth). Both encode a stream of bytes using tokens of a variable number of bytes that encode a backreference and/or a literal. No explicit entropy encoding is used. The backreference often has redundancy, because there are multiple instances of the same match in the sliding search window. This can be used to store a hidden message.

Similar approaches can be, and indeed have been, used with more widely used LZ formats like DEFLATE/gzip or LZMA, but because those use entropy coding (typically Huffman or arithmetic coding), hiding data changes the file size and decompression time in proportion to the amount of extra data stored, assuming the original encoder is optimal.

Since neither Snappy nor LZ4 include entropy coding, changing to a different match within a similar window results in no change to file size and negligible changes to decompression speed. The decompression speed may change due to cache effects, but because the search windows used are small enough to fit in the processor's L1 cache (16KiB, cf. 32KiB or 64KiB L1 data cache in Intel and AMD CPU cores, and 16-64KiB in most mid-high end ARM cores), the effect is small.

Algorithm

The algorithm used scans a compressed file, decoding it and maintaining a search window similar to that used in the compressor. On each backreference token, a list of possible matches is generated. If the number of them is larger than one, i.e. alternatives to the original one are found, one of them is chosen according to message bits and the token is updated. If the number of matches is not a power of two, partial bits could be stored, but here the number was rounded down to the nearest power of two. The capacity potential of partial bits was measured, however.

Pseudo-code for writing a message:

def write_message(cover, message):
    for each token in cover:
        if token is match:
            match_list = list_possible_matches(token)
            bits = floor(log2(len(match_list)))
            new_match = match_list[get_message_bits(message, bits)]
            update_cover(cover, token, new_match)
        update_search_window(token)


Pseudo-code for reading a message:

def read_message(cover):
    for each token in cover:
        if token is match:
            match_list = list_possible_matches(token)
            bits = floor(log2(len(match_list)))
            message += get_message_bits(index(token, match_list), bits)
        update_search_window(token)
    return message

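A rough Python sketch of what list_possible_matches has to do (my illustration under simplifying assumptions, a fixed-size window and non-overlapping matches only, not the actual implementation):

def list_possible_matches(data, pos, offset, length, window=65536):
    # Bytes the original backreference (offset, length) points to
    target = data[pos - offset : pos - offset + length]
    start = max(0, pos - window)
    # Any earlier window position holding the same bytes gives an
    # equivalent offset the decoder cannot distinguish from the original
    return [pos - i for i in range(start, pos - length + 1)
            if data[i:i + length] == target]

The number of hidden bits that fit in one token is then floor(log2(len(match_list))), as in the pseudo-code above.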

Results

Different file types were compressed with LZ4 and Snappy, and the storage capacity measured. The files used were enwik8, the first 10**8 bytes of a particular version of Wikipedia text/XML; vmlinux from a recent Linux distribution, a c. 19 megabyte binary file; and lenna.ppm, an ASCII Portable PixMap file containing a popular test image. These all represent files that do benefit from simple byte oriented LZ77 compression.

The uncompressed and compressed sizes as well as the number of bytes stored and storable are shown in tables 1 and 2. Storable bytes is the number for when partial bits per match are used. Table 3 shows the compressed size and stored data as percentages of original size and cover size. It can be seen that the high-redundancy lenna.ppm file can also store the largest payload. The binary vmlinux file compresses worst and also stores the smallest payload, only 5%.

Snappy has a better compression ratio with enwik8 and lenna.ppm owing to its shorter representation of close offsets of up to 2**11 bytes. This also explains the smaller steganographic storage, because there are fewer choices when a short offset is used. Interestingly, the vmlinux file compressed better with LZ4 and still had more offset redundancy. This is because LZ4 can represent alternating literals and matches very efficiently.

Table 1: Number of bytes with LZ4

File       Size       Compressed  Stored   Storable
enwik8     100000000  56064065    3919102  4372158
vmlinux    19327400   7898215     374370   414794
lenna.ppm  2890108    1900343     347542   380698

Table 2: Number of bytes with Snappy

File       Size       Compressed  Stored   Storable
enwik8     100000000  55829854    2586298  2945139
vmlinux    19327400   8200088     259499   292220
lenna.ppm  2890108    1616291     172860   194701

Table 3: Compression and storage percentages

File       LZ4 size  LZ4 storage  Snappy size  Snappy storage
enwik8     56.1%     7.0%         55.8%        4.6%
vmlinux    40.9%     4.7%         42.4%        3.2%
lenna.ppm  65.8%     18.3%        55.9%        10.7%

Code

The source code is available on GitHub under the permissive MIT license.

Fast Approximate Move-to-Front

  • algorithm
  • coding
  • MTF
  • pypy
  • Python

Move-to-front is an algorithm for coding recently seen symbols with small numbers. It's used as a part of some data compression algorithms, e.g. after a Burrows-Wheeler transform.

I tried to approximate it in a faster way and found an algorithm that I haven’t seen elsewhere (although it’s probably not new).

Move-to-Front

In move-to-front, the encoder and decoder maintain a list of symbols, from most to least recently seen. When a symbol is seen, it is moved to the top of the list and gets index 0 (or 1), so if it is seen again soon it can be encoded with few bits, e.g. using delta coding.

The simplest implementation is an array, but in the worst case you have to move all the elements of the array to get the least recently seen element to the top of the list. A linked list is just as bad, because to find a symbol you may have to traverse the whole list. A tree can give you logarithmic performance, but at the expense of complexity and memory use.

Here’s a simple array implementation of the encoder in Python:

def mtf_put(s):
    global mtf_ind, mtf_sym

    i = mtf_ind[s]

    for j in range(i):
        tmp = mtf_sym[i-j] = mtf_sym[i-j-1]
        mtf_ind[tmp] += 1

    mtf_sym[0] = s
    mtf_ind[s] = 0

    return i
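
For completeness, the state these snippets assume (my sketch, for a byte alphabet) is just two arrays initialized to the identity permutation, plus a head pointer used by the approximate variants below:

N = 256
mtf_sym = list(range(N))  # index -> symbol
mtf_ind = list(range(N))  # symbol -> index
mtf_pos = 0               # head pointer for the cyclic (approximate) versions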

If the alphabet is small, e.g. bytes, and most of the indices are small, the performance of the array implementation is in practice quite good. However, I wanted to apply it to a large variable-length alphabet and would like constant-time performance. That's where approximations come into play.

Approximate Move-to-Front

Most of the time in the array implementation is spent on moving symbols other than the one encountered to their correct places. The simplest way to avoid that would be to simply swap the symbol with the top one, but that obviously won’t work – the array will not sort, only the most recently used symbol is guaranteed to be at the top of the list, the rest can be anywhere.

A better try is to make the list cyclic. Put the newly seen symbol in front of the previous list head and update a head pointer. To avoid unbounded list growth, you can take the last symbol of the list and place it where the current symbol used to be. That introduces some "random" noise, but still lets the list sort partially:

def amtf_put(s):
    global N, mtf_ind, mtf_sym, mtf_pos

    i = mtf_ind[s]
    n = (mtf_pos - i) % N

    mtf_pos = (mtf_pos + 1) % N
    tmp = mtf_sym[i] = mtf_sym[mtf_pos]
    mtf_ind[tmp] = i

    mtf_sym[mtf_pos] = s
    mtf_ind[s] = mtf_pos

    return n

How good is this approximation? Intuitively, every time a symbol is moved to front, another symbol is also moved somewhere close to the front, so the algorithm is "half as good". With the right symbol distribution that is indeed the case, and the average index value roughly doubles (see below).

There is also a hard guarantee: a symbol that's been seen within the previous X symbols is within the first X positions of the list, so it is at least as good as encoding backreferences.

Better Approximation?

Is there a way to improve it? A small optimization is to leave the list untouched when you see the most recent symbol again, but that only affects things if runs are very common (like after BWT).

A more effective change is to make another move when updating the array. Instead of moving the last symbol to the old position of the current one, move some other symbol from index M to that position, then move the last to position M. However, this should only be done when the current symbol was ahead of the index M, otherwise the list from M to the end is just being cycled with no order:

def amtf_put(s):
    global N, M, mtf_ind, mtf_sym, mtf_pos

    i = mtf_ind[s]
    n = (mtf_pos - i) % N

    if n == 0:
        return 0
    elif n < M:
        mpos = (mtf_pos - M) % N
        mtf_pos = (mtf_pos + 1) % N
        tmp = mtf_sym[i] = mtf_sym[mpos]
        mtf_ind[tmp] = i
        tmp = mtf_sym[mpos] = mtf_sym[mtf_pos]
        mtf_ind[tmp] = mpos
    else:
        mtf_pos = (mtf_pos + 1) % N
        tmp = mtf_sym[i] = mtf_sym[mtf_pos]
        mtf_ind[tmp] = i

    mtf_sym[mtf_pos] = s
    mtf_ind[s] = mtf_pos

    return n

Which position M you should choose depends, again, on the symbol distribution. (TODO: Calculate the optimal M for when the symbol distribution is e.g. truncated exponential.)

The same guarantee as above applies – a symbol seen within X symbols is within the first X indices, because symbols can only move back by one slot.

Some Data

Trying to encode enwik8 using only bytewise MTF results in a mean index of 15.1. The 1-move approximation has a mean index of 34.1, or 33.1 when keeping the list constant with repeated symbols. The two move approximation has a minimal mean of 22.1 with M = 68 and values of M in the range [63,73] all have mean less than 22.2.

Calculating the median index instead, for MTF the result is 10, for 1-move AMTF 14, and for 2-move AMTF a minimum of 11 (for M in the range [9,20]).

Running times with pypy go from 10-11 seconds for MTF to 1-2 seconds for either AMTF. CPython is orders of magnitude slower on this kind of code, 287-288 seconds for MTF, 54-55 seconds for 1-move AMTF and 69-70 seconds for 2-move AMTF.
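
A measurement loop for reproducing these kinds of statistics could look roughly like this (a sketch, not the original benchmark code; the arrays are initialized as above before each run):

def mean_index(data, put):
    # Feed every byte through an encoder (mtf_put or amtf_put)
    # and average the indices it returns
    total = 0
    for b in data:
        total += put(b)
    return total / len(data)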

Working With Multiple Mercurial Queues (hg mqs)

  • Mercurial
  • hg
  • Mercurial queues

I've recently been using mercurial for revision control. I still don't like it as much as git, because it tries so hard to prevent me from modifying history (modifying local history is not a bad idea) and has so many tools that do almost the same thing.

However, the MQ extension is quite nice. Even though it requires you to learn a new set of commands (where in git you'd just use another local branch as a queue), it does make modifying and moving through patches somewhat easier than git rebase -i.

This is not a tutorial in using MQs, for that see e.g. here. Instead, I'm only looking at using multiple patch queues for which very little documentation can be found online.

Creating and Moving Between Queues

The main command for interacting with queues (as opposed to patches) is the hg qqueue command. Its help output gives a list of the options you use to manage queues:

 -l --list     list all available queues
    --active   print name of active queue
 -c --create   create new queue
    --rename   rename active queue
    --delete   delete reference to queue
    --purge    delete queue, and remove patch dir

Each time you use hg qq --create $QNAME to create a new patch queue and put some patches in it, a new directory is created in .hg/patches-$QNAME/. Each such directory works just like the .hg/patches/ directory does for the default queue (called 'patches').

You move between queues using hg qq $QNAME, but must first ensure no patches are applied, e.g. using hg qpop --all.

To list queues and see which is active, hg qq without options or arguments suffices. So in practice you don't normally need the -l/--list or --active options. The rest should be self-explanatory.
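
Putting that together, a typical switch between queues looks something like this (queue and patch names are just examples):

hg qpop --all            # unapply everything in the current queue
hg qq --create featurex  # create a new queue and make it active
hg qnew some-patch       # patches now go to .hg/patches-featurex/
hg qpop --all            # pop again before switching...
hg qq patches            # ...back to the default queue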

Moving/Copying Patches Between Queues

Since each patch queue is treated independently and you can only apply patches from one at a time, you may find you need to move or copy patches between queues. There is no special command for that, but the generic patch import command hg qimport works quite well.

Say you have a patch called 'A' in the default 'patches' queue and you want to copy it to the current queue. You can import it using hg qimport .hg/patches/A from the project's base directory. The patch gets imported and added on top of the current patch, but not applied. Add the -P/--push option to apply it immediately. Using this command you create an independent copy of the patch in the second queue, so any changes in one will not be reflected in the other.

Alternatively, you can let a single copy of a patch be part of several queues using hg qimport -e ../patches/A. This way both queues point to the same patch file from their series/status files. The path part ../patches/ is actually part of the patch name in the second queue, so this is kind of a hack.

Of course, using the same patch file only really works if the patch applies cleanly in both queues (e.g. if it's the first patch in both). Remember to use hg qdelete -k/--keep if you need to delete it from one queue – I use an alias qdelete = qdelete -k. You will also not want to use hg qqueue --purge in this case!

Problems or Pitfalls

Unfortunately, the patch queues' use of different directories means that each of them has a separate repo if you use versioning, and patches that are part of two queues are separately versioned. You could probably work around this somehow if you really wanted to.

Note that guards are defined per queue. Whichever method you use to include one patch in two queues, you need to define the guards for each patch in each queue. You will also have to use hg qselect separately in each queue you work with when you e.g. move between branches.

5 Card Poker Hand Evaluation

  • BlitzMax
  • evaluation
  • graph
  • hand
  • poker
  • public domain
  • source

I found a bug in my 7 card poker evaluator generator, which means that the tables I posted earlier, while valid and usable, are not optimal. I haven't yet rewritten the code, but the difference should be on the order of a couple of percent. Meanwhile, I used the same procedure I outlined to create a table-based representation for a five card evaluator.

The tables and a BlitzMax file are included in a compressed archive. Again, the code is not yet clean enough for me to post it here, but I'm working on it.

Download: eval5.7z (138 KB 7-zip archive) License: public domain (none)

7 Card Poker Hand Evaluation

  • BlitzMax
  • C
  • card
  • data
  • evaluation
  • graph
  • hand
  • poker

Over the past few days I've looked at poker hand evaluation, specifically 7-card evaluation for games like Texas Hold'em. What I did is based on three evaluators I've studied: Cactus Kev's 5-card evaluator, Paul Senzee's 7-card hash function and the "2+2 evaluator" by Ray Wotton et al.

First a little background: There are (52 choose 7) = c. 134 million 7 card combinations, but fewer than 7500 hands with a unique value as ranked by Kevin Suffecool in the above link. The problem at hand is getting from a representation of seven individual cards to a value that can be compared to other hands - and doing that fast.
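
As a quick sanity check of those counts (an aside, not from the original post; math.comb needs Python 3.8+):

from math import comb

print(comb(52, 7))  # 133784560, i.e. about 134 million 7-card combinations
print(comb(52, 5))  # 2598960 5-card hands, collapsing to 7462 distinct values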

Of the three evaluators, I especially like what Paul Senzee did: he created a perfect minimal hash function that maps a 52-bit integer with seven bits set to a unique number. This can be used for a single table lookup from a 134 million element table of hand values. The downsides are that the table is huge and that the hash function is slow in comparison.

The "Cactus Kev" evaluator is a 5-card evaluator, but can naively be used to find the maximum of the 21 5-card hands in a 7-card hand. That is of course even slower, but can be used in the table generation phase of another evaluator. His code can also most easily be pasted into another project. The code files are available at the link I gave.

In practical use, the 2+2 evaluator works like magic: seven table lookups and you are done. However, the table generation code is a mess and the table is rather large at over 100 MB. In reality, the one table represents a directed acyclic graph of seven layers, with each link representing a card. This is as close as one can come to "solving" 7-card poker.

Constructing a graph for a card sequence would mean ( 52! / (52-n)! ) nodes at the nth layer; however, the order does not matter in poker hands. The idea is the same as behind Paul Senzee's method: the nodes representing Ah7s and 7sAh can be joined. Do the math and you again have (52 choose n) nodes at layer n.

However, there are also other nodes that are equivalent. The trick to the 2+2 evaluator is in the MakeID function of the code I linked to: suits don't always matter! For example, take a four card hand AcKdQhJs. None of the suits matter, because flush is no longer possible. Joining such nodes results in dramatic space savings.

If it were a 5 card evaluator, I believe the 2+2 evaluator (a bit tweaked) would generate an optimal DAG representation for the problem. However, as it is, it ignores some redundancy: take any ready flush - the off suit cards don't matter. Similarly, some other made hands have redundancies. A more significant memory loss is due to concatenation of all tables into one.

The way I would like to generate the DAG is: First, create it for 52 choose n, taking all cards into account. Second, apply a generic procedure to join any nodes that can be joined. This would again result in an optimal DAG, though the representation can be improved. Keeping the layers as separate tables allows for more efficient storage of those tables that don't require a full 32-bit integer for their values.

Unfortunately, the final layer would require several gigabytes (12?) at first, so it is not yet practical. I had to keep the 2+2 suit hack in, ugly though it may be. After that, layers 1 to 5 can be optimized no further, but the last two can (on the order of ten percent). As far as I can tell, the algorithm for doing that has to be at least of second order with respect to the number of nodes. I did the straightforward thing and it is indeed slow, but only has to be run once.

Storing tables 2-6 using 32-bit Ints and table 7 using 16-bit Shorts, the total space taken is 80 MB, in contrast to over 120 MB for the original 2+2 evaluator. The first table simply contains the index multiplied by 52, so it isn't really needed. (It annoys me greatly that table 2 almost fits in Shorts.) The 7-zip compressed file I've attached contains the tables that can be used from any programming language and samples in C and BlitzMax. I'll publish the generating code once it looks presentable.

I'm rather confident an array-based DAG is the fastest way to do an exhaustive iteration of hands in order - minimal testing shows all 134 million ordered 7-card hands can be evaluated in less than a second. However, I'm unsure whether a heavily memory dependent algorithm is the best for a Monte-Carlo simulation. It would seem that a random order would favor something with fewer lookups and thus fewer cache misses. After all, tables 5-7 are all too large to fit in current generation processor caches of up to a few MB (my previous generation dual-core has 3 MB total).

Download: eval7.7z (7.8 MB 7-zip compressed archive) License: public domain (none)

Python Raytracer in 50 Lines of Code

  • Python
  • raytracer

I've come to like the expressiveness of the Python language, though I dislike some of the other features. Here's a 50 line raytracer that renders scenes of colored spheres to a PNG file:

import math
from PIL import Image

def sub(v1, v2): return [x-y for x,y in zip(v1, v2)]
def dot(v1, v2): return sum([x*y for x,y in zip(v1, v2)])
def norm(v): return [x/math.sqrt(dot(v,v)) for x in v]

def trace_sphere(r, s, tmin, color):
    sr = sub(s[0], r[0])
    dsr = dot(sr, r[1])

    d = dsr*dsr - dot(sr,sr) + s[1]
    if d < 0: return
    d = math.sqrt(d)

    t1 = dsr + d
    if t1 < 0: return

    t2 = dsr - d
    if t2 > tmin[0]: return

    if t2 < 0: tmin[0] = t1
    else: tmin[0] = t2
    color[:] = [s[2]]

def trace_ray(ray, spheres):
    color = [(0, 0, 0)]
    tmin = [100000]
    for s in spheres: trace_sphere(ray, s, tmin, color)
    return color[0]

red = (255,0,0)
blue = (0,0,255)
spheres = ( ((0,0,100), 100.0, red), ((0,-5,50), 20.0, blue) )

w, h = 200, 200

image = Image.new("RGB", (w, h))
pix = image.load()

for x in range(w):
    for y in range(h):
        ray = ( (0,0,0), norm(((x-w/2.0)/w, (y-h/2.0)/w, 1)) )
        pix[x,y] = trace_ray(ray, spheres)

image.save("out.png")

Ok, it's more like 30 significant lines to tell the truth.

Comparison of LZMA and zlib (Image Data)

  • comparison
  • compression
  • images
  • LZMA
  • zlib

I created a BlitzMax program to filter* and compress images using either LZMA SDK or zlib. Here are the results for three test images I picked up from Wikipedia: Mona Lisa (painting), The Gunk (pixel art) and Grad1 (gradient). All compression was done with level 9 (smallest size).


Image      Size (BMP)  Size (zlib)     Size (LZMA)
Painting   612 KB      347 KB = 56.7%  287 KB = 46.9% (17.3% better)
Pixel art  93 KB       13 KB = 13.4%   11 KB = 11.4% (15.2% better)
Gradient   63 KB       28 KB = 44.9%   23 KB = 36.3% (18.9% better)

Image      Compress (zlib)  Compress (LZMA)
Painting   ~160ms           ~630ms
Pixel art  ~70ms            ~90ms
Gradient   ~45ms            ~75ms

Image      Uncompress (zlib)  Uncompress (LZMA)
Painting   ~40ms              ~40ms
Pixel art  ~1ms               ~1s
Gradient   ~3ms               ~3ms

Times are not very exact for such small operations, but the results seem to suggest that LZMA compresses smaller but slower, and uncompresses about as fast as zlib. That makes LZMA a good choice for precompressed data, while zlib may be better for real-time compression.

Note: Filtering times are not included.

*The filter uses Paeth prediction similar to PNG.
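
The BlitzMax filter itself isn't shown here, but the Paeth predictor it alludes to is the one from the PNG specification; in Python it reads:

def paeth_predict(a, b, c):
    # a = left, b = above, c = upper-left neighbour of the current sample
    p = a + b - c                                   # initial estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

The filter then stores each sample minus its prediction, which is what makes the image data compress better.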

Ray-Sphere Intersection

  • Java
  • raytracer
  • algorithm

I'm writing a small raytracer to familiarize myself with Java. Obviously many of the main algorithms need to be as fast as possible in order to eventually get to real time speeds.

Here's my optimized version of a ray-sphere intersect check. It can probably be tweaked further: the current cost is 9 additions and 7 multiplications in the best case, and 12 additions, 7 multiplications and a square root in the worst case.

The method only calculates t - the coefficient of the ray direction vector. That's because there is no use calculating the actual intersect position before the smallest t has been found.

Feel free to use however you like.

public double intersectRay(double[] ro, double[] rd, double minDist) {
    double px = ro[0]-position[0];
    double py = ro[1]-position[1];
    double pz = ro[2]-position[2];
    double b0 = rd[0]*px + rd[1]*py + rd[2]*pz;
    double d0 = b0*b0 + size - px*px - py*py - pz*pz;

    // No intersect
    if (d0 < 0) return 0;

    b0 = -b0;
    d0 = Math.sqrt(d0);
    double t = b0 - d0;
    if (t > EPSILON) { // t0
        // Too far
        if (t > minDist) return 0;
    } else { // t1
        t = b0 + d0;

        // Behind or too far
        if (t <= EPSILON || t > minDist) return 0;
    }

    // Valid intersect
    return t;
}