Let's play the Compression Game
- Alphabet: A = {a,b,c, ... } (Set of tokens)
- Sequence: S = (t1, t2, t3, ...) (String of tokens)
- Size of Alphabet: |A|
- Encodings (ASCII, Unicode, Binary ...)
- Mapping Token Sequences to Turtle Graphics
- Absolute direction corresponding to a token
- Relative direction corresponding to a token
The original image is on the left, runlength code in the center, decoded image on the right.
Each pixel corresponds to one token (or one byte).
Original image on the left, glitched runlength code in the center, decoded image on the right.
Identify repeating Tokens.
- Scale Level: Tokens
- rle(
11111111111111111111
, 1) → 91
+ 91
+ 21
→919192
- rle(
11111222333333333333
, 1) → 51
+ 32
+ 93
+ 33
→51329333
- rle(
12345678912345678912
, 1) → 11
+ 12
+ 13
+ 11
+ 12
+ 13
→112131415161718191112131415161718191112
- rle(
123456789123456789
, -1) → 12
+ 34
+ 56
+ 78
+ 91
+ 23
+ 45
+ 67
+ 89
=2444666668888888133555577777799999999
-
How to save the RLE as a sequence of tokens?
- Fixed Token Size + Maximum Repeat
- Fixed Token Size + Escape Codes (See also: UTF8)
- Seperator Token
Identify Differences between successive Tokens
- Scale Level: Ordered Tokens, Circluar Ordered Tokens
- delta(
11111111111111111111
, 1) →10000000000000000000
- delta(
11111222333333333333
, 1) →10000100100000000000
- delta(
12345678912345678912
, 1) →11111111121111111121
- delta(
00000000000000000000
, -1) →00000000000000000000
- delta(
11111111111111111111
, -1) →1234567890123456789)0
- Identify Sequences of Tokens encountered before
- idx(
123123123123
, 1) →1
+2
+3
+ mem(-3,3) + mem(-3, 3) →0102033333
- idx(
12345678912345678912
) →1
+2
+3
+4
+5
+6
+7
+8
+9
→ mem(9, 9) + mem(9,2) →0102030405060708099992
- idx(
0102033366
) →123
+ idx(3366
) →123123
+ idx(66
) →123123123123
- Replace frequently used tokens by new tokens (i.e. Frequency Encoding)
- Rewrite results as a Grammar
- (1212121212121212121212) → (
33333333333
,3
→12
) - (1122112211221122112211) → (
34343434343
,3
→11
,4
→22
)
- Lempel-Ziv-Welch Encoding (LZW) = Frequency + Index Encoding
- ZIP files!
- Lempel-Ziv-Welch (LZW) on Wikipedia
- The Algorithm on Wikipedia
- Sequitur Website + Live Demo + Code in various languages!
- Inferring Sequential Structure – PHD Thesis of Nevill-Manning (PDF)
- Upwrite Predictor Project Site on Archive.org
- The UpWrite Predictor A General Grammatical Inference Engine for Symbolic Time Series, with Applications in Natural Language Acquisition and Data Compression – PHD Thesis of Jason Hutchens (PDF)
Exercises can be found here.
Solutions can be found here.
- Papers on Grammar Based Coding
- Compression.RU – excellent resource (Russian)