Extremely simple data compression

Here is an extremely simple code for a data compression method in a handful of lines of Python language computer code. For my future self, I have posted it here in hopes of understanding the basics of this method.

It is based on Jarek Duda’s paper, “Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding” which is available at Cornell University Library’s ArXiv project: https://arxiv.org/abs/1311.2540

It’s using his “Uniform asymmetric binary systems (uABS)” which begins on about page 7. Some details on his paper were a bit hard for me to understand so I improvised a little.

Basically you transform your input data, which is a string of 0s and 1s, into a single output integer. That’s called Encoding, its done by repeatedly running the Encoder function C on the input. At each step, you add a single symbol from input, and generate a new integer. The final integer will have less ‘bits’ than the input data if you are clever about looking at whether 0 occurs more often than 1 in the input, and how much more often. Another explanation is here by Duda himself : https://encode.ru/threads/1821-Asymetric-Numeral-System?p=41539&viewfull=1#post41539

The theory, if I understand, is based on the idea that we can look at a Number System, and imagine how much work it takes to write down any given integer in that system. In the decimal system, for example, the number 100 takes 3 digits to write down, and the number ten takes 2. The number 999 takes 3 digits as well. In the binary system, every number also takes a certain number of digits. 2 is ’10’ which takes two digits. 4 is ‘100’ which is three digits. 5 is ‘101’ which is also three digits. Basically, though, there is a pretty regular pattern. You can pretty easily tell how many digits a number will take – and all numbers are treated basically the same. 5 and 6 ‘101’ and ‘110’ have the same number of digits.

In an asymmetric number system, things are different. Some numbers are considered ‘more likely’ than other numbers. Those ‘more likely’ numbers are represented by fewer digits. You can even measure this ‘likelihood’ at runtime. I don’t actually understand the details yet – but you don’t have to to write an implementation. The algorithm is so simple.

If you want to recreate the original data from the ‘encoded’ version, you have to repeatedly use the Decoder function D. It takes in a big integer, and splits it into a smaller integer and a symbol. This repeats, picking off symbols one by one to reproduce the original data. The data will be backwards by the way!

The funny thing is that it only works when there is an unbalanced ratio of 1s and 0s in the input data… if you have a relatively even number of 1s and 0s it won’t compress at all! It basically just flips the bits!! However, later on I will show a sample of 36% compression.

Here is the python code, in the Python 3 version of that computer language.

from fractions import Fraction as F
from collections import Counter
import math

prefprob = 0

def findprobs(data):
	for symbol in data:
		counts[ symbol ]+=1
	probs = {}
	for symbol,count in counts.items():
	return probs

def C(s,x):
	if s==0: return math.ceil(F(x+1,1-p))-1
	elif s==1: return math.floor(F(x,p))
	else: print("error")

def D(x):
	s = math.ceil((x+1)*p)-math.ceil(x*p)
	if s==1: return s, math.ceil(p*x)
	elif s==0: return s, math.ceil((1-p)*x)-1
	else: print("error")
x = 1
probs = findprobs(input)

for s in input:
	x = C(int(s),x)
	print(" x':",x)
finalx = x
stringfinalx = '{0:b}'.format(finalx)

outmsg = ''
xp = x
for t in range(len(input)):
	print("x':",x,end=' ')
	s, x = D(xp)
	outmsg += str(s)
	xp = x
print('input  ',input)
print('input length:',len(input),'encoded length:',len(stringfinalx))
print('compression ratio: {0:.0f}%'.format(ratio*100))

Below is some example output for a small input string. The program prints the progress of the encoding process step by step.

The ‘s’ is the input symbol currently being worked on from the input data, the ‘x’ is the ‘current’ integer we are working with, which is set to 1 at the beginning by the program itself before any processing occurs. x’, also called x-prime, is the ‘new’ x that is created. After x-prime is created, it becomes the new x. The process repeats until all symbols are read from input. The final integer is printed. It should have fewer ‘bits’ than the input data if everything went according to plan.

Then comes decoding. The final integer is now called x-prime. X-prime is split into a single symbol s, and a smaller integer named x. Now, x becomes the new x-prime, and the process repeats. The symbols, together, should be the same as the original input, but backwards. This is how we prove to ourselves that our encoder and decoder work properly.

Finally, it prints some information, including what percentage the data compressed.

don@oysters:~/Downloads$ python t.py 

s: 1 x: 1 x': 3
s: 0 x: 3 x': 5
s: 0 x: 5 x': 8
s: 1 x: 8 x': 29
s: 0 x: 29 x': 41
s: 1 x: 41 x': 150
s: 0 x: 150 x': 207
s: 0 x: 207 x': 285
s: 0 x: 285 x': 393
s: 0 x: 393 x': 541
s: 0 x: 541 x': 745

x': 745 s: 0 x: 541
x': 541 s: 0 x: 393
x': 393 s: 0 x: 285
x': 285 s: 0 x: 207
x': 207 s: 0 x: 150
x': 150 s: 1 x: 41
x': 41 s: 0 x: 29
x': 29 s: 1 x: 8
x': 8 s: 0 x: 5
x': 5 s: 0 x: 3
x': 3 s: 1 x: 1

input   10010100000
probabilities: {'1': Fraction(3, 11), '0': Fraction(8, 11)}
encoded 1011101001
decoded 10010100000
compression ratio: 91%

You can see here we have basically shrunk the input data by one bit out of eleven bits, or roughly a 91% compression ratio.

That’s interesting, but what if we use different input? How about a simple bitmap of the logo of the University of Oklahoma?


Now we have 22 ‘1’s out of 333 total input symbols. After re-running, the following output shows:

input length: 333 encoded length: 121
compression ratio: 36%

36%! That’s not too shabby for such a simple encoder/decoder.

Hope that helps someone get a grasp on Asymmetric Number System compression (i.e., future me.). Hopefully I didn’t mess it up. Thanks for reading.