Lossless Compression by Example Part 2: Huffman Coding

Last time we looked at some reasons why lossy compression is considered bad for music, and we looked at one possible quick and dirty way to compress. This time we’ll introduce the concept of lossless compression.

I’ll first point out that even if this seems like a paradoxical notion, everyone already believes it can be done. We use it all the time when we compress files on our computers by zipping them. Of course, this results in a smaller file, but no one thinks when they unzip they will have lost information. This means that there must exist ways to do lossless compression.

Today’s example is a really simple and brilliant way of doing it. It will have nothing to do with music for now, but don’t think of this as merely a toy example. Huffman coding is actually used as a step in mp3 encoding, so it relates to what we’ve been discussing.

Here’s the general idea. Suppose you want to encode (into binary) text in the most naive way possible. You assign A to 0, B to 1, C to 10, D to 11, etc. When you get to Z you’ll have 11001. This means that you have to use 5 bits for every single letter. “CAT” would be 00010 00000 10011.

To encode “CAT” we did something dumb. We only needed 3 letters, so if we had chosen ahead of time a better encoding method, maybe C = 00, A = 01, T = 10, then we could encode the text as 00 01 10. In other words, we compress our data without losing any information by a clever choice of encoding 00010 00000 10011 -> 000110.

I know your complaint already. Any sufficiently long text will contain every letter, so there is no way to do better than that original naive method. Well, you’re just not being clever enough!

Some letters will occur with more frequency than others. So if, for example, the letter “s” occurs with frequency 100 and then the next most frequent letter occurs 25 times, you will want to choose something like “01” for “s”. That way the smallest number of bits is used for the most frequent letters.

Ah, but the astute reader complains again. The reason we couldn’t do this before is because we won’t be able to tell the difference in a long string between two frequent letters: 10 01, and a single less-frequent letter: 1001. This was why we needed all 5 bits when we used the whole alphabet.

This is a uniqueness problem. What we do is not allow “01” to be a prefix on an assigned string once we’ve assigned it. This way, when we encounter 01, we stop. We know that is the letter “s” because no other letter begins “01”.

Of course, what ends up happening is that we have to go to much more than 5 bits for some letters, but the idea is that they will be used with such infrequency and the 2 and 3 bit letters used with such high frequency that it ends up saving way more space than if we stuck to 5.

Now you should be asking two questions: Is it provably smaller and is there some simple algorithm to figure out how to assign a letter to a bit sequence so that the uniqueness and smallness happens? Yes to both!

We won’t talk about proofs, since this is a series “by example.” But I think the algorithm to generate the symbol strings to encode is pretty neat.

Let’s generate the Huffman tree for “Titter Twitter Top” (just to get something with high frequency and several “repeat” frequencies).

First, make an ordered list of the letters and their frequencies: (T:7), (I:2), (E:2), (R:2), (W:1), (O:1), (P:1).

Now we will construct a binary tree with these as leaves. Start with the bottom 2 as leaves and connect them to a parent with a placeholder (*) and the sum of the frequencies. Then insert this new placeholder into the correct place on the list and remove the two you used:

Now repeat the process with the bottom two on the list (if a node is on the list already, use it in the tree):

Keep repeating this process until you’ve exhausted the list and you will get the full binary tree we will use:

Now to work out how to encode each letter, write a 0 on every left edge and a 1 on every right edge. Descend from the top to the letter you want and write the digits in order. This is the encoding. So T = 1, I = 000, R = 010, E = 011, W = 0011, O = 00101, and P = 00100. Test it out for yourself. You will find there is no ambiguity because each string of digits used for a letter never appears as a prefix of another letter.

Also, note that the letter that occurs with the highest frequency is a single bit, and the bits needed gets longer only as the frequency gets less. The encoding for Titter Twitter Top with this Huffman code is 39 bits whereas the naive encoding is 80. This compresses to half the space needed and loses no information!

We won’t get into the tedious details of how computers actually store information to see that there are lots of subtleties we’ve ignored for executing this in practice (plus we have to store the conversion table as part of the data), but at least we’ve seen an example of lossless compression in theory. Also, there was nothing special about letters here. We could do this with basically any information (for example frequencies in a sound file).