Thursday, January 5, 2012

Huffman Codes

Background

When we encode characters in computers, we assign each an 8-bit code based on an ASCII chart. But in most files, some characters appear more often than others. So wouldn't it make more sense to assign shorter codes for characters that appear more often and longer codes for characters that appear less often?

This is exactly what Claude Shannon and R.M. Fano were thinking when they created one of the first compression algorithms in the late 1940s. In 1952, however, D.A. Huffman published a paper with a slightly improved algorithm, and the appropriately named Huffman coding soon superseded Shannon-Fano coding.

Huffman coding is a technique for compressing data. Huffman's greedy algorithm looks at the frequency of each character and encodes it as a binary string in an optimal way.


Huffman Tree

Huffman coding has the following properties:

  • Codes for more probable characters are shorter than ones for less probable characters.
  • Each code can be uniquely decoded.

To accomplish this, Huffman coding creates what is called a "Huffman tree", which is a binary tree. In the example tree used here, the root's left child is an internal node whose children are the leaves 'a' (on the left) and 'b' (on the right), and the root's right child is an internal node whose left child is the leaf 'c' and whose right child is another internal node with the leaves 'd' (left) and 'e' (right).

How to read the Huffman tree?
To read the codes from a Huffman tree, 
  • start from the root 
  • add a '0' every time you go left to a child, 
  • add a '1' every time you go right.
So in this example, the code for the character 'b' is 01 and the code for 'd' is 110.
 
As you can see, 'a' has a shorter code than 'd'. Notice that since all the characters are at the leaves of the tree (the ends), no code is ever the prefix of another one (for example, you will never end up with a pair of codes like 01 and 011, where one is a prefix of the other). Hence, this prefix property ensures that each encoded string can be uniquely decoded.


The Algorithm: How to create a Huffman tree?
First, count the number of times each character appears and assign this count as a "weight" or "frequency" to a node for that character. Add all the nodes to a list.
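As a quick sketch of this counting step, here is how it might look in C. The countFrequencies function and the 256-entry table are just my illustrative choices (they are not part of the program shown later); the sample string has the same character counts as the example further down in this post.

#include <stdio.h>
#include <string.h>

// Count how many times each byte value occurs in the input text.
// These counts are the "weights" used to build the tree.
void countFrequencies(const char *text, int counts[256])
{
    memset(counts, 0, 256 * sizeof(int));
    for (size_t i = 0; text[i] != '\0'; ++i)
        counts[(unsigned char)text[i]]++;
}

int main(void)
{
    int counts[256];
    // Sample input with the same character counts as the example below:
    // 7 a's, 6 b's, 5 c's, 2 d's, 1 e
    countFrequencies("aaaaaaabbbbbbcccccdde", counts);

    for (int c = 0; c < 256; ++c)
        if (counts[c] > 0)
            printf("%c: %d\n", c, counts[c]);
    return 0;
}

Running it prints a: 7, b: 6, c: 5, d: 2, e: 1.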
Pseudocode 1: Using a min heap
Then, repeat these steps until there is only one node left:
  1. Find the two nodes with the lowest weights. (Needing the minimum weight again and again is the hint that a min heap is the right data structure here.)
  2. Create a new internal node as the parent of these two nodes, and give it a weight equal to the sum of their weights.
  3. Remove the two nodes from the list, and add the parent node.
  4. Repeat the above three steps until only one node remains; that node is the root.



Pseudocode 2: Using sorting
  1. Sort the list in ascending order of weight.
  2. Take the first two elements (the two with the lowest weights).
  3. Create a new internal node as the parent of these two nodes, and give it a weight equal to the sum of their weights.
  4. Remove the two nodes from the list, and add the parent node.
  5. Repeat the above four steps until only one node remains; that node is the root. (A rough C sketch of this approach follows below.)
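For completeness, here is a rough, self-contained C sketch of the sorting approach. The struct Node, makeNode, byFreq and buildBySorting names are mine, made up for this illustration (the full program later in the post uses its own struct MinHeapNode):

#include <stdlib.h>

// A minimal tree node for this sketch
struct Node
{
    char data;                  // character, or '$' for internal nodes
    int freq;                   // weight of this node
    struct Node *left, *right;
};

static struct Node* makeNode(char data, int freq, struct Node* l, struct Node* r)
{
    struct Node* n = malloc(sizeof(struct Node));
    n->data = data;
    n->freq = freq;
    n->left = l;
    n->right = r;
    return n;
}

// qsort comparator: ascending order of weight
static int byFreq(const void* a, const void* b)
{
    const struct Node* x = *(struct Node* const*)a;
    const struct Node* y = *(struct Node* const*)b;
    return x->freq - y->freq;
}

// Build the Huffman tree by repeatedly sorting the list of nodes and
// merging the two lightest ones, exactly as in Pseudocode 2.
struct Node* buildBySorting(char data[], int freq[], int n)
{
    struct Node** list = malloc(n * sizeof(struct Node*));
    for (int i = 0; i < n; ++i)
        list[i] = makeNode(data[i], freq[i], NULL, NULL);

    while (n > 1)
    {
        qsort(list, n, sizeof(struct Node*), byFreq);       // step 1: sort ascending
        struct Node* parent = makeNode('$',                 // steps 2-3: join the two
                        list[0]->freq + list[1]->freq,      // lightest under a parent
                        list[0], list[1]);
        list[0] = parent;                                   // step 4: the parent replaces
        list[1] = list[n - 1];                              // them in the list
        --n;
    }
    struct Node* root = list[0];
    free(list);
    return root;
}

Re-sorting the whole list on every pass makes this roughly O(n^2 log n) overall, which is why the min-heap version shown later is the usual choice; it performs exactly the same merges in O(n log n).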
Example:

This way, the nodes with the highest weights end up near the top of the tree and therefore get shorter codes. Try the algorithm on the following characters:

Character  Frequency
a 7
b 6
c 5
d 2
e 1

The output of this will be the tree from the earlier example.

Let's see how. We will use the sorting approach, as it is easier to visualize.

Iteration 1: Sorted in ascending order, the list is e(1), d(2), c(5), b(6), a(7). The two lightest nodes, e and d, are removed and joined under a new internal node with weight 1 + 2 = 3, which goes back into the list.

Iteration 2: The list is now 3, c(5), b(6), a(7). The two lightest, the 3-node and c, are joined under a new node with weight 8.

Iteration 3: The list is now b(6), a(7), 8. The two lightest, b and a, are joined under a new node with weight 13.

Iteration 4: Only the nodes with weights 8 and 13 remain. Joining them gives the root, with weight 21 (the total number of characters).

Reading the codes off the finished tree (0 for left, 1 for right) gives a = 00, b = 01, c = 10, d = 110 and e = 111, as in the tree described earlier. Which child ends up on the left and which on the right can vary, but the code lengths do not.
From this point, you simply look up the code for each character and send the encoded string along with the tree. So the string 'abc' will be encoded as 000110.
The decoder can then use the Huffman tree to decode the string by following the path dictated by each bit and emitting a character every time it reaches a leaf.
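To make the decoding step concrete, here is a small self-contained sketch. The decode function and the hand-built tree are my own illustration; the tree is wired up to match the example codes above (a = 00, b = 01, c = 10, d = 110, e = 111):

#include <stdio.h>
#include <stdlib.h>

// A tiny node type for this sketch; '\0' marks internal nodes.
struct Node
{
    char data;
    struct Node *left, *right;
};

static struct Node* node(char data, struct Node* l, struct Node* r)
{
    struct Node* n = malloc(sizeof(struct Node));
    n->data = data;
    n->left = l;
    n->right = r;
    return n;
}

// Walk the tree bit by bit: go left on '0', right on '1', and
// emit a character every time a leaf is reached.
static void decode(const struct Node* root, const char* bits)
{
    const struct Node* cur = root;
    for (; *bits != '\0'; ++bits)
    {
        cur = (*bits == '0') ? cur->left : cur->right;
        if (cur->left == NULL && cur->right == NULL)   // reached a leaf
        {
            putchar(cur->data);
            cur = root;                                // start over for the next code
        }
    }
    putchar('\n');
}

int main(void)
{
    // Hand-built copy of the example tree from this post:
    // a = 00, b = 01, c = 10, d = 110, e = 111
    struct Node* root =
        node('\0',
             node('\0', node('a', NULL, NULL), node('b', NULL, NULL)),
             node('\0', node('c', NULL, NULL),
                        node('\0', node('d', NULL, NULL), node('e', NULL, NULL))));

    decode(root, "000110");   // prints: abc
    return 0;
}

Here decode follows 0-0 to 'a', 0-1 to 'b' and 1-0 to 'c', so it prints abc.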
Program:
// The main function that builds Huffman tree
struct MinHeapNode* buildHuffmanTree(char data[], int freq[], int size)
{
    struct MinHeapNode *left, *right, *top;
 
    // Step 1: Create a min heap of capacity equal to size. Initially, the
    // heap holds one leaf node per input character (size nodes).
    struct MinHeap* minHeap = createAndBuildMinHeap(data, freq, size);
 
    // Iterate while size of heap doesn't become 1
    while (!isSizeOne(minHeap))
    {
        // Step 2: Extract the two minimum freq items from min heap
        left = extractMin(minHeap);
        right = extractMin(minHeap);
 
        // Step 3: Create a new internal node with frequency equal to the
        // sum of the two nodes' frequencies. Make the two extracted nodes the
        // left and right children of this new node, and add it to the min heap.
        // '$' is a special placeholder value for internal nodes; it is never printed.
        top = newNode('$', left->freq + right->freq);
        top->left = left;
        top->right = right;
        insertMinHeap(minHeap, top);
    }
 
    // Step 4: The remaining node is the root node and the tree is complete.
    return extractMin(minHeap);
}
 
// Prints Huffman codes from the root of the Huffman tree. It uses arr[] to
// store the code bits along the current path.
void printCodes(struct MinHeapNode* root, int arr[], int top)
{
    // Assign 0 to left edge and recur
    if (root->left)
    {
        arr[top] = 0;
        printCodes(root->left, arr, top + 1);
    }
 
    // Assign 1 to right edge and recur
    if (root->right)
    {
        arr[top] = 1;
        printCodes(root->right, arr, top + 1);
    }
 
    // If this is a leaf node, then it contains one of the input
    // characters, print the character and its code from arr[]
    if (isLeaf(root))
    {
        printf("%c: ", root->data);
        printArr(arr, top);
    }
}
 
// The main function that builds a Huffman tree and prints codes by traversing
// the built Huffman tree
void HuffmanCodes(char data[], int freq[], int size)
{
   //  Construct Huffman Tree
   struct MinHeapNode* root = buildHuffmanTree(data, freq, size);
 
   // Print Huffman codes using the Huffman tree built above
   int arr[MAX_TREE_HT], top = 0;
   printCodes(root, arr, top);
}

To call this program:
// Driver program to test above functions
int main()
{
    char arr[] = {'a', 'b', 'c', 'd', 'e'};
    int freq[] = {7, 6, 5, 2, 1};
    int size = sizeof(arr)/sizeof(arr[0]);
    HuffmanCodes(arr, freq, size);
    return 0;
}
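
The listing above also relies on a handful of supporting definitions that are not shown: MAX_TREE_HT, the struct MinHeapNode and struct MinHeap types, and the heap helpers newNode, createAndBuildMinHeap, extractMin, insertMinHeap, isSizeOne, isLeaf and printArr. The sketch below is one way to fill them in so that the program compiles (these would sit above buildHuffmanTree); the exact field names and heap routines are my assumptions and may differ from the original source.

#include <stdio.h>
#include <stdlib.h>

#define MAX_TREE_HT 100

// A node of the Huffman tree (also the element type stored in the heap)
struct MinHeapNode
{
    char data;                          // input character ('$' for internal nodes)
    unsigned freq;                      // weight of this node
    struct MinHeapNode *left, *right;   // children in the Huffman tree
};

// A min heap of node pointers, ordered by freq
struct MinHeap
{
    unsigned size;                      // nodes currently in the heap
    unsigned capacity;                  // maximum number of nodes
    struct MinHeapNode** array;
};

// Allocate a new node with no children
struct MinHeapNode* newNode(char data, unsigned freq)
{
    struct MinHeapNode* node = malloc(sizeof(struct MinHeapNode));
    node->left = node->right = NULL;
    node->data = data;
    node->freq = freq;
    return node;
}

// Sift down from index idx to restore the min-heap property
void minHeapify(struct MinHeap* minHeap, unsigned idx)
{
    unsigned smallest = idx, l = 2 * idx + 1, r = 2 * idx + 2;
    if (l < minHeap->size && minHeap->array[l]->freq < minHeap->array[smallest]->freq)
        smallest = l;
    if (r < minHeap->size && minHeap->array[r]->freq < minHeap->array[smallest]->freq)
        smallest = r;
    if (smallest != idx)
    {
        struct MinHeapNode* tmp = minHeap->array[smallest];
        minHeap->array[smallest] = minHeap->array[idx];
        minHeap->array[idx] = tmp;
        minHeapify(minHeap, smallest);
    }
}

int isSizeOne(struct MinHeap* minHeap) { return minHeap->size == 1; }

int isLeaf(struct MinHeapNode* root) { return !root->left && !root->right; }

// Remove and return the node with the smallest frequency
struct MinHeapNode* extractMin(struct MinHeap* minHeap)
{
    struct MinHeapNode* min = minHeap->array[0];
    minHeap->array[0] = minHeap->array[minHeap->size - 1];
    --minHeap->size;
    minHeapify(minHeap, 0);
    return min;
}

// Insert a node, sifting it up to its proper position
void insertMinHeap(struct MinHeap* minHeap, struct MinHeapNode* node)
{
    unsigned i = minHeap->size++;
    while (i && node->freq < minHeap->array[(i - 1) / 2]->freq)
    {
        minHeap->array[i] = minHeap->array[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    minHeap->array[i] = node;
}

// Create a heap containing one leaf node per input character
struct MinHeap* createAndBuildMinHeap(char data[], int freq[], int size)
{
    struct MinHeap* minHeap = malloc(sizeof(struct MinHeap));
    minHeap->size = size;
    minHeap->capacity = size;
    minHeap->array = malloc(size * sizeof(struct MinHeapNode*));
    for (int i = 0; i < size; ++i)
        minHeap->array[i] = newNode(data[i], freq[i]);
    for (int i = size / 2 - 1; i >= 0; --i)
        minHeapify(minHeap, i);
    return minHeap;
}

// Print the code bits stored in arr[0..n-1]
void printArr(int arr[], int n)
{
    for (int i = 0; i < n; ++i)
        printf("%d", arr[i]);
    printf("\n");
}

With the main() above, this heap-based build performs the same merges as the sorting walkthrough (1 + 2, then 3 + 5, then 6 + 7, then 8 + 13), so the printed code lengths match the earlier example, though the exact 0/1 patterns can differ depending on which child ends up on the left.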


Thanks.
