Thursday, January 5, 2012

Huffman Codes

Background

When we encode characters in computers, we assign each an 8-bit code based on an ASCII chart. But in most files, some characters appear more often than others. So wouldn't it make more sense to assign shorter codes for characters that appear more often and longer codes for characters that appear less often?

This is exactly what Claude Shannon and R.M. Fano were thinking when they created one of the first compression algorithms in the late 1940s. In 1952, however, D.A. Huffman published a paper with a slightly improved algorithm, and the appropriately named Huffman coding soon superseded Shannon-Fano coding.

Huffman coding is a technique for compressing data. Huffman's greedy algorithm looks at the frequency of each character and encodes it as a binary string in an optimal way.


Huffman Tree

Huffman coding has the following properties:

  • Codes for more probable characters are shorter than ones for less probable characters.
  • Each code can be uniquely decoded.

To accomplish this, Huffman coding creates what is called a "Huffman tree", which is a binary tree. In the example tree used here, the root's left child is an internal node whose children are the leaves 'a' (on the left) and 'b' (on the right), and the root's right child is an internal node whose left child is the leaf 'c' and whose right child is another internal node with the leaves 'd' (left) and 'e' (right).

How to read the Huffman tree?
To read the codes from a Huffman tree, 
  • start from the root 
  • add a '0' every time you go left to a child, 
  • add a '1' every time you go right.
So in this example, the code for the character 'b' is 01 and the code for 'd' is 110.
 
As you can see, 'a' has a shorter code than 'd'. Notice that since all the characters are at the leaves of the tree (the ends), no code is ever the prefix of another one (for example, you will never end up with a pair of codes like 01 and 011, where one is a prefix of the other). Hence, this prefix property ensures that each encoded string can be uniquely decoded.


The Algorithm: How to create a Huffman tree?
First, count the number of times each character appears and assign this count as a "weight" or "frequency" to a node for that character. Add all the nodes to a list.
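As a quick sketch of this counting step, here is how it might look in C. The countFrequencies function and the 256-entry table are just my illustrative choices (they are not part of the program shown later); the sample string has the same character counts as the example further down in this post.

#include <stdio.h>
#include <string.h>

// Count how many times each byte value occurs in the input text.
// These counts are the "weights" used to build the tree.
void countFrequencies(const char *text, int counts[256])
{
    memset(counts, 0, 256 * sizeof(int));
    for (size_t i = 0; text[i] != '\0'; ++i)
        counts[(unsigned char)text[i]]++;
}

int main(void)
{
    int counts[256];
    // Sample input with the same character counts as the example below:
    // 7 a's, 6 b's, 5 c's, 2 d's, 1 e
    countFrequencies("aaaaaaabbbbbbcccccdde", counts);

    for (int c = 0; c < 256; ++c)
        if (counts[c] > 0)
            printf("%c: %d\n", c, counts[c]);
    return 0;
}

Running it prints a: 7, b: 6, c: 5, d: 2, e: 1.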
Pseudocode 1: Using a min heap
Then, repeat these steps until there is only one node left:
  1. Find the two nodes with the lowest weights. (Needing the minimum weight again and again is the hint that a min heap is the right data structure here.)
  2. Create a new internal node as the parent of these two nodes, and give it a weight equal to the sum of their weights.
  3. Remove the two nodes from the list, and add the parent node.
  4. Repeat the above three steps until only one node remains; that node is the root.



Pseudocode 2: Using sorting
  1. Sort the list in ascending order of weight.
  2. Take the first two elements (the two with the lowest weights).
  3. Create a new internal node as the parent of these two nodes, and give it a weight equal to the sum of their weights.
  4. Remove the two nodes from the list, and add the parent node.
  5. Repeat the above four steps until only one node remains; that node is the root. (A rough C sketch of this approach follows below.)
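For completeness, here is a rough, self-contained C sketch of the sorting approach. The struct Node, makeNode, byFreq and buildBySorting names are mine, made up for this illustration (the full program later in the post uses its own struct MinHeapNode):

#include <stdlib.h>

// A minimal tree node for this sketch
struct Node
{
    char data;                  // character, or '$' for internal nodes
    int freq;                   // weight of this node
    struct Node *left, *right;
};

static struct Node* makeNode(char data, int freq, struct Node* l, struct Node* r)
{
    struct Node* n = malloc(sizeof(struct Node));
    n->data = data;
    n->freq = freq;
    n->left = l;
    n->right = r;
    return n;
}

// qsort comparator: ascending order of weight
static int byFreq(const void* a, const void* b)
{
    const struct Node* x = *(struct Node* const*)a;
    const struct Node* y = *(struct Node* const*)b;
    return x->freq - y->freq;
}

// Build the Huffman tree by repeatedly sorting the list of nodes and
// merging the two lightest ones, exactly as in Pseudocode 2.
struct Node* buildBySorting(char data[], int freq[], int n)
{
    struct Node** list = malloc(n * sizeof(struct Node*));
    for (int i = 0; i < n; ++i)
        list[i] = makeNode(data[i], freq[i], NULL, NULL);

    while (n > 1)
    {
        qsort(list, n, sizeof(struct Node*), byFreq);       // step 1: sort ascending
        struct Node* parent = makeNode('$',                 // steps 2-3: join the two
                        list[0]->freq + list[1]->freq,      // lightest under a parent
                        list[0], list[1]);
        list[0] = parent;                                   // step 4: the parent replaces
        list[1] = list[n - 1];                              // them in the list
        --n;
    }
    struct Node* root = list[0];
    free(list);
    return root;
}

Re-sorting the whole list on every pass makes this roughly O(n^2 log n) overall, which is why the min-heap version shown later is the usual choice; it performs exactly the same merges in O(n log n).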
Example:

This way, the nodes with the highest weights end up near the top of the tree and therefore get shorter codes. Try the algorithm on the following characters:

Character  Frequency
a 7
b 6
c 5
d 2
e 1

The output of this will be the tree from the earlier example.

Let's see how. We will use the sorting approach, as it is easier to visualize.

Iteration 1: Sorted in ascending order, the list is e(1), d(2), c(5), b(6), a(7). The two lightest nodes, e and d, are removed and joined under a new internal node with weight 1 + 2 = 3, which goes back into the list.

Iteration 2: The list is now 3, c(5), b(6), a(7). The two lightest, the 3-node and c, are joined under a new node with weight 8.

Iteration 3: The list is now b(6), a(7), 8. The two lightest, b and a, are joined under a new node with weight 13.

Iteration 4: Only the nodes with weights 8 and 13 remain. Joining them gives the root, with weight 21 (the total number of characters).

Reading the codes off the finished tree (0 for left, 1 for right) gives a = 00, b = 01, c = 10, d = 110 and e = 111, as in the tree described earlier. Which child ends up on the left and which on the right can vary, but the code lengths do not.
From this point, you simply look up the code for each character and send the encoded string along with the tree. So the string 'abc' will be encoded as 000110.
The decoder can then use the Huffman tree to decode the string by following the path dictated by each bit and emitting a character every time it reaches a leaf.
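To make the decoding step concrete, here is a small self-contained sketch. The decode function and the hand-built tree are my own illustration; the tree is wired up to match the example codes above (a = 00, b = 01, c = 10, d = 110, e = 111):

#include <stdio.h>
#include <stdlib.h>

// A tiny node type for this sketch; '\0' marks internal nodes.
struct Node
{
    char data;
    struct Node *left, *right;
};

static struct Node* node(char data, struct Node* l, struct Node* r)
{
    struct Node* n = malloc(sizeof(struct Node));
    n->data = data;
    n->left = l;
    n->right = r;
    return n;
}

// Walk the tree bit by bit: go left on '0', right on '1', and
// emit a character every time a leaf is reached.
static void decode(const struct Node* root, const char* bits)
{
    const struct Node* cur = root;
    for (; *bits != '\0'; ++bits)
    {
        cur = (*bits == '0') ? cur->left : cur->right;
        if (cur->left == NULL && cur->right == NULL)   // reached a leaf
        {
            putchar(cur->data);
            cur = root;                                // start over for the next code
        }
    }
    putchar('\n');
}

int main(void)
{
    // Hand-built copy of the example tree from this post:
    // a = 00, b = 01, c = 10, d = 110, e = 111
    struct Node* root =
        node('\0',
             node('\0', node('a', NULL, NULL), node('b', NULL, NULL)),
             node('\0', node('c', NULL, NULL),
                        node('\0', node('d', NULL, NULL), node('e', NULL, NULL))));

    decode(root, "000110");   // prints: abc
    return 0;
}

Here decode follows 0-0 to 'a', 0-1 to 'b' and 1-0 to 'c', so it prints abc.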
Program:
// The main function that builds Huffman tree
struct MinHeapNode* buildHuffmanTree(char data[], int freq[], int size)
{
    struct MinHeapNode *left, *right, *top;
 
    // Step 1: Create a min heap of capacity equal to size. Initially, the
    // heap holds one leaf node per input character (size nodes).
    struct MinHeap* minHeap = createAndBuildMinHeap(data, freq, size);
 
    // Iterate while size of heap doesn't become 1
    while (!isSizeOne(minHeap))
    {
        // Step 2: Extract the two minimum freq items from min heap
        left = extractMin(minHeap);
        right = extractMin(minHeap);
 
        // Step 3: Create a new internal node with frequency equal to the
        // sum of the two nodes' frequencies. Make the two extracted nodes the
        // left and right children of this new node, and add it to the min heap.
        // '$' is a special placeholder value for internal nodes; it is never printed.
        top = newNode('$', left->freq + right->freq);
        top->left = left;
        top->right = right;
        insertMinHeap(minHeap, top);
    }
 
    // Step 4: The remaining node is the root node and the tree is complete.
    return extractMin(minHeap);
}
 
// Prints Huffman codes from the root of the Huffman tree. It uses arr[] to
// store the code bits along the current path.
void printCodes(struct MinHeapNode* root, int arr[], int top)
{
    // Assign 0 to left edge and recur
    if (root->left)
    {
        arr[top] = 0;
        printCodes(root->left, arr, top + 1);
    }
 
    // Assign 1 to right edge and recur
    if (root->right)
    {
        arr[top] = 1;
        printCodes(root->right, arr, top + 1);
    }
 
    // If this is a leaf node, then it contains one of the input
    // characters, print the character and its code from arr[]
    if (isLeaf(root))
    {
        printf("%c: ", root->data);
        printArr(arr, top);
    }
}
 
// The main function that builds a Huffman tree and prints codes by traversing
// the built Huffman tree
void HuffmanCodes(char data[], int freq[], int size)
{
   //  Construct Huffman Tree
   struct MinHeapNode* root = buildHuffmanTree(data, freq, size);
 
   // Print Huffman codes using the Huffman tree built above
   int arr[MAX_TREE_HT], top = 0;
   printCodes(root, arr, top);
}

To call this program:
// Driver program to test above functions
int main()
{
    char arr[] = {'a', 'b', 'c', 'd', 'e'};
    int freq[] = {7, 6, 5, 2, 1};
    int size = sizeof(arr)/sizeof(arr[0]);
    HuffmanCodes(arr, freq, size);
    return 0;
}
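
The listing above also relies on a handful of supporting definitions that are not shown: MAX_TREE_HT, the struct MinHeapNode and struct MinHeap types, and the heap helpers newNode, createAndBuildMinHeap, extractMin, insertMinHeap, isSizeOne, isLeaf and printArr. The sketch below is one way to fill them in so that the program compiles (these would sit above buildHuffmanTree); the exact field names and heap routines are my assumptions and may differ from the original source.

#include <stdio.h>
#include <stdlib.h>

#define MAX_TREE_HT 100

// A node of the Huffman tree (also the element type stored in the heap)
struct MinHeapNode
{
    char data;                          // input character ('$' for internal nodes)
    unsigned freq;                      // weight of this node
    struct MinHeapNode *left, *right;   // children in the Huffman tree
};

// A min heap of node pointers, ordered by freq
struct MinHeap
{
    unsigned size;                      // nodes currently in the heap
    unsigned capacity;                  // maximum number of nodes
    struct MinHeapNode** array;
};

// Allocate a new node with no children
struct MinHeapNode* newNode(char data, unsigned freq)
{
    struct MinHeapNode* node = malloc(sizeof(struct MinHeapNode));
    node->left = node->right = NULL;
    node->data = data;
    node->freq = freq;
    return node;
}

// Sift down from index idx to restore the min-heap property
void minHeapify(struct MinHeap* minHeap, unsigned idx)
{
    unsigned smallest = idx, l = 2 * idx + 1, r = 2 * idx + 2;
    if (l < minHeap->size && minHeap->array[l]->freq < minHeap->array[smallest]->freq)
        smallest = l;
    if (r < minHeap->size && minHeap->array[r]->freq < minHeap->array[smallest]->freq)
        smallest = r;
    if (smallest != idx)
    {
        struct MinHeapNode* tmp = minHeap->array[smallest];
        minHeap->array[smallest] = minHeap->array[idx];
        minHeap->array[idx] = tmp;
        minHeapify(minHeap, smallest);
    }
}

int isSizeOne(struct MinHeap* minHeap) { return minHeap->size == 1; }

int isLeaf(struct MinHeapNode* root) { return !root->left && !root->right; }

// Remove and return the node with the smallest frequency
struct MinHeapNode* extractMin(struct MinHeap* minHeap)
{
    struct MinHeapNode* min = minHeap->array[0];
    minHeap->array[0] = minHeap->array[minHeap->size - 1];
    --minHeap->size;
    minHeapify(minHeap, 0);
    return min;
}

// Insert a node, sifting it up to its proper position
void insertMinHeap(struct MinHeap* minHeap, struct MinHeapNode* node)
{
    unsigned i = minHeap->size++;
    while (i && node->freq < minHeap->array[(i - 1) / 2]->freq)
    {
        minHeap->array[i] = minHeap->array[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    minHeap->array[i] = node;
}

// Create a heap containing one leaf node per input character
struct MinHeap* createAndBuildMinHeap(char data[], int freq[], int size)
{
    struct MinHeap* minHeap = malloc(sizeof(struct MinHeap));
    minHeap->size = size;
    minHeap->capacity = size;
    minHeap->array = malloc(size * sizeof(struct MinHeapNode*));
    for (int i = 0; i < size; ++i)
        minHeap->array[i] = newNode(data[i], freq[i]);
    for (int i = size / 2 - 1; i >= 0; --i)
        minHeapify(minHeap, i);
    return minHeap;
}

// Print the code bits stored in arr[0..n-1]
void printArr(int arr[], int n)
{
    for (int i = 0; i < n; ++i)
        printf("%d", arr[i]);
    printf("\n");
}

With the main() above, this heap-based build performs the same merges as the sorting walkthrough (1 + 2, then 3 + 5, then 6 + 7, then 8 + 13), so the printed code lengths match the earlier example, though the exact 0/1 patterns can differ depending on which child ends up on the left.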


Thanks.
