Background
When we encode characters in computers, we assign each an 8-bit code based on an ASCII chart. But in most files, some characters appear more often than others. So wouldn't it make more sense to assign shorter codes for characters that appear more often and longer codes for characters that appear less often?
This is exactly what Claude Shannon and R.M. Fano were thinking when they created the first compression algorithm in the 1950's. However, D.A. Huffman published a paper in 1952 that improved the algorithm slightly and it soon superceded Shannon-Fano coding with the appropriately named Huffman coding.
Huffman code is a technique for compressing data. Huffman's greedy algorithm look at the occurrence of each character and it as a binary string in an optimal way.
Huffman Tree
Huffman coding has the following properties:
To accomplish this, Huffman coding creates what is called a "Huffman tree", which is a binary tree such as this one:
How to read the huffman tree?
To read the codes from a Huffman tree,
As you can see, 'a' has a shorter code than 'd'. Notice that since all the characters are at the leafs of the tree (the ends), there is never a chance that one code will be the prefix of another one (eg. 'a' is 01 and 'b' is 011). Hence, this unique prefix property assures that each code can be uniquely decoded.
The Algorithm:How to create Huffman tree?
First count the amount of times each character appears, and assign this as a "weight" or "frequency" to each character, or node. Add all the nodes to a LIST.
Pseudocode 1: Using min heap
Then, repeat these steps until there is only one node left:
Pseudocode 2: Use sorting
This way, the nodes with the highest weight will be near the top of the tree, and have shorter codes. Try the algorithm on the following characters:
The output of this will be tree:
Lets see how. We will use the sorting approach, as it is easy to visualize.
Iteration 1:
Our array looks like this:
From this point, you simply get the codes for each character and send the encoded string along with the tree. So the string 'abc' will be 000110.
The decoder then can use the Huffman tree to decode the string by following the paths according to the string and adding a character every time it comes to one.
To call this program:
Thanks.
When we encode characters in computers, we assign each an 8-bit code based on an ASCII chart. But in most files, some characters appear more often than others. So wouldn't it make more sense to assign shorter codes for characters that appear more often and longer codes for characters that appear less often?
This is exactly what Claude Shannon and R.M. Fano were thinking when they created the first compression algorithm in the 1950's. However, D.A. Huffman published a paper in 1952 that improved the algorithm slightly and it soon superceded Shannon-Fano coding with the appropriately named Huffman coding.
Huffman Tree
Huffman coding has the following properties:
- Codes for more probable characters are shorter than ones for less probable characters.
- Each code can be uniquely decoded
To read the codes from a Huffman tree,
- start from the root
- add a '0' every time you go left to a child,
- add a '1' every time you go right.
As you can see, 'a' has a shorter code than 'd'. Notice that since all the characters are at the leafs of the tree (the ends), there is never a chance that one code will be the prefix of another one (eg. 'a' is 01 and 'b' is 011). Hence, this unique prefix property assures that each code can be uniquely decoded.
The Algorithm:How to create Huffman tree?
First count the amount of times each character appears, and assign this as a "weight" or "frequency" to each character, or node. Add all the nodes to a LIST.
Pseudocode 1: Using min heap
Then, repeat these steps until there is only one node left:
- Find the two nodes with the lowest weights. (this lowest or minimum weight should ring your ears for using minHeap)
- Create a parent node or new joint internal node for these two nodes. Give this parent node a weight of the sum of the two nodes.
- Remove the two nodes from the list, and add the parent node.
- Repeat the above 3 steps
- Sort the list in ascending order
- Get the first 2 elements
- Create a parent node or new joint internal node for these two nodes. Give this parent node a weight of the sum of the two nodes.
- Remove the two nodes from the list, and add the parent node.
- Repeat the above 4 steps
This way, the nodes with the highest weight will be near the top of the tree, and have shorter codes. Try the algorithm on the following characters:
Character | Frequency |
a | 7 |
b | 6 |
c | 5 |
d | 2 |
e | 1 |
The output of this will be tree:
Lets see how. We will use the sorting approach, as it is easy to visualize.
Iteration 1:
Our array looks like this:
From this point, you simply get the codes for each character and send the encoded string along with the tree. So the string 'abc' will be 000110.
The decoder then can use the Huffman tree to decode the string by following the paths according to the string and adding a character every time it comes to one.
Pro
Program:// The main function that builds Huffman tree struct MinHeapNode* buildHuffmanTree(char data[], int freq[], int size) { struct MinHeapNode *left, *right, *top; // Step 1: Create a min heap of capacity equal to size. Initially, there are // modes equal to size. struct MinHeap* minHeap = createAndBuildMinHeap(data, freq, size); // Iterate while size of heap doesn't become 1 while (!isSizeOne(minHeap)) { // Step 2: Extract the two minimum freq items from min heap left = extractMin(minHeap); right = extractMin(minHeap); // Step 3: Create a new internal node with frequency equal to the // sum of the two nodes frequencies. Make the two extracted node as // left and right children of this new node. Add this node to the min heap // '$' is a special value for internal nodes, not used top = newNode('$', left->freq + right->freq); top->left = left; top->right = right; insertMinHeap(minHeap, top); } // Step 4: The remaining node is the root node and the tree is complete. return extractMin(minHeap); } // Prints huffman codes from the root of Huffman Tree. It uses arr[] to // store codes void printCodes(struct MinHeapNode* root, int arr[], int top) { // Assign 0 to left edge and recur if (root->left) { arr[top] = 0; printCodes(root->left, arr, top + 1); } // Assign 1 to right edge and recur if (root->right) { arr[top] = 1; printCodes(root->right, arr, top + 1); } // If this is a leaf node, then it contains one of the input // characters, print the character and its code from arr[] if (isLeaf(root)) { printf("%c: ", root->data); printArr(arr, top); } } // The main function that builds a Huffman Tree and print codes by traversing // the built Huffman Tree void HuffmanCodes(char data[], int freq[], int size) { // Construct Huffman Tree struct MinHeapNode* root = buildHuffmanTree(data, freq, size); // Print Huffman codes using the Huffman tree built above int arr[MAX_TREE_HT], top = 0; printCodes(root, arr, top); }
To call this program:
// Driver program to test above functions int main() { char arr[] = {'a', 'b', 'c', 'd', 'e',}; int freq[] = {7, 6, 5, 2, 1}; int size = sizeof(arr)/sizeof(arr[0]); HuffmanCodes(arr, freq, size); return 0; }
Thanks.
0 comments:
Post a Comment