**Encoding**

Procedure for implementing the algorithm:

1. Select a block size to be used. Block size is directly related to effectiveness of encoding and inversely related to the time required. Hence a compromise has to be reached.

2. Convert the data byte stream to blocks of n bytes where n is the block size chosen.

3. The following example illustrates the procedure to be done next.

Let n=6 and the first string be “kerala”. By wrap around rotations of the string by one character each the following strings can be derived. i.e., first the string is rotated one character to the right and the result taken and then one more character and so on until when one more rotation results in recovery of the original string.

kerala |

eralak |

ralake |

alaker |

lakera |

akeral |

akeral |

alaker |

eralak |

kerala |

lakera |

ralake |

4. Do this on each block of data until the data is exhausted. It is advisable to write the encoded string followed by the primary index on to the output file. Here we have dealt with data as string but it is not possible in the actual implementation as the data stream will have in it non-alphabetic characters also. But the adapatation to be done is very trivial.

The strings should not be kept in memory unless we have a hell lot of memory. Just keep an array of indices each element of which points to a position in the block and sort them. So the first element should point to the first data string and so on.

**Decoding**

The data that we now have with us is the encoded version and the primary index. So we now have “lrkaae” and the primary index (2) with us. Now we have to prepare a vector of n elements and an array to hold the sorted version of the encoded data. Sort the encoded data encoded_data and store it in sorted_data.

So now, sorted_data=”aaeklr” and encoded_data=”lrkaae”.

No we move to prepare the vector. We give the pseudo code below.

for(int i=0;i

{

for(int j=0;j

{

if(encoded_data[j]==sorted_data[i]&& encoded_data[j] is not flagged)

{

vector[i]=j;

flag encoded_data[j];

break;

}

}

}

So here the vector is

`{3,4,5,2,0,1}`

. Now the following simple code from Mark Nelson’s article on BWT in the DDJ does the job.`index=primary_index;`

for(int i=0;i

{

output(encoded_data[index]);

index=vector[index];

}

Thus we get back “kerala”, the original string.

Now the question is, is “lrkaae” more suitable for compression compared to the original string “kerala”. Here in the encoded string both a’s have come together. When the block size chosen is large it causes more such occurrences. MTF (Move to front) encoder converts this data to one having a high frequency of characters having ASCII values less near 0. Encoding such data by an entropy encoder results in very good performance. Huffman algorithm may be used as the entropy encoder.

Sincere thanks to Mark Nelson’s article on Data compression by the Burrows Wheeler Transform from which I came to know of this fantastic algorithm.

Now the performance table using BeWT and HuffPack compared with WinZip, the famed archiver:

File | Full Size | BeWT+HuffPack | HuffPack | WinZip |

a.htm | 221514 | 061743 | 145632 | 043472 |

t.pdf | 056553 | 051808 | 055969 | 045332 |

l.exe | 040960 | 013159 | 020349 | 009680 |

notepad.exe | 053248 | 023033 | 034176 | 017769 |

## 0 comments:

## Post a Comment