Tuesday, January 28, 2014

Find the k most frequent words from a file

This updated information has been further expanded upon on my new website. You can find the updated details here: https://k5kc.com/cs/algorithms/top-k-frequent-words-problem/.
Case 1 - Consider the file is not big
The question can be described as: Input: A positive integer K and a text file which can stay in-memory. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence. Output: The most frequent K words in the text.
My thinking is like this.
1) use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
2) sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.
3) After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n+n*lg(n)+K), Since K is surely smaller than N, so it is actually O(n*lg(n)).
We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be
2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;
3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).
To summarize, this solution cost time O(n+k*lg(n)).

Case 2 - When the file size is too big to fit in-memory
Now this can be solved by 2 ways
Solution 1 - Caching
Break the file in to n components, so that they can fit in memory. Now read the first component, and keep it in cache. Start reading the next component, and add the component to the cache where words occur more frequently, for rest of the words there will be cache-miss.

Solution 2 - Map reduce kind of way
Breaking the file into k , create a word to frequency mapping. Now merge the 2 files, and create a max heap, re-heapify as we merge next file with it.




0 comments:

Post a Comment