Thursday, February 17, 2011

Anagram Trees

When it comes to finding anagrams of words, a frequent approach is to use an anagram dictionary - simply put, sort the letters in your word to produce a unique key that all anagrams of that word share. Another approach is to generate a letter-frequency histogram for your word. (The two approaches are more or less equivalent, in fact.) Either one makes the problem of finding exact single-word anagrams for strings very efficient - O(1) if you use a hashtable.
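
For concreteness, here's a minimal sketch of the anagram-dictionary approach (the function names are invented for illustration):

    from collections import defaultdict

    def build_anagram_dict(lexicon):
        # Key each word by its sorted letters; all anagrams share one key.
        index = defaultdict(list)
        for word in lexicon:
            index[''.join(sorted(word))].append(word)
        return index

    def exact_anagrams(index, word):
        # Expected O(1) hashtable lookup.
        return index.get(''.join(sorted(word)), [])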

However, the problem of finding subset anagrams - words that use a subset of the letters in a string - is still rather inefficient, requiring either a brute-force O(n) scan through the dictionary, or a lookup of every subset of the sorted input string's letters, which is O(2^l) in the number of letters in the input string. Finding subset anagrams is significantly more interesting, too, as it has applications in finding multi-word anagrams, as well as being applicable to problem domains such as Scrabble.
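
The brute-force baseline is easy to state precisely; here's a minimal sketch, using histogram subtraction for the subset test:

    from collections import Counter

    def is_subset_anagram(word, letters):
        # True if word can be spelled from a subset of letters:
        # Counter subtraction drops non-positive counts, so an empty
        # result means every letter of word is covered.
        return not (Counter(word) - Counter(letters))

    def brute_force_subset_anagrams(lexicon, letters):
        # O(n) in the size of the lexicon.
        return [w for w in lexicon if is_subset_anagram(w, letters)]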

However, with a little more effort, and the above observation that a letter-frequency histogram uniquely represents a given multiset of letters, we can build a tree structure that makes looking up subset anagrams much more efficient. To build the tree, we follow this simple procedure:

Assume we have the following information:
  • A lexicon or dictionary of words to populate the tree with
  • An alphabet for words in the lexicon
  • The tree we are building
  • A current node
For each term in the lexicon:
  1. Generate a letter-frequency histogram for the term.
  2. Set the current node to the root of the tree.
  3. For each symbol in the alphabet:
    1. Get the frequency of the current symbol in the current term; call it f.
    2. Set the current node to the f-th child node of the current node, creating it if it doesn't exist.
  4. Append the current term to the list of words on the current (leaf) node.

The result of following this simple procedure is a fixed-height tree, 27 nodes deep, with all the words in the leaf nodes, and each internal tier of the tree corresponding to a symbol from the alphabet. A minimal sketch of the construction follows.


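Here's that sketch in Python, assuming a lowercase a-z alphabet; the Node class and function names are invented for illustration:

    ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

    class Node(object):
        # Children are indexed by letter frequency; words are stored
        # only at the leaves.
        def __init__(self):
            self.children = {}  # frequency -> child Node
            self.words = []

    def build_tree(lexicon):
        root = Node()
        for term in lexicon:
            node = root
            for symbol in ALPHABET:
                f = term.count(symbol)  # this tier's letter frequency
                node = node.children.setdefault(f, Node())
            node.words.append(term)
        return root
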
Once the tree is built, we find subset anagrams for an input string as follows (a sketch in code follows the list):

Assume we have the following information:
  • The tree we built using the above procedure.
  • The alphabet we used above.
  • A frontier set, initially empty.
  1. Initialize the frontier set to contain the root of the tree.
  2. Generate a letter-frequency histogram for the input string.
  3. For each symbol in the alphabet:
    1. Get the frequency of the current symbol in the input string. Call it f.
    2. For each node in the current frontier set, add its child nodes numbered 0 through f to a new frontier set, then replace the frontier set with the new one.
  4. The frontier set now consists of leaf nodes, containing all the subset anagrams of the input string.
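
A sketch of this lookup, using the same hypothetical Node class and ALPHABET as above:

    def subset_anagrams(root, letters):
        # Find every word spellable from a subset of the given letters.
        frontier = [root]
        for symbol in ALPHABET:
            f = letters.count(symbol)
            # Keep each child reachable with frequency 0 through f.
            frontier = [child
                        for node in frontier
                        for count, child in node.children.items()
                        if count <= f]
        # The frontier now holds leaf nodes; gather their word lists.
        return [word for leaf in frontier for word in leaf.words]

For example, assuming a lowercased, a-z-only word list:

    tree = build_tree(line.strip() for line in open('/usr/share/dict/words'))
    print(subset_anagrams(tree, 'elvis'))  # e.g. 'evil', 'isle', 'lives', ...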

Runtime analysis of this algorithm is rather difficult - for me, at least. Intuitively and in practice, it's a lot faster than either of the brute-force approaches, but quantifying that in big-O notation is something that's escaped me. As an upper bound, it can be no worse than a constant factor slower than the brute-force O(n) scan. As a lower bound, in a lookup where the frontier set always contains exactly one node, lookup time is proportional to the size of the alphabet, which is constant - O(1). The average case depends on how large a subset of the dictionary the input string selects; quantified by the size of the output, approximately O(m) operations are required, where m is the number of results. If anyone knows how to determine more solid bounds for the runtime, please do let me know in the comments.

One disadvantage of this approach is its substantial memory overhead. Using my Python implementation of the algorithm and importing /usr/share/dict/words, which is approximately 2MB on this machine, results in over 300MB of memory consumed. Using the pickle module to serialize the tree to disk, the output file is over 30MB, and it compresses with gzip down to about 7MB. I suspect a large part of the overhead is due to the minimum size of Python's dictionaries; I will modify the implementation to use lists and update this post if I can make it more efficient.
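
One way the list-based variant might look - a sketch of the idea only, not the updated implementation:

    class ListNode(object):
        # __slots__ avoids a per-instance __dict__, saving memory.
        __slots__ = ('children', 'words')

        def __init__(self):
            self.children = []  # index = letter frequency; None if absent
            self.words = []

        def child(self, f):
            # Return the f-th child, padding and creating on demand.
            if len(self.children) <= f:
                self.children.extend([None] * (f + 1 - len(self.children)))
            if self.children[f] is None:
                self.children[f] = ListNode()
            return self.children[f]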

Here are a few stats on the generated tree that may be of interest:
Total words: 234,936
Leaf nodes: 215,366
Internal nodes: 1,874,748

From this we can see that the average branching factor of internal nodes is very low - not much more than one. A breakdown of the number of nodes in each tier helps clarify this:
Tier    Number of nodes
0       1
1       7
2       25
3       85
4       203
5       707
6       1145
7       1886
8       3479
9       8156
10      8853
11      10835
12      19632
13      28470
14      47635
15      73424
16      92618
17      94770
18      125018
19      156406
20      182305
21      195484
22      200031
23      203923
24      205649
25      214001

The branching factor towards the top of the tree is fairly high, but the tree quickly flattens out, and the last four tiers of the tree account for almost half of the total nodes. This suggests one possible space optimisation: remove the last few tiers of the tree and concatenate their leaf nodes together. When performing lookups, check that the words in the selected nodes are actually subset anagrams of the input string; a sketch of that check follows.
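
The verification pass is cheap - the histogram-subtraction test from the brute-force sketch above would do:

    def verify_candidates(candidates, letters):
        # Filter merged leaf word lists down to true subset anagrams.
        return [w for w in candidates if is_subset_anagram(w, letters)]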

* It's possible I'm simply rediscovering something that's well known in the computer science community, or perhaps mentioned in a computer science paper 30 years ago. Significant searching hasn't turned up anyone using an algorithm like this, nor anything more efficient than the brute-force approaches outlined above.

Edit: The source to my initial implementation is here.

Edit: Converting my Python implementation to use lists reduced memory consumption by roughly half. I'll post figures for the pickled tree and the source code when I have the opportunity.

Edit: More updates can be found here.
