Thursday, February 17, 2011

Anagram Trees : Part 2

One nice thing about working at Google is that you are surrounded by very smart people. I told one of my coworkers about the anagram tree idea, and he immediately pointed out that reordering the alphabet so that the least frequently used letters come first would reduce the branching factor early in the tree, which has the effect of reducing the overall size of the tree substantially. While this seems obvious in retrospect, it's kind of unintuitive - usually we try to _increase_ the branching factor of n-ary trees to make them shallower and require fewer operations, rather than trying to reduce it.

Trying it out with an ordering determined by looking at the branching factor for each letter produces results that bear this out: Memory is reduced by about a third, and the number of internal nodes is reduced to 858,858 from 1,874,748, a reduction of more than 50! Though I haven't benchmarked it, difficult lookups are substantially faster, too.

The next logical development to try is to re-evaluate the order of the alphabet on a branch-by-branch basis. While I doubt this will have a substantial impact, it seems worth a try, so I'll give it a go and update with results.

Edit: Re-evaluating the symbol to choose on a branch-by-branch basis had a bigger impact than I anticipated: The tree created with my sample dictionary now has a mere 661,659 internal nodes. Here's the procedure for creating a tree using this method:

Assuming you have:
  • A dictionary
  • A set of symbols that have not yet been used (initially set to the alphabet)
  1. If the symbol set is empty, this is a leaf node - store the dictionary in the node and return.
  2. Find the symbol from the set that, if used, will result in the smallest number of branches (that is, the symbol that has the least variation in number of occurrences).
  3. Mark the current node with the chosen symbol
  4. Partition the dictionary into sub-dictionaries based on how many occurrences of the chosen symbol they have
  5. For each sub-dictionary, recurse with the sub-dictionary and the set less the symbol you selected.
Implemented in Python, this is actually substantially larger in memory and on disk than the previous approach, likely due to overhead with using classes instead of tuples as the nodes. In statically-typed languages, however, the overhead should be substantially outweighed by the benefit of the reduction in node count.

Note that the result of this alternate method is that while the letter to branch on is different for every node, following nodes from any leaf to the root of the tree always results in a valid permutation of the alphabet used.

Edit 2: The code for a Python implementation incorporating these ideas can be found here

0 comments:

Post a Comment