Saturday, August 10, 2013

Given a dictionary of word - Group all words into set of anagrams of each other

Given a set of words in a dictionary, write a program to group all words which are anagrams of each other in to sets.

input: “bat”, “tar”, “xyz” , “tab”, “tar”     
output:[["bat", tab"], ["tar", rat"],”xyz” ]

Anagrams
Two words are anagrams if one of them has exactly same characters as that of the another word.

Example : Anagram & Nagaram are anagrams (case-insensitive).

Approach
The simpler logic would be to :
  1. Use some sort of a hash structure - key being string, and value may be list of words
  2. Sort all words alphabetically and use the sorted word as the key, the value being the non sorted word. So bat will become abt, and it is the key in the dictionary and values will be bat and tab.
  3. If the key is found, keep on adding the original word to the ‘value’ list of strings
Something like this :
Solution 1 : Using hash table
def compare_sort(x):
        sorted_key_list = list(x)
        sorted_key_list.sort()
        sorted_str = ','.join(str(n) for n in sorted_key_list)
        sorted_str = sorted_str.replace(',','')
        return sorted_str

source = ['tab','bat','atb','tac','cat','xyz']

target_hash = {}

for x in source:
        sorted_key = compare_sort(x)
        print sorted_key

        if target_hash.has_key(sorted_key):
                existing_list = target_hash[sorted_key]
                existing_list.append(x)
                target_hash[sorted_key] = existing_list
        else:
                target_list =[x]
                target_hash[sorted_key] = target_list

print target_hash     

Complexity: number of words in dictionary
space complexity : hash table entries

Solution 2 - Assigning characters with prime numbers

The obvious solution is to map each character to a prime number and multiply the prime numbers. So if 'a'' -> 2 and 'b' -> 3, then
  • 'ab' -> 6
  • 'ba' -> 6
  • 'bab' -> 18
  • 'abba' -> 36
  • 'baba' -> 36
To minimise the chance of overflow, the smallest primes could be assigned to the more frequent letters (e,t,i,a,n). Note: The 26th prime is 101.So, for all the words where the integer comes out to be same, you have an anagram. Here is more you can read on this.


Reactions:

0 comments:

Post a Comment