Friday, March 28, 2014

Sort an array of strings so that anagrams are next to each other

Problem
Write a method to sort an array of strings so that all the anagrams are next to each other.
Example
INPUT : "xyz", "ca", "ab", "ac", "ba", "zyx" 
OUTPUT: "ab", "ba", "ac", "ca", "xyz", "zyx"

Lets see the solutions now.
Solution

Method 1 - Using bubble sort
Check if two pairs of strings are anagrams or not, if yes, swap.

Java code
private static boolean areAnagrams(String s1, String s2) {
    if (s1.length() != s2.length() || s1 == null || s2 == null)
        return false;
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] table = new int[26];
    for (int i = 0; i < s1.length(); ++i) {
        int index = s1.charAt(i) - 'a';
        table[index]++;
    }
    for (int i = 0; i < s2.length(); ++i) {
        int index = s2.charAt(i) - 'a';
        if (table[index] <= 0)
            return false;
        table[index]--;
    }
    for (int i = 0; i < 26; ++i) {
        if (table[i] > 0)
            return false;
    }
    return true;
}
 
private static void swap(String[] strings, int i, int j) {
    String temp = strings[i];
    strings[i] = strings[j];
    strings[j] = temp;
}
 
public static void sortAnagrams(String[] strings) {
    for (int i = 0; i < strings.length - 1; ++i) {
        for (int j = i + 1; j < strings.length; ++j) {
            if (areAnagrams(strings[i], strings[j]))
                swap(strings, i + 1, j);
        }
    }
}

Here sort anagrams is the main method.
Time complexity - O(n^2).

Method 2 - Use hashtable
Calculate the hash value of each word in such a way that all anagrams have the same hash value. Populate the Hash Table with these hash values. Finally, print those words together with same hash values. A simple hashing mechanism can be modulo sum of all characters. With modulo sum, two non-anagram words may have same hash value. This can be handled by matching individual characters

Method 3 - Sort the individual words and compare
Take two auxiliary arrays, index array and word array. Populate the word array with the given sequence of words. Sort each individual word of the word array. Finally, sort the word array and keep track of the corresponding indices. After sorting, all the anagrams cluster together. Use the index array to print the strings from the original array of strings.

Let us understand the steps with following input Sequence of Words:
"cat", "dog", "tac", "god", "act"
1) Create two auxiliary arrays index[] and words[]. Copy all given words to words[] and store the original indexes in index[]
index[]:  0   1   2   3   4
words[]: cat dog tac god act
2) Sort individual words in words[]. Index array doesn’t change.
index[]:   0    1    2    3    4
words[]:  act  dgo  act  dgo  act
3) Sort the words array. Compare individual words using strcmp() to sort
index:     0    2    4    1    3
words[]:  act  act  act  dgo  dgo
4) All anagrams come together. But words are changed in words array. To print the original words, take index from the index array and use it in the original array. We get
"cat tac act dog god"
Following is C implementation of the above algorithm. In the following program, an array of structure “Word” is used to store both index and word arrays. DupArray is another structure that stores array of structure “Word”.

So, though we are using indexing based on array indices, but we can achieve same thing via hashmap(pseudocode):

    Loop through the array of strings
    For each string,
        first sort its characters,
        using the sorted string as key and original string as value,
        put into a hash table. 
        you will end up with a hash table that with keys as sorted strings,
            and values being all anagrams, meanwhile, those values are ordered.
        You may use map<string, set<string> > to serve for this purpose.
    iterate over the hash-table and output all anagrams together for a given key, 
        they should be next to each other


Time Complexity: Let there be N words and each word has maximum M characters. The upper bound is O(NMLogM + MNLogN).

Method 4 - Non-agnostic solution in java
A simple solution in java is to use Comparable or Comparator interface. Here is the solution:
public class AnagramComparator implements Comparator<String> {
    public String sortChars(String s) {
        char[] content = s.toCharArray();
        Arrays.sort(content);
        return new String(content);
    }
 
    public int compare(String s1, String s2) {
        return sortChars(s1).compareTo(sortChars(s2));
    }
}
 
public static void main(String[] args) {
    String[] strings = { "xyz", "ca", "ab", "ac", "ba", "zyx" };
    AnagramComparator comparator = new AnagramComparator();
    Arrays.sort(strings, comparator);
}

References 
http://tianrunhe.wordpress.com/2012/03/28/sort-an-array-of-strings-so-that-anagrams-are-next-to-each-other/
http://stackoverflow.com/questions/15515324/how-to-sort-an-array-of-strings-with-anagrams-next-to-each-other
http://www.geeksforgeeks.org/given-a-sequence-of-words-print-all-anagrams-together/


0 comments:

Post a Comment