Wednesday, July 23, 2014

Shortest distances between two words in a file

Problem

You have a large text file containing words. Given any two words, find the shortest distance (in terms of number of words) between them in the file. Can you make the searching operation in O(1) time? What about the space complexity for your solution?
OR
Find the minimum distance between 2 numbers in the array?

Example

File contains:
as was is the as the yahoo you me was the and
Example 1
Input words: the, was
Output distance :

( as was is the as the yahoo you me was the and)
 

Example 2
Input words : you, the
Output distance: 2

(as was is the as the yahoo you me was the and)

Solution

Method 1 - Use 2 indices and current min
We have already solved this example here for array of integers, same logic has to be applied on array of words.
static int getMinDistance(String[] words, String word1, String word2) {
 int pos = 0;
 int min = Integer.MAX_VALUE / 2;
 int word1_pos = -min;
 int word2_pos = -min;
 for (int i = 0; i < words.length; i++) {
  String current_word = words[i];
  if (current_word.equals(word1)) {
   word1_pos = pos;  
   int distance = word1_pos - word2_pos;
   if (min > distance)
   min = distance;
  } else if (current_word.equals(word2)) {
   word2_pos = pos;
   int distance = word2_pos - word1_pos;
   if (min > distance) min = distance;
  }
  ++pos;
 }
 return min;
}

Time Complexity O(n)
Space Complexity O(1)

Method 2 - Using hashtable
To solve this problem in less time (but more space), we can create a hash table with each word and the locations where it occurs. We then just need to find the minimum (arithmetic) difference in the locations (e.g., abs(word0.loc[1] - word1.loc[5])).
  1. Pre-processing: Store the locations for different words in a hashtable. One scan of the text: O(n) time complexity. Approximately O(n) space complexity.
  2. Query: Modified binary search. For example, query is (hello, world). Lookup in the hashtable we have "hello" -> [1,2] and "world" -> [5,8,9]. We search 1 in [5,8,9] to find the nearest, which is 5. So the distance is 4. We search 2 in [5,8,9] to find the nearest, which is 5 again, yielding distance 3, less than 4. So the shortest distance between "hello" and "world" is 3. In practice, number of locations for a word is relatively small comparing to the size of the text, hence the cost of query/search is nearly constant.

So, if 2 numbers have to be changed again and again, method 1 is not that good, but method 2 will be better. Here is the code:
private static Map<String, ArrayList<Integer>> locations = new HashMap<String, ArrayList<Integer>>();
 
private static void storeLocations(String[] text) {
    for (int i = 0; i < text.length; ++i) {
        String word = text[i];
        if (locations.keySet().contains(word)) {
            ArrayList<Integer> location = locations.get(word);
            location.add(i);
            Collections.sort(location);
            locations.put(word, location);
        } else {
            ArrayList<Integer> location = new ArrayList<Integer>();
            location.add(i);
            locations.put(word, location);
        }
    }
}
 
private static int modified_binary_search(ArrayList<Integer> array,
        int target) {
    int low = 0;
    int high = array.size();
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (target == array.get(mid))
            return target;
        if (target < array.get(mid))
            high = mid;
        else
            low = mid + 1;
    }
    if (low >= 0 && low < array.size())
        return array.get(low);
    else
        return array.get(array.size() - 1);
}
 
private static int shortest_distance(String a, String b) {
    int min = Integer.MAX_VALUE;
    for (int index_a : locations.get(a)) {
        ArrayList<Integer> array = locations.get(b);
        int nearest_index_b = modified_binary_search(array, index_a);
        int distance = Math.abs(nearest_index_b - index_a);
        if (distance < min)
            min = distance;
    }
    return min;
}

Time Complexity O(n)
Space Complexity O(n)


References

0 comments:

Post a Comment