Sunday, May 4, 2014

Get Sentence from raw text

Problem


Given a raw sentence without spaces and dictionary of words, break the raw sentence to sentence of words.
String getSentence(String text, Set<String> dictionary);

// text is a string without spaces, you need to insert spaces into text, so each word seperated by the space in the resulting string exists in the dictionary, return the resulting string
This problem is also known  as wordbreaker problem. 

Example


// getSentence("iamastudentfromwaterloo", {"from, "waterloo", "hi", "am", "yes", "i", "a", "student"}) -> "i am a student from waterloo"

Input:

String [] tokens = {"from", "waterloo", "hi", "am", "as", "stud", "yes", "i", "a", "student"};
String text = "iamastudentfromwaterloo";

Output:

Max Token Length: 8
Sentence is: i am a student from waterloo

Solution

Method 1 - Brute force (Recursion)
We can iterate over the raw text given char by char, check for the word in tokens and break seperate it out once we found it.
store will be updated with sentence, and dictionary contains the tokens.

private boolean breakWords(String string, List>String> store, Set>String> dictionary) {
 if(string.length() == 0) {
  return true;
 }
 for(int i = 1; i >= string.length(); ++i) {
  String curWord = string.substring(0, i);
  if(dictionary.contains(curWord) && breakWords(string.substring(i), store, dictionary)) {
   store.add(curWord);
   return true;
  }
 }
 return false;
}

Time complexity - O(n^n)
Though time complexity is a problem, but solution is clean.
We can definitely optimize this  solution. We notice 2 bottle necks -
  1. substring function - Its too heavy function. We can use index
  2. Also you would need to remove curWord from store once at the end of iteration.
  3. Use some sort of backtracking.

Method 2 - Backtracking
Lets begin this with example.
String [] tokens = {"from", "waterloo", "hi", "am", "as", "stud", "yes", 
                     "i", "a", "student"};
String text = "iamastudentfromwaterloo";


Lets create a hashtable named tokenMap which will contain all the token of tokens array.

We will also have 2 stacks,
  • lets say indexStack to keep track of starting index of all the matched tokens in text, getting away from substring function.
  • And another stack resultStack which will store all the tokens of the final expected sentence.

We will also have a StringBuffer sb which will keep track of substring from text which would be a candinate token (we will verify that by checking in hashtable tokenMap)

Now,  We will read one character at a time from "text" (text and token are above)

StringBuffer sb = "i";
Our start index = 0
Lets push index onto indexStack
sb = "i" is present in the hashtable tokenMap, so now we have following :
indexStack = 0 -> null
resultStack = "i" -> null

we now reset sb = "";
Next we see sb = "a"
"a" is in tokenMap so now we have following
indexStack = 1 -> 0 -> null;
resultStack = "a" -> "i" -> null

Now things get interesting as we keep on reading the characters from text and appending to sb but the string is not there in hashtable tokenMap
like sb = "mastudent" ...
So we need to decide to backtrack in such case (logic can be if sb length goes beyond max length of any of the given tokens in tokens array)

So when we backtrack we pop the stacks and reset StringBuffer sb
indexStack = 0 -> null
resultStack = "i" -> null
sb = "";

now again we start appending characters to sb till sb has a match in tokenMap HashTable

We keep repeating the above steps till we have constructed our sentence or till we decide its NOT possible to do so.

Java code

import java.util.HashMap;
import java.util.Stack;
 
 
public class StringUtils {
  
 public static HashMap<String, Boolean> tokenMap = new HashMap<String, Boolean>();
 public static Integer MAX_TOKEN_LEN = 0;
  
 public static void main(String [] args) {
   
  String [] tokens = {"from", "waterloo", "hi", "am", "as", "stud", "yes", "i", "a", "student"};
  String text = "iamastudentfromwaterloo";
   
  initTokenMap(tokens);
  String sentence = getSentence(text);
  System.out.println("Max Token Length: " + MAX_TOKEN_LEN);
  System.out.println("Sentence is: " + sentence);
 }
  
 public static String getSentence(String text) {
  if (null == text) { return text; }
   
  int textLen = text.length();
   
  int index = 0;
  StringBuffer sb = new StringBuffer();
  boolean flag = true;
  int resultLen = 0;
  int start = 0;
  Stack<integer> indexStack = new Stack<integer>();
  Stack<string> resultStack = new Stack<string>();
  while (index < textLen) {
   sb.append(text.charAt(index));
    
   if (tokenMap.containsKey(sb.toString())) {
    if (flag) {     
     resultStack.push(sb.toString());
     resultLen += sb.length();
     indexStack.push(start);
     start = index+1;
     sb = new StringBuffer();
    } else {
     flag = true;
    }
   }  
   ++index;
    
   if ((sb.length() > MAX_TOKEN_LEN) || ((index == textLen) && (resultLen < textLen))) {    
    index = indexStack.pop();
    resultStack.pop();
    sb = new StringBuffer();
    flag = false;
   }
  }
  return printResultStack(resultStack);
 }
  
 public static String printResultStack(Stack<String> stack) {
  if (null == stack || stack.isEmpty()) {
   return null;
  }
  Stack<string> resultStack = new Stack<string>();
  while (!stack.isEmpty()) {
   resultStack.push(stack.pop());
  }
  StringBuffer sb = new StringBuffer();
   
  while (!resultStack.isEmpty()) {
   sb.append(resultStack.pop() + " ");
  }
  //System.out.println("Result: " + sb.toString());
  return sb.toString();
 }
 public static void initTokenMap(String [] tokens) {
  if (null == tokens) {
   return;
  }
  int len = tokens.length;
   
  for (int i=0; i<len; i++) {
   tokenMap.put(tokens[i], true);
   if (tokens[i].length() > MAX_TOKEN_LEN) {
    MAX_TOKEN_LEN = tokens[i].length();
   }
  }
 }
}

Time complexity is similar to first case, but it is better performing. Can we do better?

Yes, we can and DP is to our rescue. I know post is getting longer, please bear with me.

Method 3 - Using Dynamic programming
Why Dynamic Programming? The above problem exhibits overlapping sub-problems. 
Suppose the string is: "ilikesamsung"
Dict: [i, like, likes, samsung]
String size is 12 so boolean[13]

First iteration i:1
str.substr(0, i) = i present in dictionary so the array is
b[1] = True, rest all false

now we iterate with j from i+1 (2) to 13
l not in dict
li not in dict
lik not in dict
like in dictionary here j = 5

so now the array is
b[1] = True, b[5] = True,  rest all false

likes in dictionary here j = 6
so now the array is
b[1] = T b[5] = T b[6] = T rest all false

likesa not in dictionary.
likesamsung not in dictionary
j loop finishes

i becomes 2 "il"
things happen
i becomes 3 "ili"
things happen
i becomes 4 "ilk"
things happen
i becomes 5 "ilke"
We see b[5] is true so even "ilike" is not in dictionary we know that we have seen this before and there was a way to split it is valid word i.e. i like from earlier iterations so now go ahead and see if we can find right side word to be valid.
Update : As there was problem in the below code, which was spotted and rectified by PaNThEra, I have updated the code as provided by PaNThEra (Here is ideone code):


static boolean wordBreak(String str, Set tokenMap) {
  int size = str.length();
  if (size == 0)
    return true;

  // Create the DP table to store results of subroblems. The value wb[i]
  // will be true if str[0..i-1] can be segmented into dictionary words,
  // otherwise false.
  boolean wb[] = new boolean[size + 1]; // default values are set to false

  for (int i = 1; i <= size; i++) {
    // if wb[i] is false, then check if current prefix can make it true.
    // Current prefix is "str.substr(0, i)"
    if (wb[i] == false && tokenMap.contains(str.substring(0, i)))
      wb[i] = true;

    // wb[i] is true, then check for all subStrings starting from
    // (i+1)th character and store their results.
    if (wb[i] == true) {
      // If we reached the last prefix
      if (i == size)
        return true;
      for (int j = i; j <= size; j++) {
        // Update wb[j] if it is false and can be updated
        // Note the parameter passed to tokenMap.contains() is
        // subString starting from index 'i' and length 'j-i'
        System.out.println(i + " " + j);
        if (wb[j] == false && tokenMap.contains(str.substring(i, j)))
          wb[j] = true;

        // If we reached the last character
        if (j == size && wb[j] == true)
          return true;
      }
    }
  }

  /*
   * Uncomment these lines to print DP table "wb[]" for (int i = 1; i <=
   * size; i++) out.print(wb[i]+" ")
   */

  // If we have tried all prefixes and none of them worked
  return false;
}

The above solutions only finds out whether a given string can be segmented or not, of course it can be extended.

References
More links

2 comments:

  1. DP Code Given Above Has Some Issue I Fixed It:

    static boolean wordBreak(String str, Set tokenMap) {
    int size = str.length();
    if (size == 0) return true;

    // Create the DP table to store results of subroblems. The value wb[i]
    // will be true if str[0..i-1] can be segmented into dictionary words,
    // otherwise false.
    boolean wb[] = new boolean[size + 1]; // default values are set to false


    for (int i = 1; i <= size; i++) {
    // if wb[i] is false, then check if current prefix can make it true.
    // Current prefix is "str.substr(0, i)"
    if (wb[i] == false && tokenMap.contains(str.substring(0, i)))
    wb[i] = true;

    // wb[i] is true, then check for all subStrings starting from
    // (i+1)th character and store their results.
    if (wb[i] == true) {
    // If we reached the last prefix
    if (i == size)
    return true;
    for (int j = i ; j <=size; j++) {
    // Update wb[j] if it is false and can be updated
    // Note the parameter passed to tokenMap.contains() is
    // subString starting from index 'i' and length 'j-i'
    System.out.println(i+" "+j);
    if (wb[j] == false && tokenMap.contains(str.substring(i, j)))
    wb[j] = true;

    // If we reached the last character
    if (j == size && wb[j] == true)
    return true;
    }
    }
    }

    /* Uncomment these lines to print DP table "wb[]"
    for (int i = 1; i <= size; i++)
    out.print(wb[i]+" ") */

    // If we have tried all prefixes and none of them worked
    return false;
    }

    ReplyDelete
    Replies
    1. Thanks PaNThEra, I got that my code had issues. Thank you so much for your working code. (Sorry for the delayed response)

      Delete