Tuesday, November 5, 2013

Sequence Alignment using Dynamic Programming

Problem :
Input :
2 strings or sequence of characters i.e. X = x1x2...xm, Y = y1y2...yn

Goal - To define the similarity matrix between the 2 sequences with best alignment. This is based on penalty score, called Needleman/Wunsch score.

Eg. consider the sequence - AGGGCT and AGGCA , best alignment here is:
AGGGCT
 |  |  |      |
AGG - CA

Penalty
Penalty = penalty of match+penalty of difference + penalty of space

In the problem all the penalties are given.


For this example, the two sequences to be globally aligned are G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)
So M = 11 and N = 7 (the length of sequence #1 and sequence #2, respectively)
A simple scoring scheme is assumed where
  • Si,j = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
  • Si,j = 0 (mismatch score)
  • w = 0 (gap penalty) 
Applying the DP
 Now lets apply the DP here. Suppose X is x1x2...xm, and Y=y1y2...yn. In this case, the last array position can have 3 possibilities (in top and bottom sequence respectively):
  1. xm and yn are aligned
  2. xm is aligned with space / gap
  3. space / gap  is aligned with yn
(Note that space cannot align with space, as it will be useless)

Optimal substructure
Let X' = X- {xm} , Y' = Y - {yn}
If case 1  holds, then induced alignment of X' and Y' as X' and Y' are of same size.
Similarly for case 2 and case 3.

Relevant sub problems 
Here sub problems get smaller, by plucking the characters from either 1st string or 2nd string or both. Hence we need to track how many characters we have plugged, and hence we need 2D array to keep track tof 2 things.


Optimal solution for penalty:
Let P denotes the Penalty here.
Lets consider the case 1 - xm and yn are aligned
Mij = Sij  + M(i-1)(j-1)

Case2 - xm is aligned with space / gap
Mij = w + M(i-1,j)

Case 3 - space / gap  is aligned with yn
Mij = w + M(i,j-1)

Mi,j = MAXIMUM[
     Mi-1, j-1 + Si,j (match/mismatch in the diagonal),
     Mi,j-1 + w (gap in sequence #1),
     Mi-1,j + w (gap in sequence #2)]
 
In case the penalties are such that we are given penalty Sij of mismatch is higher than penalty of match, then we have to use minimum function.
 
Three steps in dynamic programming
  1. Initialization
  2. Matrix fill (scoring)
  3. Traceback (alignment)
Initialization Step-edge cases
The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned.
Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.(Note in normal problem if Sgap is not 0, then use it some as i.gap)


Matrix Fill Step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left hand corner in the matrix and finding the maximal score Mi,j for each position in the matrix. In order to find Mi,j for any i,j it is minimal to know the score for the matrix positions to the left, above and diagonal to i, j. In terms of matrix positions, it is necessary to know Mi-1,j, Mi,j-1 and Mi-1, j-1.
For each position, Mi,j is defined to be the maximum score at position i,j; i.e.
Mi,j = MAXIMUM[
     Mi-1, j-1 + Si,j (match/mismatch in the diagonal),
     Mi,j-1 + w (gap in sequence #1),
     Mi-1,j + w (gap in sequence #2)]
Note that in the example, Mi-1,j-1 will be red, Mi,j-1 will be green and Mi-1,j will be blue.
Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S1,1 = 1, and by the assumptions stated at the beginning, w = 0. Thus, M1,1 = MAX[M0,0 + 1, M1, 0 + 0, M0,1 + 0] = MAX [1, 0, 0] = 1.
A value of 1 is then placed in position 1,1 of the scoring matrix.
Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take the example of row 1. At column 2, the value is the max of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8. At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). The value will again be 1. The rest of row 1 and column 1 can be filled with 1 using the above reasoning.
Now let's look at column 2. The location at row 2 will be assigned the value of the maximum of 1(mismatch), 1(horizontal gap) or 1 (vertical gap). So its value is 1.
At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2(match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.
Moving along to position colum 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap), 2 (vertical gap) so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be the exact same as in row 4 since there are no matches. The final row will contain the value 2 since it is the maximum of 2 (match), 1 (horizontal gap) and 2(vertical gap).
Using the same techniques as described for column 2, we can fill in column 3.
After filling in all of the values the score matrix is as follows:

Traceback Step

After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring algorithm such as one that is used here, there are likely to be multiple maximal alignments.
The traceback step begins in the M,J position in the matrix, i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.


Traceback takes the current cell and looks to the neighbor cells that could be direct predacessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for traceback chooses as the next cell in the sequence one of the possible predacessors. In this case, the neighbors are marked in red. They are all also equal to 5.


Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predacessor is the diagonal match/mismatch neighbor. If more than one possible predacessor exists, any can be chosen. This gives us a current alignment of

    (Seq #1)      A 
                  |
    (Seq #2)      A
So now we look at the current cell and determine which cell is its direct predacessor. In this case, it is the cell with the red 5.

 


The alignment as described in the above step adds a gap to sequence #2, so the current alignment is

    (Seq #1)     T A
                   |
    (Seq #2)     _ A
Once again, the direct predacessor produces a gap in sequence #2.
After this step, the current alignment is

      (Seq #1)     T T A
                       |
                   _ _ A 
Continuing on with the traceback step, we eventually get to a position in column 0 row 0 which tells us that traceback is completed. One possible maximum alignment is :
Giving an alignment of :
          G A A T T C A G T T A
          |   |   | |   |     | 
          G G A _ T C _ G _ _ A 
 
An alternate solution is:
Giving an alignment of :
          G _ A A T T C A G T T A
          |     |   | |   |     | 
          G G _ A _ T C _ G _ _ A

 
There are more alternative solutions each resulting in a maximal global alignment score of 6. Since this is an exponential problem, most dynamic programming algorithms will only print out a single solution.

Time complexity

O(m n)
m = size of sequence X, and n = size of Y

 

Monday, October 21, 2013

Application of clustering

Informal goal - given n points [web pages images, genome fragments, etc] classify into "coherent groups"
People in machine learning must have heard of unsupervised learning.
Assumption
1) As input, given a (dis) similarity measure
2 symmetric

examples - euclidean distance, genome similarity

goal - same clusters => nearby

So, coherent group has smaller distance.

Objective function to do clustering - there are many objective functions, but we are reading max-spacing k-clusterings. 

Max spacing  k-clustering
Assume we know  k = number of clusters desired
[In practice, can experiment with a range of values]

Call points p and q seperated if they are assigned to different clusters.

Problem statement - given a distance d and k, compute the k-clustering with maximum spacing.

greedy algorithm

pseudocode



Saturday, October 19, 2013

Minimum Spanning tree MST

MST is used to connect a bunch of points to each other as cheaply as possible.

Applications
  • clustering
  • networking

Blazingly fast greedy algorithms
There are many greedy algorithm, but we will talk about 2:
  • Prim's algo [1957, also djikstra' 1959] or Jarnic found it in 25 years earlier.
  • Kruskal's algo[1956]. Will be using union find data-structure.
They are blazingly fast as they are almost close to linear time of number of edges:
O( m log n) time m = edges, n=vertices OR better O (|E| log |V|)

Problem definition
Input - undirected graph G (V,E) .

We assume adjacency list representations
Ce - cost of edge e ∈ E
Note :
  1. Its ok if edge costs are negative (opposite to djikstra's algo)
  2. We are using graph undirected graph. For directed graph, this problem is called optimal branching. Those algo are out of scope of this course

Output - Min. cost tree T subset of E that spans all vertices.

What do we mean by cost here?
cost here means summing up of all the edge costs or weights.

What do we mean by spanning tree - aka difference between spanning tree and minimum spanning tree?
Spanning tree properties
A sub-graph T of a connected graph G(V,E) is called a Spanning Tree if
  • T has no cycles 
  • if T includes every vertex of G i.e. V(T)=V(G)
If |V|=n and |E|=m, then the spanning tree of G must have n vertices and hence n-1 edges.
The resultant spanning ensure that the graph remain connected and further there is no circuit in it.

Two algorithms for finding a spanning tree are BFS (Breadth First Search) and DFS (Depth First Search).

Minimum spanning tree
  minimum spanning tree (MST) or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree.

Two algorithms commonly used, Prim's algorithm and Kruskal's algorithm.


Assumptions made

  1. G cotnains path from any 1 node to other node.
  2. Edge costs are distinct (though this is not very important as both prims and kruskal's algorithms work with non distinct edges)
Algorithms to solve MST

Saturday, October 5, 2013

Binary Tree Level-Order Traversal Using Depth First Search (DFS) [Not to USE BFS]

Given a binary tree, print out the tree in level order (ie, from left to right, level by level). Output a newline after the end of each level. Breadth First Search (BFS) is not allowed. We have already seen how to do level order traversal here.

Example
So consider the tree:
      1
   /    \
  2      3
 /   \   /    \
4   5  6   7
The BFS or level order traversal here is :
1 2 3 4 5 6 7

Last time we used BFS, but this time it is not allowed.

Hint:
Write a function call printLevel(BinaryTree *p, int level) which will print all nodes of a given level. Assume you know the height of the tree, then you can print the entire tree level by level utilizing printLevel.

Solution:
printLevel function can be solved using DFS. Decrement level by one as you advance to the next level. When level equals 1, you’ve reached the given level. To find the maximum height (or the max depth) of a tree, you can read my post: Maximum Height or Depth of a Binary Tree.
void printLevel(BinaryTree *p, int level) {
  if (!p) return;
  if (level == 1) {
    cout << p->data << " ";
  } else {
    printLevel(p->left, level-1);
    printLevel(p->right, level-1);
  }
}
 
void printLevelOrder(BinaryTree *root) {
  int height = maxHeight(root);
  for (int level = 1; level <= height; level++) {
    printLevel(root, level);
    cout << endl;
  }
}

Further Thoughts:
If you look carefully, you will notice that the DFS solution traverses the same node multiple times. Since BFS traverses each node exactly one time, BFS is much more efficient than DFS.
Could you find the run time complexity for the DFS level-order traversal solution? Try to estimate as best as you can, and then find the correct answer by proving it using Math. Does your estimate fares well with the correct answer? Why?
Answer:
Although the DFS solution traverse the same node multiple times, it is not another order slower than the BFS solution. Here is the proof that the DFS solution above runs in O(N) time, where N is the number of nodes in the binary tree and we assume that the binary tree is balanced.

We first compute the complexity of printLevel for the kth level:
T(k) = 2T(k-1) + c
     = 2k-1 T(1) + c
     = 2k-1 + c

Assuming it’s a balanced binary tree, then it would have a total of lg N levels.
Therefore, the complexity of printing all levels is:
T(1) + T(2) + ... + T(lg N)
= 1 + 2 + 22 + ... + 2lg N-1 + c
= O(N)

Finding the maximum height of the tree also takes O(N) time, therefore the overall complexity is still O(N).

Thanks.

Binary Tree Post-Order Traversal - Recursive and Iterative Solution

Consider the tree:



To traverse a binary tree in Postorder, following operations are carried-out (i) Traverse all the left external nodes starting with the left most subtree which is then followed by bubble-up all the internal nodes, (ii) Traverse the right subtree starting at the left external node which is then followed by bubble-up all the internal nodes, and (iii) Visit the root.
Therefore, the Postorder traversal of the above tree will outputs:
0, 2, 4, 6, 5, 3, 1, 8, 10, 9, 7

Recursive solution

postorder(Node *root)
{
  if(root)
  {
    postorder(root->left);
    postorder(root->right);
    printf("Value : [%d]", root->value);
  }
}

Iterative solution
Iterative solution involves 2 stacks. So if we want to do the same thing iteratively; we just need to maintain 2 stacks child and parent. Following is the algorithm of this method:

  • Push the root node to the child stack.
  • while child stack is not empty
  • Pop a node from the child stack, and push it to the parent stack.
  • Push its left child followed by its right child to the child stack.
  • end while
  • Now the parent stack would have all the nodes ready to be traversed in post-order. Pop off the nodes from the parent stack one by one and you will have the post order traversal of the tree.

Here is the code for the same:
void postOrderTraversalIterativeTwoStacks(TreeNode *root)  
    {  
        if (!root) return;  
        stack<binarytree*> child;  
        stack<binarytree*> parent;  
          
        //Initialization  
        child.push(root);  
          
        while (!child.empty()) {  
            BinaryTree *curr = child.top();  
            parent.push(curr);  
            child.pop();  
            if (curr->left)  
                child.push(curr->left);  
            if (curr->right)  
                child.push(curr->right);  
        }  
          
        //Printing the post order traversal      
       while (!parent.empty()) {          
            cout << parent.top()->data << " ";      
            parent.pop();  
        }  
    }  

Dry running the code
Lets see a light example of how it works. Suppose we have a binary tree as shown below and we need to compute its post order traversal. It's post order traversal will be
{A, C, E, D, B, H, I, G, F}


 Here is how the stack grows:

 Iterative solution using arrays
void iterativePostorder(node *root) {
  struct   { 
    node *node; 
    unsigned vleft :1;   // Visited left?
    unsigned vright :1;  // Visited right?  
 }save[100];   

int top = 0;   
save[top++].node = root;   

 while ( top != 0 )   {       
/* Move to the left subtree if present and not visited */  
     if(root->left != NULL && !save[top].vleft)       { 
          save[top].vleft = 1; 
          save[top++].node = root;  
          root = root->left;   
         continue; 
      }       /* Move to the right subtree if present and not visited */ 
      if(root->right != NULL && !save[top].vright )       { 
          save[top].vright = 1; 
          save[top++].node = root; 
          root = root->right; 
          continue; 
      } 
      printf("[%d] ", root->value); 
      /* Clean up the stack */ 
      save[top].vleft = 0;  
      save[top].vright = 0; 
      /* Move up */   
     root = save[--top].node;    
 } 
}

http://leetcode.com/2010/10/binary-tree-post-order-traversal.html

Binary Tree In-Order Traversal - Recursive and Iterative Solution


Consider the tree
Inorder traversal prints the binary tree in increasing order in case of Binary Search Tree, but here we are discussing any binary tree.

To traverse a binary tree in Inorder, following operations are carried-out :
  1. Traverse the left most subtree starting at the left external node, 
  2. Visit the root, and
  3. Traverse the right subtree starting at the left external node.
Therefore, the Inorder traversal of the above tree will outputs:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10



Solutions

There are 3 solutions to achieve this. One is iterative and other is recursive. As we move from recursive to iterative, stack is the obvious choice. But we also have Morris Traversal.

Method 1 - Recursive solution

inorder(Node *root)
{
  if(root)
  {
    inorder(root->left);
    printf("Value : [%d]", root->value);
    inorder(root->right);
  }
}


Method 2 - Iterative solution using Stack

Using Stack is the obvious way to traverse tree without recursion. Below is an algorithm for traversing binary tree using stack. See this for step wise step execution of the algorithm.
1) Create an empty stack S.
2) Initialize current node as root
3) Push the current node to S and set current = current.left until current is NULL
4) If current is NULL and stack is not empty then
     a) Pop the top item from stack.
     b) Print the popped item, set current = current.right
     c) Go to step 3.
5) If current is NULL and stack is empty then we are done.

Here is the code in java:
public static <T> void InorderTraversalIterativeWithStack(
  BinaryTreeNode<T> root) {
 Stack<BinaryTreeNode<T>> s = new Stack<BinaryTreeNode<T>>();

 BinaryTreeNode<T> current = root;
 while (current != null) {
  s.add(current);
  current = current.left;
 }

 while (!s.isEmpty() && current == null) {
  current = s.pop();
  out.print(current.data+ " ");
  
  current = current.right;
  while (current != null) {
   s.add(current);
   current = current.left;
  }
 }

}



Let us consider the below tree for example
            1
          /   \
        2      3
      /  \
    4     5

Step 1 Creates an empty stack: S = NULL

Step 2 sets current as address of root: current -> 1

Step 3 Pushes the current node and set current = current->left until current is NULL
     current -> 1
     push 1: Stack S -> 1
     current -> 2
     push 2: Stack S -> 2, 1
     current -> 4
     push 4: Stack S -> 4, 2, 1
     current = NULL

Step 4 pops from S
     a) Pop 4: Stack S -> 2, 1
     b) print "4"
     c) current = NULL /*right of 4 */ and go to step 3
Since current is NULL step 3 doesn't do anything.

Step 4 pops again.
     a) Pop 2: Stack S -> 1
     b) print "2"
     c) current -> 5/*right of 2 */ and go to step 3

Step 3 pushes 5 to stack and makes current NULL
     Stack S -> 5, 1
     current = NULL

Step 4 pops from S
     a) Pop 5: Stack S -> 1
     b) print "5"
     c) current = NULL /*right of 5 */ and go to step 3
Since current is NULL step 3 doesn't do anything

Step 4 pops again.
     a) Pop 1: Stack S -> NULL
     b) print "1"
     c) current -> 3 /*right of 5 */ 

Step 3 pushes 3 to stack and makes current NULL
     Stack S -> 3
     current = NULL

Step 4 pops from S
     a) Pop 3: Stack S -> NULL
     b) print "3"
     c) current = NULL /*right of 3 */ 

Now what if we can't use extra space. Then, we use Morris traversal.

Method 3 - Using Morris traversal

For BST traversal, recursion is a natural way of thought. But sometimes interviewer want to know the iterative way of traversal.
Inorder traversal without recursion:
Using Morris Traversal, we can traverse the tree without using stack and recursion. The idea of Morris Traversal is based on Threaded Binary Tree. In this traversal, we first create links to Inorder successor and print the data using these links, and finally revert the changes to restore original tree.
1. Initialize current as root
2. While current is not NULL
   If current does not have left child
      a) Print current’s data
      b) Go to the right, i.e., current = current->right
   Else
      a) Make current as right child of the rightmost node in current's left subtree
      b) Go to this left child, i.e., current = current->left


Using Morris Traversal, we can traverse the tree without using stack and recursion. The idea of Morris Traversal is based on Threaded Binary Tree. In this traversal, we first create links to Inorder successor and print the data using these links, and finally revert the changes to restore original tree.
1. Initialize current as root
2. While current is not NULL
If current does not have left child
a) Print current’s data
b) Go to the right, i.e., current = current->right
Else
a) Make current as right child of the rightmost node in current's 
   left subtree
b) Go to this left child, i.e., current = current->left

Here is the code in java
//MorrisTraversal
public static <T> void InorderTraversalIterativeWithoutStack(
  BinaryTreeNode<T> root) {
 BinaryTreeNode<T> current, pre;
 if (root == null)
  return;
 current = root;
 while (current != null) {
  if (current.left == null) {
   System.out.print(current.data + " ");
   current = current.right;
  } else {
   pre = current.left;
   while (pre.right != null && pre.right != current)
    pre = pre.right;
   if (pre.right == null) {
    pre.right = current;
    current = current.left;
   } else {
    pre.right = null;
    System.out.print(current.data + " ");
    current = current.right;
   }
  }
 }
}

Here is the tree after this algo(taken from http://articles.leetcode.com/2010/04/binary-search-tree-in-order-traversal.html):


Comparison between Method 2 and 3


Compared to stack based traversal this takes no extra space. Though tree is modified during traversal, it is reverted back to original.

References
http://leetcode.com/2010/04/binary-search-tree-in-order-traversal.html
http://www.geeksforgeeks.org/inorder-tree-traversal-without-recursion/

Find the rank w.r.t K - Number of nodes with value less than K

If the rank of the BST is k, it implies how many nodes/keys are less than k.

So, it boils down to 3 easy recursive calls
  • In the simplest case if K==node value, then whole of the let is rank of the node
  • if K < node.value then we know that rank depends on the size of left sub tree and its less than sub tree's length
  • If K > node.value then we know that we have to count full left subtree w.r.t with that node, and some nodes in right

public int rank(int K, node x)
{
   if(x==null) return 0;
   if(K < x.data)   return rank(K,x.left);
   else if(K > x.data)   return 1 + size(x.left) + rank(K,x.right);
   else if(K == x.data)   return size(x.left);   
}