## Monday, July 5, 2010

### Kruskal's Algorithm

In kruskal's algorithm the selection function chooses edges in increasing order of length without worrying too much about their connection to previously chosen edges, except that never to form a cycle. The result is a forest of trees that grows until all the trees in a forest (all the components) merge in a single tree. graph

Example

In Kruskal's algorithm, we sort the edges according to their edge weights and start counting them and including them in MST, if they are not forming any cycle.
Consider the graph:

So, we begin with the edge with lowest edge cost or weight, and change its color, making it part of spanning tree:

Likewise we add edge 2, and see if there is no cycle, then we are fine to go.

Now we add edge 3 and see if there is cycle, as there is no cycle, we are good to go.

Now we add edge 4, as it is next in sorted list of edges, but now the issue is it is making cycle, so we ignore it.

Now we add edge 5, and it is good to go, and is part of MST

Now we know that all the vertices are covered and hence it is the MST now.

Now lets see the pseudocode.
Pseudocode

```Sort the edges in order of increasing cost
[Rename the edges 1,2,3,....,m so that c1 < c2 < c3 ....< cm]
T = {}

for i=1 to m {
if ( T[i] has no cycles)

return T
```

Running time of straightforward implementation
Here is the running time  :
• Sorting of edges - O (m log n) [Actually sorting takes O(m log m), but m is nearly O(n2) ]
• Checking if the cycle exists, is actually checking if there is already a path between vertices u and v, which are end points of the new edge. If there is already exists, this will create a cycle. But how do we check if there is path between u and v - answer is BFS or DFS (breadth first search or depth first search). This will take max of O(n-1) vertices scan, and hence O(mn)
• So total running time = O(m log n) + O(mn)

So, here the bummer is checking the cycle which is taking long time, and union find datastructure can help us solving this where it will return the cycle in constant time.

Implementing Kruskal Algorithm via Union-Find
Raise detre of union find data structure - Is to maintain partition of a set of objects.
Operation supported - Union and Find
Find(x) - return the name of group that x belongs to? eg. C3
Union (Ci Cj) - fuse group Ci and Cj into a single one.

Why useful for kruskal algorithm?
When kruskal algorithm starts, all the edges are by themselves, with separate component in itself. so, when the edge is joined to the tree, the 2 components are connected. Suppose in Kruskal algorithm, the Tree has 4 connected components, and edge is coming with u,v vertices, with u in C1 and v in C2, hence 2 components join.
So, union find will contain the vertices as objects, and groups are connected components w.r.t currently connected edges.

Idea 1 : Maintain one linked structure per connected component of (V,T) , with each component has an arbitrary leader.

Invariant - each vertex points to the leader of its component

Key point - given edge (u,v) , we can check if u and v already in the same component in O(1) time (if and if only if leader pointer of u and v match)

Maintaining the variant problem
Also, if leaders dont match, we add to add the edge. But while adding the edge, we have to make sure that invariant of having 1 component pointing to 1 leader holds. In the worst case, it may take linear time to maintain the leader. So, again for each edge, it will lead to O(m.n) time, which is a issue. How to solve this?

To solve this lets have an idea 2:
Idea 2 : When two components merge, we should retain leader of 1 of the components, which will save us from pointer manipulation of atleast 1 component. Now suppose component A has 100 nodes, and B has 1000 nodes, so we have to keep the leader from component B as we have to then only re-write leader for 100 nodes of group A.

But still it doesn't solve our problem, as it will still take O(n) time for 1 edge to merge. Hence again O(m.n) times.

Idea 3 :
Now lets solve this problem in some other way - Lets think it in terms of each vertex term. Suppose that each vertex points to itself in start as there are n disconnected components to begin with. Now each time an edge is added, this vertex may or may not change its leader. So, how many times does a single vertex have its leader pointer update over the course of kruskal's algorithm? Answers is θ (log n).How?
So, when does  the vertex gets updated, when your component of say 20 vertices gets united with some other component of similar size. So, each time 2 components are united we have upper bound of  < = log 2 n times.

Running time of fast implementation
• Sorting - O(m log n ) time for sorting
• O (m) time for cycle check [O (1)  time for each edge, or per iteration - Idea 1)
• O (n log n) time overall for leader pointer updates (Using idea 3)
So, bottleneck for Kruskal algorithm is sorting ~ which is O(m log n) ie. O (|E| log |V|)