Problem

Solution

Forget that we’re dealing with millions of users at first. Design this for the simple case.

We can construct a graph(preferably undirected) by assuming every person is a node and if there is an edge between two nodes, then the two people are friends with each other.

An undirected graph of some variety, probably stored as a sparse adjacency matrix. As far as finding the shortest path between two people, since the cost of edges is uniform, I would consider going with a bidirectional search.

Basically, go out in concentric circles centered around each person, where each circle is the person himself, then his friends, then his friends-of-friends, etc., at each step testing if there is any person in both circles. Follow the path from the first person you find inward toward the center of each person, and you've found the shortest path.

You could try other shortest-path algorithms, but in general, most shortest-path algorithms only give you the distance and not the actual path.

If I want to find the connection between two people, I would start with one person and do a simple breadth first search.

But... oh no! Millions of users!

When we deal with a service the size of Orkut or Facebook, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work—our friends may not live on the same machine as us. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:

For each friend ID:

Go to machine machine_index

Optimization: Reduce Machine Jumps

Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch these jumps—e.g., if 5 of my friends live on one machine, I should look them up all at once.

Optimization: Smart Division of People and Machines

People are much more likely to be friends with people who live in the same country as them. Rather than randomly dividing people up across machines, try to divvy them up by country, city, state, etc. This will reduce the number of jumps.

Question: Breadth First Search usually requires “marking” a node as visited. How do you do that in this case?

Usually, in BFS, we mark a node as visited by setting a flag visited in its node class. Here, we don’t want to do that (there could be multiple searches going on at the same time, so it’s bad to just edit our data). In this case, we could mimic the marking of nodes with a hash table to lookup a node id and whether or not it’s been visited.

Other Follow-Up Questions:

In the real world, servers fail. How does this affect you?

How could you take advantage of caching?

Do you search until the end of the graph (infinite)? How do you decide when to give up?

In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing?

The following code demonstrates our algorithm:

The complete code can be downloaded from github.Thanks.

How would you design the data structures for a very large social network (Facebook, LinkedIn, etc)? Describe how you would design an algorithm to show the connection, or path, between two people (e.g., Me -> Bob -> Susan -> Jason -> You).

Solution

Forget that we’re dealing with millions of users at first. Design this for the simple case.

We can construct a graph(preferably undirected) by assuming every person is a node and if there is an edge between two nodes, then the two people are friends with each other.

class Person { Person[] friends; // Other info }

An undirected graph of some variety, probably stored as a sparse adjacency matrix. As far as finding the shortest path between two people, since the cost of edges is uniform, I would consider going with a bidirectional search.

Basically, go out in concentric circles centered around each person, where each circle is the person himself, then his friends, then his friends-of-friends, etc., at each step testing if there is any person in both circles. Follow the path from the first person you find inward toward the center of each person, and you've found the shortest path.

You could try other shortest-path algorithms, but in general, most shortest-path algorithms only give you the distance and not the actual path.

If I want to find the connection between two people, I would start with one person and do a simple breadth first search.

But... oh no! Millions of users!

When we deal with a service the size of Orkut or Facebook, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work—our friends may not live on the same machine as us. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:

For each friend ID:

int machine_index = lookupMachineForUserID(id);

Go to machine machine_index

Person friend = lookupFriend(machine_index);

Optimization: Reduce Machine Jumps

Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch these jumps—e.g., if 5 of my friends live on one machine, I should look them up all at once.

Optimization: Smart Division of People and Machines

People are much more likely to be friends with people who live in the same country as them. Rather than randomly dividing people up across machines, try to divvy them up by country, city, state, etc. This will reduce the number of jumps.

Question: Breadth First Search usually requires “marking” a node as visited. How do you do that in this case?

Usually, in BFS, we mark a node as visited by setting a flag visited in its node class. Here, we don’t want to do that (there could be multiple searches going on at the same time, so it’s bad to just edit our data). In this case, we could mimic the marking of nodes with a hash table to lookup a node id and whether or not it’s been visited.

Other Follow-Up Questions:

In the real world, servers fail. How does this affect you?

How could you take advantage of caching?

Do you search until the end of the graph (infinite)? How do you decide when to give up?

In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing?

The following code demonstrates our algorithm:

public class Server { public ArrayList<Machine> machineList = new ArrayList<Machine>() ; //private int MachineNumber ; } package com.vaani.ood.SocialNetwork; import java.util.ArrayList; public class Machine { public ArrayList<Person> personList = new ArrayList<Person>() ; public int MachineID ; public int machineID; public Machine(int id) { this.MachineID = id ; } public void setMachineID(int id) { this.MachineID = id ; } public boolean LookUpPerson(Person p) { return this.MachineID == p.getItsMachineID() ; // look up if the person in this machine } public void addToMachine(Person p) { personList.add(p); } public ArrayList<Person> getPersonListFromMachine() { return personList ; } } public class Person { private String info ; // person information private int ID ; private ArrayList<Integer> friendList ; // use Friend ID Friends List private int MachineID ; // person store in which machine private Server server = new Server() ; public Person(String inform, int PID , int machineID ) { this.info = inform ; this.ID = PID ; this.MachineID = machineID ; } public void setPeronInform(String inform) { this.info = inform ; } public void setPersonID(int ID) { this.ID = ID ; } public void addToFreindList(int FreindID) { friendList.add(FreindID) ; } public int getPersonID () { return ID ; } public String getPersonInform() { return info ; } public int getItsMachineID() { return MachineID ; } public int[] getFriendList() // get friendID from the freindList { int[] temp = new int[friendList.size()] ; for(int i = 0 ; i <friendList.size(); i ++ ) { temp[i] = friendList.get(i) ; } return temp ; } public Person LookUpFriends(int ID, int machineID) // give a userID, machineID look for friend { for(Machine m : server.machineList) { if (m.machineID == machineID) { for(Person p : m.personList) { if (p.ID== ID) return p; } } } return null ; } public Machine LookUpMachine(int machineID) // given a machineID to look for machine { for(Machine m : server.machineList) { if (m.machineID == machineID) return m ; } return null ; } }

The complete code can be downloaded from github.Thanks.

## 0 comments:

## Post a Comment