## Tuesday, April 1, 2014

### Generate an integer which is not in a file with 4 billion integers, limited memory

Problem
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
What if you have only 10 MB of memory?
Solution
Case 1 - If very large integers are allowed then one can generate a number that is statistically likely to be unique in O(1) time. For example, a psuedo random 128 bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.

Case 2 - If integers are limited to 32 bits

In Java, an integer is 32-bit.
The size of the file is 4×109×4 bytes = 16 GB. 4 billion integers would quickly eat up 16 GB of memory. Therefore we cannot put them all in the main memory. We can instead use a boolean array or bit array.

Solutions for case 2

Method 1 - Using Boolean array
As 1 Boolean takes 1 byte, we will have 4 Billion bytes.Each cell in position i of the array represent whether number i is present in the file or not. Then when generating new numbers, we just need to do a search through the array to find a spot whose value is false, meaning that number is absent from the file. However, in Java, the maximum length of an array is around 2^32. So we can make a 2D array with indexing to do this.
Java code :
```public static void generateNumber(String fileName) {
final int SIZE = Integer.MAX_VALUE / 2;
boolean[][] array = new boolean[SIZE];
try {
DataInputStream in = new DataInputStream(
new FileInputStream(fileName);
String line = new String();

long number = Long.parseLong(line);
int which = (int) (number / SIZE);
int pos = (int) (number % SIZE + which * SIZE);
array[which][pos] = true;
}
} catch (IOException e) {
e.printStackTrace();
}

for (int i = 0; i < array.length; ++i) {
for (int j = 0; j < array[i].length; ++j)
if (!array[i][j]) {
System.out.println(i * SIZE + j);
break;
}
}
}
```
However this method will not able to deal with when we have only 10MB of memory. This is where bit vectors come into picture.

Method 2 - Using bit vectors

A detailed discussion on this problem has been discussed in Jon Bentley "Column 1. Cracking the Oyster" Programming Pearls Addison-Wesley pp.3-10
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :) Coming to the problem, There are a total of 2^32, or 4 billion, distinct integers possible. .

4 billion numbers can be represented in :
``````4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
= 466 MB ``````≈ 500 MB
``````

We have 1 GB of memory, or 8 billion bits.Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:

• Create a bit vector (BV) of size 4 billion.
• Initialize BV with all 0’s.
• Scan all numbers (num) from the file and write BV[num] = 1;
• Now scan again BV from 0th index
• Return the first index which has 0 value.

```public static void findOpenNumber2(String fileName)
throws FileNotFoundException {
byte[] bitfield = new byte[0xFFFFFFF / 8];
Scanner in = new Scanner(new FileReader(fileName));
while (in.hasNextInt()) {
int n = in.nextInt();

// Finds the corresponding number in the bitfield by using the
// OR operator to set the nth bit of a byte (e.g.. 10 would
// correspond to the 2nd bit of index 2 in the byte array).

bitfield[n / 8] |= 1 << (n % 8);
}

for (int i = 0; i < bitfield.length; i++) {
for (int j = 0; j < 8; j++) {

// Retrieves the individual bits of each byte. When 0 bit
// is found, finds the corresponding value.

if ((bitfield[i] & (1 << j)) == 0) {
System.out.println(i * 8 + j);
return;
}
}
}
}
```

Follow Up: What if we have only 10 MB memory?

Method 1 - Dividing in blocks
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory.

To take a very simple example, if only 1 byte (i.e memory to handle 8 numbers ) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-22 and so on and handle this range in each of `32/8 = 4` passes.

Here it’s possible to find a missing integer with just two passes of the data set. We can divide up the integers into blocks of some size (we’ll discuss how to decide on a size later). Let’s just assume that we divide up the integers into blocks of 1000. So, block 0 represents the numbers 0 through 999, block 1 represents blocks 1000 - 1999, etc. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know that that is in block 0, we increment counter. If we see 1425, we know that that is in block 1, so we increment counter.

At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know which number (or numbers) is missing.

Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array with 2^12 block counters and a bit vector in 2^17 bytes. Both of these can comfortably fit in 10*2^20 bytes.
What’s the smallest footprint? When the array of block counters occupies the same memory as the bit vector. Let N = 2^32.

counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4

It’s possible to find a missing integer with just under 65KB or 65536 bytes(or, more exactly, sqrt(2)*2^15 bytes).
So, here is a basic algo:

1. Loop1 on blocks : Loop through all blocks of integer of size N, and create a block count set
countSet.add(blockNumber, number of integers in the block)
and select a block where size of block is less than N, we call it MissingIntegerBlock.
2. Loop2 on missing integer block : Iterate over MissingIntegerBlock, and select the integer missing, and we found our answer.
Java code

```int bitsize = 1048576; // 2^20 bits (2^17 bytes)
int blockNum = 4096; // 2^12
byte[] bitfield = new byte[bitsize / 8];
int[] blocks = new int[blockNum];

void findOpenNumber(String fileName) throws FileNotFoundException {
int starting = -1;
Scanner in = new Scanner(new FileReader(fileName));
while (in.hasNextInt()) {
int n = in.nextInt();
blocks[n / (bitfield.length * 8)]++;
}

for (int i = 0; i < blocks.length; i++) {
if (blocks[i] < bitfield.length * 8) {
// if value < 2^20, then at least 1 number is missing in
// that section.
starting = i * bitfield.length * 8;
break;
}
}

while (in.hasNextInt()) {
int n = in.nextInt();
// If the number is inside the block that’s missing numbers,
// we record it
if (n >= starting && n < starting + bitfield.length * 8) {
bitfield[(n - starting) / 8] |= 1 << ((n - starting) % 8);
}
}

for (int i = 0; i < bitfield.length; i++) {
for (int j = 0; j < 8; j++) {
// Retrieves the individual bits of each byte. When 0 bit
// is found, finds the corresponding value.
if ((bitfield[i] & (1 << j)) == 0) {
System.out.println(i * 8 + j + starting);
return;
}
}
}
}
```

Method 2 (for follow up question) - Divide and conquer based on counts
This can be solved in very little space using a variant of binary search.
1. Start off with the allowed range of numbers, `0` to `4294967295`.
2. Calculate the midpoint.
3. Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
4. If no numbers were equal, you're done. The midpoint number is the answer.
5. Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.

References
http://tianrunhe.wordpress.com/2012/04/08/generate-an-integer-which-is-not-in-a-file-with-4-billion-integers-limited-memory/
http://stackoverflow.com/questions/7153659/find-an-integer-not-among-four-billion-given-ones