Monday, April 26, 2010

Missing integers

If you have a cheap mainstream commodity hardware consisting of a decent processor and 1GB RAM, how would you find a few (more than one) missing integers in a list saved on disk containing all of the unique 2^32 integers?

One approach is to use an array of boolean flags and marking them as true for present integers (index of the array denotes flag for the integer). Traversing the flags array again would give the missing integers.

However, 2^32 integers each of 4B require 16GB of space alone. A boolean array for 2^32 locations each of 1B would be 4GB. We can deal with the integers list by reading manageable chunks as we need it in memory only to set the flags i.e. we do not need it completely in memory. But, if we cannot save the flags completely in memory, we have to divide it to manageable parts and make numerous passes through the list which is tedious.

The trick for managing the flags array would be to use a bit as a flag and do bit wise operations for setting and getting the bits. Using bits as flags for 2^32 locations would require only 512MB (8 times less than boolean array).

For setting the nth bit in int x, we OR x with 2^n.


set bit 6
76543210
00001000 = x
01000000 = 2^6
01001000 = x | 2^6 //new x

setBit(x, pos)
    x = x | 2^pos


Unsetting the nth bit in x, we AND x with an operand with only the nth bit unset.


unsetBit(x, pos)
    x = x & (2^8 -1 - 2^pos)

unset bit 6
76543210
01001000 = x
11111111 = 2^8 -1 // all 1s
01000000 = 2^6
10111111 = (2^8 -1 - 2^6) //operand with nth bit unset
00001000 = x & (2^8 -1 - 2^6) //new x


One thing to note is that, ORing anything lesser than int would implicitly promote the operands to int. On using char (2B), which can be treated as unsigned, too many implicit conversions might result in an Out of Memory exception. 

For any number k and m bits for datatype used as flag (for signed or unsigned int, m = 31 or 32), appropriate location in the flags array is k/m and the appropriate bit within that location is k%m. I used m = 31 as Java does not have unsigned int.

Algorithm:

  • Read manageable chunks of integers from the list 
  • Populate the flags array using setbit() method for the read chunks
  • While traversing the flags array, traverse each of the bits of the flag to find the missing integers (see below)

int m = 31 // bits per flag i.e. signed int
for (int i = 0; i < flags.length; i++):
    for (j = m-1; j >= 0; j--):
        if flags[i] & 2^j == 0:
            print i*m +j //missing


Note: In Java, we have to set the max / min Java heap size (-Xmx) at runtime and since array indexes are positive integers only, we can have an array of maximum size Integer.MAX_VALUE i.e. 2^31 -1 (this can different for other languages).

0 comments:

Post a Comment