Tuesday, February 4, 2014

Determine if a string has all unique characters

This updated information has been further expanded upon on my new website. You can find the updated details here: https://k5kc.com/cs/algorithms/determine-if-a-string-has-all-unique-characters/.
Problem:
Implement an algorithm to determine if a string has all unique characters.

Solution

Solution1 - My first algorithm is a straightforward, brute force approach. Basically, we take each character in the string and compare it to every other character in the string. We do this using for loops. This would mean a time complexity of O(n^2) because of the nested loop.

bool uniqueChars(string myString)
{
    for (int x = 0; x < myString.length(); x++)
    {
        for (int y = 0; y < myString.length(); y++)
        {
            if (myString[x] == myString[y] && x != y) return false;
        }
    }
    return true;
}

Time complexity - O^2
Space complexity - O(1)

Improvements
This code is very simple, but there are some problems. First, comparisons are repeated. Let's say we have a string of five characters, and the string is "ABCDE". On the first iteration, we will compare A to B, then to C, then to D, then to E. On the second iteration, we will compare B to A, then to C, then to D, then to D. Do you see how the comparison between A and B is repeated on the second iteration?

We can optimize this code a bit by altering the loops so that each character is only compared to every character AFTER it. This would mean that once A is compared to all the other characters, the other characters never compare themselves to A again.
bool uniqueChars(string myString)
{
    for (int x = 0; x < myString.length() - 1; x++)
    {
        for (int y = x + 1; y < myString.length(); y++)
        {
            if (myString[x] == myString[y]) return false;
        }
    }
    return true;
} 

Time complexity - O(n^2)...still bad

Solution 2 - Using the 256 length array assuming the string is an ascii character
  1. Create an int array [0-255], one for each character index, initialised to zero.
  2. Loop through each character in the string and increment the respective array position for that character
  3. If the array position already contains a 1, then that character has already been encountered. Result => Not unique.
  4. If you reach the end of the string with no occurrence of (3), Result => the string is unique.

Here is the function in java
public boolean uniqueChars(String s) {
    boolean[] mask = new boolean[256];
    for (int i = 0; i < s.length(); i++) {
        if (mask[s.charAt(i)])
            return false;
        mask[s.charAt(i)] = true;
    }
    return true;
}

Solution 3 - Sorting the array and then finding unique chars

Here using the sorting algorithm to sort the string first, then compare every pair of neighboring characters to see if they are the same.  The main time consuming are sorting array + neighborhood comparisions. But based on the document I searched online, the sorting function in Java uses merge sort or quicksort, which depends on the data type, also for efficiency, it would switch to insertion sort when fewer than seven array elements are being sorted. So, in general,
Time: O(nlgn), n is the length of string.

public boolean uniqueChars(String s) {
    char[] chars = s.toCharArray();
    Arrays.sort(chars);
    for (int i = 0; i < chars.length - 1; i++) {
        if (chars[i] == chars[i + 1])
            return false;
    }
    return true;
}

What if not use additional data structure?

Solution1 and solution 3 still work here. But in the interview if someone gives you a restricted string which can have only say 'a' and 'z' and asks you to not use any data structure then?

Solution 4
For this solution, the condition is very limited. Here I use a mask to keep the occurrence of characters with corresponding bits. E.g. 'a' occurrence will be stored at bit 0, while 'c' occurrence will be stored at bit 2. So, if mask is 4, that means 'c' has appeared before, and only 'c' has appreared. But the cons of this approach is its limited the input string could only contain continuous and easy handling characters, while the length of these characters could not be larger than log2(MAX_INT or MAX_LONG). Here, the integer value is 32-bit, so we could have this approach.
public static boolean isUniqueChars(String str) {
    int checker = 0;
    for (int i = 0; i < str.length(); ++i) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) > 0) return false;
        checker |= (1 << val);
    }
    return true;
}

Understanding the code here.
int checker is used here as a storage for bits. Every bit in integer value can be treated as a flag, so eventually int is an array of bits (flag). Each bit in your code states whether the character with bit's index was found in string or not. So, lets consider a, it is no 1st bit set, C will have 3rd bit set and so on.
00000000000000000000000000000001 a 2^0
00000000000000000000000000000010 b 2^1
00000000000000000000000000000100 c 2^2
00000000000000000000000000001000 d 2^3
--------------------------------------so on
00000001000000000000000000000000 y 2^24
00000010000000000000000000000000 z 2^25

Initially checker is equal to 0 i.e.
checker=00000000000000000000000000000000

Consider the string ayza
To, begin with we will have a as char,
a= 00000000000000000000000000000001

checker='a' or checker
checker=00000000000000000000000000000001
a and checker=0 no dupes condition

string 'az'
checker=00000000000000000000000000000001
z =00000010000000000000000000000000
z and checker=0 no dupes


checker=z or checker= 00000010000000000000000000000001
string 'azy'
checker=00000010000000000000000000000001
y =00000001000000000000000000000000
checker and y=0 no dupes condition

checker= checker or y =00000011000000000000000000000001
string 'azya'
checker= 00000011000000000000000000000001
a= 00000000000000000000000000000001
a and checker=1 we have a dupe


 You could use bit vector for the same reason instead of int. There are two differences between them:
  • Size. int has fixed size, usually 4 bytes which means 8*4=32 bits (flags). Bit vector usually can be of different size or you should specify the size in constructor.
  • API. With bit vectors you will have easier to read code, probably something like this:
    vector.SetFlag(4, true); // set flag at index 4 as true
    for int you will have lower-level bit logic code:
    checker |= (1 << 5); // set flag at index 5 to true
Also probably int may be a little bit faster, because operations with bits are very low level and can be executed as-is by CPU. BitVector allows writing a little bit less cryptic code instead plus it can store more flags.
For future reference: bit vector is also known as bitSet or bitArray. Here are some links to this data structure for different languages/platforms:

0 comments:

Post a Comment