Saturday, January 18, 2014

Remove duplicate characters in a string

This updated information has been further expanded upon on my new website. You can find the updated details here: https://k5kc.com/cs/algorithms/remove-duplicate-characters-in-a-string/.

Problem

Given a string, Write a program to remove duplcate characters from the string.

Input String : kodeknight
Output String : kodenight

Solution

Method 1 : Brute force
void removeDuplicates(char *str)
{   
    if (!str)
        return;    
    int len = strlen(str);
    if (len < 2) 
        return;
     
    int tail = 1;   
    for (int i = 1; i < len; ++i)
    {       
        int j;      
        for (j = 0; j < tail; ++j)           
            if (str[i] == str[j])               
                break;
         
        if (j == tail) 
        {           
            str[tail] = str[i];
            ++tail;
        }   
    }   
    str[tail] = '\0';
}

Time Complexity : O(N^2)


Method 2 : Using sorting
  1. Sort the elements.
  2. Now in a loop, remove duplicates by comparing the current character with previous character.
  3. Remove extra characters at the end of the resultant string.
Note that, this method doesn’t keep the original order of the input string.
Example
  1. Sort the elements - deghikknot
  2. Remove dupes, deghiknott
  3. Remove extra chars , here last t is extra, we have to simply replace it
Time complexity - O(n logn)

Method 3 - Using hash table
1. For each character, check if it is duplicate of already found characters in the hash array of 256 characters, assuming ASCII string
2. Skip duplicate characters and update the non duplicate characters.


void removeDuplicates(char *str)
{
     
    if (!str)           
        return;     
    int len = strlen(str);      
    if (len < 2)         
        return;     
    bool hit[256];      
    for(int i = 0; i < 256; ++i)         
        hit[i] = false;
         
    hit[str[0]] = true;
    int tail = 1;       
    for (int i = 1; i < len; ++i)
    {           
        if (!hit[str[i]])
        {
            str[tail] = str[i];             
            ++tail; 
            hit[str[i]] = true;
        }
    }
    str[tail] = '\0';
}

Time Complexity : O(N)



Putting the constraint of space and ordering
Suppose the problem becomes:
Write code to remove the duplicate characters in a string without using any additional buffer. NOTE: One or two additional variables are fine. An extra copy of the array is not.

Then what solution do we have?Method 2 cannot be used as we want to keep the ordering same.
Method 3 cannot be used, as it uses hashtable, which uses extra spaces.
Well, nothing is stopping us from using method 1, but it is brute force, and can we do better?

Method 4 - Special case of chars being a...z
The answer is yes, if the interviewer confirms, that your string is a constrained by say it can have only small case integers.
Since one or two additional variables are fine but no buffer is allowed, you can simulate the behaviour of a hashmap by using an integer to store bits instead.
This does not work for normal cases. It will only work from a..z case sensitive. No spaces supported. This would work miracles in a 256bit system to process the entire ASCII range. But it does run in O(N). I still +1 this one.

This simple solution runs at O(n), which is faster than yours. Also, it isn't conceptually complicated and in-place :

public static void removeDuplicates(char[] str) {
    int map = 0;
    for (int i = 0; i < str.length; i++) {
        if ((map & (1 << (str[i] - 'a'))) > 0) // duplicate detected
            str[i] = 0;
        else // add unique char as a bit '1' to the map
            map |= 1 << (str[i] - 'a');
    }
}

The drawback is that the duplicates (which are replaced with 0's) will not be placed at the end of the str[] array. However, this can easily be fixed by looping through the array one last time. Also, an integer has the capacity for only regular letters.

References - Stackoverflow

0 comments:

Post a Comment