function to remove duplicate characters in a string

Your code is, I'm sorry to say, very C-like.

A Java String is not a char[]. You say you want to remove duplicates from a String, but you take a char[] instead.

Is this char[] \0-terminated? Doesn't look like it because you take the whole .length of the array. But then your algorithm tries to \0-terminate a portion of the array. What happens if the arrays contains no duplicates?

Well, as it is written, your code actually throws an ArrayIndexOutOfBoundsException on the last line! There is no room for the \0 because all slots are used up!

You can add a check not to add \0 in this exceptional case, but then how are you planning to use this code anyway? Are you planning to have a strlen-like function to find the first \0 in the array? And what happens if there isn't any? (due to all-unique exceptional case above?).

What happens if the original String/char[] contains a \0? (which is perfectly legal in Java, by the way, see JLS 10.9 An Array of Characters is Not a String)

The result will be a mess, and all because you want to do everything C-like, and in place without any additional buffer. Are you sure you really need to do this? Why not work with String, indexOf, lastIndexOf, replace, and all the higher-level API of String? Is it provably too slow, or do you only suspect that it is?

"Premature optimization is the root of all evils". I'm sorry but if you can't even understand what the original code does, then figuring out how it will fit in the bigger (and messier) system will be a nightmare.


My minimal suggestion is to do the following:

  • Make the function takes and returns a String, i.e. public static String removeDuplicates(String in)
  • Internally, works with char[] str = in.toCharArray();
  • Replace the last line by return new String(str, 0, tail);

This does use additional buffers, but at least the interface to the rest of the system is much cleaner.


Alternatively, you can use StringBuilder as such:

static String removeDuplicates(String s) {
    StringBuilder noDupes = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        String si = s.substring(i, i + 1);
        if (noDupes.indexOf(si) == -1) {
            noDupes.append(si);
        }
    }
    return noDupes.toString();
}

Note that this is essentially the same algorithm as what you had, but much cleaner and without as many little corner cases, etc.


The function looks fine to me. I've written inline comments. Hope it helps:

// function takes a char array as input.
// modifies it to remove duplicates and adds a 0 to mark the end
// of the unique chars in the array.
public static void removeDuplicates(char[] str) {
  if (str == null) return; // if the array does not exist..nothing to do return.
  int len = str.length; // get the array length.
  if (len < 2) return; // if its less than 2..can't have duplicates..return.
  int tail = 1; // number of unique char in the array.
  // start at 2nd char and go till the end of the array.
  for (int i = 1; i < len; ++i) { 
    int j;
    // for every char in outer loop check if that char is already seen.
    // char in [0,tail) are all unique.
    for (j = 0; j < tail; ++j) {
      if (str[i] == str[j]) break; // break if we find duplicate.
    }
    // if j reachs tail..we did not break, which implies this char at pos i
    // is not a duplicate. So we need to add it our "unique char list"
    // we add it to the end, that is at pos tail.
    if (j == tail) {
      str[tail] = str[i]; // add
      ++tail; // increment tail...[0,tail) is still "unique char list"
    }
  }
  str[tail] = 0; // add a 0 at the end to mark the end of the unique char.
}

Tags:

Java

String