Fast string comparison in C

If I understand your question correctly, you need to check whether a string occurs among all the lines read so far. I would propose building a trie, or even better a Patricia tree, from the file lines. That way, instead of scanning every line, you can check in time proportional to the length of the query string whether it is present (and, with a little more effort, where).
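Here is a minimal sketch of the trie idea (my own illustration, not code from the question): insert every line once, and a lookup then walks one node per character, so its cost depends only on the length of the query string, not on the number of lines read. A Patricia tree compresses single-child chains but offers the same interface.

    /* Minimal trie sketch: one child slot per possible byte value.
       Error handling for out-of-memory is omitted for brevity. */
    #include <stdlib.h>

    typedef struct trie_node {
        struct trie_node *child[256];   /* one slot per byte value */
        int is_end;                     /* non-zero if a stored line ends here */
    } trie_node;

    static trie_node *trie_new(void) {
        return calloc(1, sizeof(trie_node));
    }

    static void trie_insert(trie_node *root, const char *s) {
        for (; *s; ++s) {
            unsigned char c = (unsigned char)*s;
            if (!root->child[c])
                root->child[c] = trie_new();
            root = root->child[c];
        }
        root->is_end = 1;
    }

    static int trie_contains(const trie_node *root, const char *s) {
        for (; *s; ++s) {
            root = root->child[(unsigned char)*s];
            if (!root)
                return 0;               /* walked off the trie: not present */
        }
        return root->is_end;
    }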


Optimization of Computer Programs in C

You can save a little time by checking the first characters of the strings in question before doing the call. Obviously, if the first characters differ, there's no reason to call strcmp to check the rest. Because of the non-uniform distribution of letters in natural languages, the payoff is not 26:1 but more like 15:1 for uppercase data.

#define QUICKIE_STRCMP(a, b)  (*(a) != *(b) ? \
  (int) ((unsigned char) *(a) - \
         (unsigned char) *(b)) : \
  strcmp((a), (b)))
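For example, dropped into a linear scan the macro only falls back to the library call when the first bytes already match (illustrative usage of my own; it assumes <string.h> has been included for strcmp and that the lines read so far are held in an array):

    /* 'lines' and 'nlines' are assumed to hold the lines read so far. */
    int find_line(char **lines, int nlines, const char *needle)
    {
        for (int i = 0; i < nlines; ++i)
            if (QUICKIE_STRCMP(lines[i], needle) == 0)
                return i;              /* index of the matching line */
        return -1;                     /* not found */
    }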

If the dictionary of words you are using is well defined (meaning you don't need the ordering information in strcmp's return value, only 0 == equal), for example a set of command-line arguments that share the same prefix, e.g. tcp-accept and tcp-reject, then you can rewrite the macro and use some pointer arithmetic to compare not the 1st but the Nth character -- in this case the character at offset 4, just past the common "tcp-" prefix:

   #define QUICKIE_STRCMP(a, b, offset) \
            (*((a)+(offset)) != *((b)+(offset)) ? -1 : strcmp((a), (b)))
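A hypothetical use with the tcp-accept/tcp-reject example from above (my own illustration; `arg` must already be known to be at least 4 characters long so that reading offset 4 is safe, and only the 0 result, "equal", is meaningful):

    /* Every option shares the 4-byte "tcp-" prefix, so offset 4 is the
       first byte that can differ. */
    if (QUICKIE_STRCMP("tcp-accept", arg, 4) == 0) {
        /* handle tcp-accept */
    } else if (QUICKIE_STRCMP("tcp-reject", arg, 4) == 0) {
        /* handle tcp-reject */
    }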

strcmp is usually optimized by all vendors. However, if you're not satisfied with this you can try:

  • Lookup Burst Tries
  • Use a suffix tree for fast string comparison -- see this article
  • Depending on the size of the strings in your application you can write a custom string comparator. E.g., GNU libc used to have an optimization for small strings where strings smaller than five bytes were compared as integers; MS cl also has some small-string optimizations (do look it up). A rough sketch of the small-string idea follows this list.
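To make the "compare short strings as machine words" idea concrete, here is a rough equality-only sketch (my own illustration, not the glibc or MSVC code; it assumes both strings are already known to be at most 7 characters long so they fit, terminator included, in a 64-bit word):

    #include <stdint.h>
    #include <string.h>

    /* Equality test only (not an ordering): copy each short string,
       terminator included, into a zero-padded 64-bit word and compare
       the two words in one operation. */
    static int tiny_str_equal(const char *a, const char *b) {
        uint64_t ia = 0, ib = 0;            /* zero padding past the NUL */
        memcpy(&ia, a, strlen(a) + 1);      /* at most 8 bytes by assumption */
        memcpy(&ib, b, strlen(b) + 1);
        return ia == ib;
    }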

But more importantly make sure strcmp is your real bottleneck.


I can assure you, the function strcmp is ABSOLUTELY NOT the bottleneck. Typically, strcmp is well optimized and can do 32 or 64 bit comparisons for strings longer than 4/8 bytes depending on architecture. Both newlib and GNU libc do this. But even if you were to look at each byte in both strings 20 times, it doesn't matter as much as the algo & data structure choices made here.

The real bottleneck is the O(N) search algorithm. A single O(N log N) pass over the file could be used to build an appropriate data structure (whether it's a normal BST, a trie, or just a simple sorted array) for doing O(log N) lookups.
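As a concrete sketch of that single pass (illustrative code of my own, reading from stdin with fixed-size line buffers): sort the lines once, then every membership test is a binary search.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp_lines(const void *a, const void *b) {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void) {
        char buf[1024];
        char **lines = NULL;
        size_t n = 0, cap = 0;

        /* O(N): read every line once */
        while (fgets(buf, sizeof buf, stdin)) {
            buf[strcspn(buf, "\n")] = '\0';
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                lines = realloc(lines, cap * sizeof *lines);
            }
            lines[n] = malloc(strlen(buf) + 1);
            strcpy(lines[n++], buf);
        }

        /* O(N log N): sort once */
        qsort(lines, n, sizeof *lines, cmp_lines);

        /* O(log N) per query from now on */
        const char *needle = "some string";       /* hypothetical query */
        char **hit = bsearch(&needle, lines, n, sizeof *lines, cmp_lines);
        printf("%s\n", hit ? "found" : "not found");
        return 0;
    }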

Bear with me here--a lot of math follows. But I think this is a good opportunity to illustrate why the choice of algorithm & data structure is sometimes FAR more important than the method of string comparison. Steve touches on this, but I wanted to explain it in a little more depth.

With N=1e6, log(1e6, 2) = 19.9, so round up to 20 comparisons on an ideal data structure.

Currently you're doing a worst-case search of O(N), or 1e6 operations.

So say you just build a red-black tree with O(log N) insertion time, and you insert N items, that's O(N log N) time to build the tree. So that's 1e6 x 20 or 20e6 operations necessary to build your tree.

In your current approach, building the data structure is O(N), or 1e6 operations, but your worst case search time is O(N) as well. So by the time you read the file and do just 20 search operations, you're up to a theoretical worst case of 21,000,000 operations. By comparison, your worst case with a red-black tree and 20 searches is 20,000,400 operations, or 999,600 operations BETTER than the O(N) search on an unsorted array. So at 20 searches, you're at the first point where a more sophisticated data structure really pays off. But look at what happens at 1000 searches:

Unsorted array = initialization + 1000 x search time = O(N) + 1000 * O(N) = 1,000,000 + 1,000,000,000 = 1,001,000,000 operations.

Red-black = initialization + 1000 x search time = O(N log N) + 1000 * O(log N) = 20,000,000 + 20,000 = 20,020,000 operations.

1,001,000,000 / 20,020,000 ~= 50x as many operations for the O(N) search.

At 1e6 searches, that's (1e6 + 1e6 * 1e6) / (20e6 + 1e6 * 20 ) = 25,000x as many operations.

Assume your computer can handle the 40e6 'operations' it takes to do the log N searches in 1 minute. It would take 25,000 minutes, or 17 DAYS, to do the same work with your current algorithm. Or another way to look at it is that the O(N) search algorithm can only handle 39 searches in the time the O(log N) algorithm can do 1,000,000. And the more searches you do, the uglier it gets.
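To make the arithmetic easy to play with, here is a tiny back-of-the-envelope calculator (my own illustration, using the same idealized operation counts as above: N = 1e6 lines, ~20 = log2(N) operations per tree search, N per linear search):

    #include <stdio.h>

    int main(void) {
        const double N = 1e6, LOGN = 20.0;
        const double searches[] = { 20, 1000, 1e6 };

        for (int i = 0; i < 3; ++i) {
            double k = searches[i];
            double linear = N + k * N;           /* O(N) build + k * O(N) searches  */
            double tree   = N * LOGN + k * LOGN; /* O(N log N) build + k * O(log N) */
            printf("%10.0f searches: linear %.3e ops, tree %.3e ops, ratio %.0fx\n",
                   k, linear, tree, linear / tree);
        }
        return 0;
    }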

See responses from Steve and dirkgently for several better choices of data structures & algorithms. My only additional caution would be that the qsort() suggested by Steve might have a worst-case complexity of O(N*N), which is far, far worse than the O(N log N) you get with a heapsort or various tree-like structures.

Tags: C, String, Compare