C++ string::find complexity
Why the c++'s implemented string::substr() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)?
I assume you mean find()
, rather than substr()
which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).
The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string
operations are that size()
, max_size()
, operator[]
, swap()
, c_str()
and data()
are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.
The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.
Is that corrected in c++0x?
No, C++11 doesn't add any complexity requirements to std::string
, and certainly doesn't add any mandatory implementation details.
If the complexity of current substr is not O(N * M), what is that?
That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N)
. So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.
Where do you get the impression from that std::string::substr()
doesn't use a linear algorithm? In fact, I can't even imagine how to implement in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you are think this function does something else than it does? std::string::substr()
just creates a new string starting at its first argument and using either the number of characters specified by the second parameter or the characters up to the end of the string.
You may be referring to std::string::find()
which doesn't have any complexity requirements or std::search()
which is indeed allowed to do O(n * m) comparisons. However, this is a giving implementers the freedom to choose between an algorithm which has the best theoretical complexity vs. one which doesn't doesn't need additional memory. Since allocation of arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.
FYI, The string::find in both gcc/libstdc++ and llvm/libcxx were very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementation:
GCC: PR66414 optimize std::string::find https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65
LLVM: https://reviews.llvm.org/D27068
The new algorithm is simpler and uses hand optimized assembly functions of memchr, and memcmp.