Is it possible to count the number of distinct substrings in a string in O(n)?
You can use Ukkonen's algorithm to build a suffix tree in linear time:
https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm
The number of distinct substrings of s is then the number of non-empty prefixes of strings in the trie, which you can calculate in linear time. It's just the total number of characters in all nodes.
For instance, your example produces a suffix tree like:
  /\
 b  a
 |  b
 b  b
5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.
More precisely:
- Every substring is the prefix of some suffix of the string;
- All the suffixes are in the trie;
- So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
- There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
  - each distinct non-empty path ends at a distinct position after its last letter; and
  - the path to the position following each letter is unique.
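To make the letters-equal-substrings argument concrete, here is a minimal sketch that builds a plain (uncompressed) suffix trie and counts its nodes. It is not Ukkonen's algorithm and it takes O(n^2) time and space, so it only illustrates the counting argument; the function name is made up for this example.

```python
def count_distinct_substrings_trie(s: str) -> int:
    """Count distinct non-empty substrings via an uncompressed suffix trie.

    Illustration only: quadratic time and space, not a linear-time method.
    """
    root: dict = {}
    letters = 0  # one trie node (one letter) per distinct substring
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            if ch not in node:
                # A new node means a path, and so a substring, not seen before.
                node[ch] = {}
                letters += 1
            node = node[ch]
    return letters


print(count_distinct_substrings_trie("abb"))  # 5: abb, ab, a, bb, b
```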
NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:
There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the original string, so the node that contains "abb" doesn't hold "abb", it holds (0,3). That's 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.
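As a rough sketch of that representation, the code below stores each node's label as a (start, end) pair into the original string and counts substrings by summing end - start over all nodes. To keep it short it inserts suffixes one at a time, which is O(n^2) in the worst case; it is not Ukkonen's algorithm, and the class and function names are invented for this example. It also uses no terminator character, so a suffix may end in the middle of a label, which doesn't affect the count.

```python
class Node:
    def __init__(self, start: int = 0, end: int = 0):
        # The label of this node is s[start:end]; only two integers are stored.
        self.start = start
        self.end = end
        self.children: dict[str, "Node"] = {}


def build_suffix_tree_naive(s: str) -> Node:
    """Insert every suffix into a compressed tree (naive, not linear time)."""
    root = Node()
    n = len(s)
    for i in range(n):
        node, j = root, i  # s[j:] is the part of the suffix still to insert
        while j < n:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n)  # new leaf for the rest
                break
            k = child.start
            while k < child.end and j < n and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:
                node = child  # consumed the whole label, keep descending
                continue
            if j == n:
                break  # suffix ends inside the label: already represented
            # Mismatch inside the label: split the node at position k.
            mid = Node(child.start, k)
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n)
            break
    return root


def count_distinct_substrings_tree(s: str) -> int:
    """Sum the label lengths of all nodes; the root contributes 0."""
    total, stack = 0, [build_suffix_tree_naive(s)]
    while stack:
        node = stack.pop()
        total += node.end - node.start
        stack.extend(node.children.values())
    return total


print(count_distinct_substrings_tree("abb"))  # 5
```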
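Construct the suffix array and its LCP array, then subtract the sum of the LCP array from the total number of substrings, n(n+1)/2. Each LCP value counts the prefixes a suffix shares with its predecessor in sorted order, which are exactly the substrings that would otherwise be counted twice.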
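A minimal sketch of this approach, assuming a naive suffix array built by sorting the suffixes (which is not linear; you would swap in a linear-time construction to get O(n) overall) followed by Kasai's algorithm for the LCP array. The function name is made up for this example.

```python
def count_distinct_substrings_lcp(s: str) -> int:
    n = len(s)
    # Naive suffix array: sort suffix start positions by the suffix text.
    # This step is the bottleneck here; it is for illustration only.
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for pos, suffix_start in enumerate(sa):
        rank[suffix_start] = pos

    # Kasai's algorithm: lcp[pos] is the longest common prefix of the
    # suffixes at sa[pos - 1] and sa[pos]; lcp[0] stays 0.
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters

    # All substrings minus those already seen in the preceding suffix.
    return n * (n + 1) // 2 - sum(lcp)


print(count_distinct_substrings_lcp("abb"))  # 6 - 1 = 5
```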