Ukkonen's algorithm for Generalized Suffix Trees
TL;DR
EDIT (03/2019): I have reworked my implementation to use C++17 string_view
to represent my substrings along with a caching mechanism that makes sure reference strings do not move around. The updated version can be found on github: https://github.com/Rerito/suffix-tree-v2. Here is the github for my old implementation (in C++) for the curious minds. Oh and the new take also includes tests!
The original algorithm doesn't really need to be modified in order to build a Generalized Suffix Tree.
Detailed analysis
The hunch I got was the right way. In order to keep up with the tricks used in the original algorithm, we indeed need to add a reference to the original string. Moreover, the algorithm is online, which means you can add strings on-the-fly to the tree.
Suppose we have a Generalized Suffix Tree GST(N) for strings (S1, ..., SN). The problem at hand here is how to handle the building process of GST(N+1), using GST(N).
Tweaking the data model
In the simple case (single suffix tree), each transition is a couple (substring, end vertex). The trick in Ukkonen's algorithm is to model a substring using a pair of pointers to the appropriate positions in the original string. Here, we need to also link such a substring to its "parent" string. To do so:
- Store the original strings in a hash table, giving them a unique integer key.
- A substring becomes now: (ID, (left pointer, right pointer)). We can thus use ID to fetch the original string in O(1).
We call this a mapped substring. The C++ typedefs I use are the ones found in my original question:
// This is a very basic draft of the Node class used
template <typename C>
class Node {
typedef std::pair<int, int> substring;
typedef std::pair<int, substring> mapped_substring;
typedef std::pair<mapped_substring, Node*> transition;
// C is the character type (basically `char`)
std::unordered_map<C, transition> g; // Called g just like in the article :)
Node *suffix_link;
};
As you will see, we will keep the reference pair concept as well. This time, the reference pair, just like the transition, will hold a mapped substring.
Note: As in C++, strings indexes will start at 0.
Inserting the new string
We want to insert SN+1 into GST(N).
GST(N) may have already a whole lot of nodes and transitions. In a simple tree, we would only have the root and the special sink node. Here, we may have transitions for SN+1 that have already been added through the insertion of some previous strings. The first thing to do is to walk down the tree through the transitions as long as it matches SN+1.
By doing so, we end in a state r. This state may be explicit (i.e. we ended right on a vertex) or implicit (a mismatch occurred in the middle of a transition). We use the same concept as in the original algorithm to model such a state: a reference pair. Quick example:
- We want to insert SN+1 =
banana
- A node s representing
ba
explicitly exists in GST(N) - s has a transition t for the substring
nal
When we walk down the tree, we end up in the transition t at the character l
which is a mismatch. Thus, the implicit state r we get is represented by the reference pair (s, m) where m is the mapped substring (N+1, (1,3)).
Here, r is the active point for the 5-th iteration of the algorithm in the building of banana
's suffix tree. The fact we got to that state means precisely that the tree for bana
is already built in GST(N).
In this example, we resume the algorithm at the 5-th iteration, to build the suffix tree for banan
using the tree for bana
. Not to lose the generality, we will state that r = (s, (N+1, (k, i-1)), i being the index of the first mismatch. We have indeed k ≤ i (the egality is synonym of r being an explicit state).
Property: We can resume Ukkonen's algorithm to build GST(N) at iteration i (insertion of the character at index i in SN+1). The active point for this iteration is the state r we got by walking down the tree. The only tweaking needed is some fetching operations to resolve the substrings.
Proof for the property
First, the presence of such a state r implies that the whole states for the intermediate tree T(N+1)i-1 are also there. So all is set up and we resume the algorithm.
We need to prove that each procedure in the algorithm remains valid. There are 3 such subroutines:
test_and_split
: given the character to insert at current iteration, test wether we need to split a transition into two separate transitions and if the current point is the end point.canonize
: Given a reference pair (n, m) where n is a vertex and m a mapped substring, returns the couple (n', m') representing the same state such as m' is the shortest possible substring.update
: Update GST(N) so that it has all the states for intermediate tree T(N+1)i at the end of the run.
test_and_split
Input: A vertex s, a mapped substring m = (l, (k, p)) and a character t.
Output: A boolean that tells if the state (s, m) is the end point for the current iteration and a node r representing explicitly (s, m) if it is not the end point.
The simplest case goes first. If the substring is empty (k > p), the state is already represented explicitly. We just have to test if we reached the end point or not. In a GST, just like in a common suffix tree, there is ALWAYS at most one transition per node starting with a given character. Thus, if there is a transition starting with t, we return true (we reached the end point), otherwise false.
Now the hard part, when k ≤ p. We first need to fetch the string Sl lying at index l(*) in the original strings table.
Let (l', (k', p')) (resp. s') be the substring (resp. the node) related to the transition TR of s starting with the character Sl(k) (*). Such a transition exists because (s, (l,(k,p)) represents an (existing) implicit state on the border path of the intermediate tree T(N+1)i-1. Furthermore, we are sure that the p - k first characters on this transition matches.
Do we need to split this transition? That depends on the Δ = p - k + 1-th character on this transition (**). To test this character, we need to fetch the string lying at index l' on the hash table and get the character at index k' + Δ. This character is guaranteed to exist because the state we are examining is implicit and thus ends in the middle of the transition TR (Δ ≤ p' - k').
If the equality holds, we have nothing to do and return true (the end point is here) and do nothing else. If not, then we must split the transition and create a new state r. TR now becomes (l', (k', k' + Δ - 1)) → r. Another transition is created for r: (l', (k' + Δ, p') → s'. We now return false and r.
(*): l is not necessarily equal to N+1. Likewise, l and l' may be different (or equal).
(**): Notice that the number Δ = p - k + 1 does not depend at all on the string chosen as a reference for the mapped substring. It only depends on the implicit state fed to the routine.
Canonize
Input: A node _s_and a mapped substring (l,(k,p)) representing an existing state e in the tree.
Output: A node s' and a mapped substring (l',(k',p')) representing the canonical reference for the state e
Using the same fetching tweaks, we just have to walk down the tree until we have exhausted the character "pool". Here, just like for test_and_split
the unicity of each transition and the fact we have an existing state as input provides us with a valid procedure.
Update
Input: The active point and the index for the current iteration.
Output: The active point for the next iteration.
update
uses both canonize
and test_and_split
which are GST-friendly. The suffix link editing is exactly the same as the one for a common tree. The only thing to notice is that we will create the open transitions (i.e. the transitions leading to nodes) using SN+1 as the reference string. Thus, at iteration i, the transition will always be linked to the mapped substring (N+1,(i,∞))
Final step
We need to "close" the open transitions. To do so, we just traverse them and edit the ∞ away, replacing it with L-1, where L is the length of SN+1. We also need a sentinel character to mark the end of the string. A character we are sure to never meet in any string. This way, the leaves will remain leaves forever.
Conclusion
The extra fetching work adds a few O(1) operations, increasing a little bit the constant factor of the complexity. Though, the asymptotic complexity remains obviously linear with the length of the inserted strings. Thus, building GST(N) from strings (S1,..., SN) with length n1,...,nN is:
c(GST(N)) = Σi=1..N ni