Suffix Trees

Rahul Pandey | February 14, 2017

The suffix tree, invented in 1973, is probably the most powerful and versatile data structure for string processing. The suffix tree represents all the suffixes of a string S in O(n) space and can be built in O(n) time, where n is the length of the string.

Contents

Summary

A suffix tree is a compressed trie containing all the suffixes of a given text as keys, and positions in the text as their values. We’ll unpack what that means, but suffix trees enable particularly fast implementations of many important string operations, such as:

The tradeoff for this power is that the storage for a string’s suffix tree requires several multiples more space than storing the string itself.

Suffix Tries

Let’s start by talking about a suffix trie, which isn’t a very practical data structure but it helps us better understand the tree in the next section. The suffix trie is the smallest tree such that:

Here is the suffix trie for the string abaaba. The symbol $ indicates the end of the word.

Suffix trie

Take, for example, the suffix ba. If we trace its path in the tree, we find two branches at that node, one with a $ (indicating that this is a suffix), and another branch which indicates that ba is a substring elsewhere in the string.

Try answering these questions about the suffix trie:

How do we count the number of times a string S occurs as a substring of T? Follow the path corresponding to S (answer is 0 if we fall off the tree) to the node n, and the answer is the # of leaf nodes in the subtree rooted at n. Try this out with the aba and you should see the answer is 2.


How do we find the longest repeated substring of T? The depth of the tree represents the length of the substring, so we want to find the node of highest depth in T which has more than 1 child (indicating that it is repeated). In this case, the answer is aba, since it is the only substring corresponding to a node of depth 3 with 2 children.


The downside of a suffix trie is that the amount of space required to store the trie grows quadratically with respect to the length of the input string. So for even relatively small strings, say 500 characters, the number of nodes required in the trie could be more than 100K. This leads us to a suffix tree, which compresses the tree.

Suffix Trees v1

The suffix tree is very similar to the trie, except we coalesce non-branching paths into a single edge with a string label (so we have strings on edges instead of characters). This reduces the number of nodes and edges in the tree, and guarantees that internal nodes have > 1 child.

So now our string abaaba has this suffix tree:

Suffix tree1

Some observations about the number of nodes in the trie:

However, if you add the lengths of all the strings on the edges, the storage space is still O(n2) edges, so we don’t have a linear space construction yet.

Suffix Trees v2

Instead of having strings labelling the edges, we simply store a pair of numbers: the start and end index for where the substring first occurs in the string. The leaves will store the offset of the suffix that the leaf corresponds to– each leaf will thus have a label from 1 to n of which suffix in the tree it represents. So the final suffix tree looks like this:

Suffix tree2

Formally, the suffix tree for a string S of length n is defined as a tree such that:

Building

We don’t want to build the suffix tree how we described it, since building a suffix trie would need at least O(n2) time and space for each of the O(n2) nodes.

We can improve on this by building up the suffix tree by walking forward through the string. We build a single-edge tree representing only the longest suffix, then augment to include the 2nd longest, then augment to include the 3rd longest, and so on. While this is linear in terms of space usage, this approach is still O(n2) time.

There is a remarkable algorithm called Ukkonen’s algorithm which can construct the suffix tree in O(n) time and space! We won’t discuss it here (it has lots of casework), but check out the reference if you want to learn more about it. Furthermore, Ukkonen’s algorithm provides an “online” construction, which means it will construct the suffix tree as you provide it the characters of your string one character at a time.

Generalized Suffix Trees

Generalized suffix trees are when we build a suffix tree over multiple strings composed together. This is how we actually solve the longest common substring problem in linear time. If we want to find the longest common substring of two string X and Y, we generate the suffix tree for X#Y$, where # and $ are the two special end of string characters. Once we have built up the suffix tree (using Ukkonen’s Algorithm), we need to find the deepest node which has leaves corresponding to indices in both string X and Y (which we’d know from the lengths of the strings). This operation requires a depth-first search, which is still linear in time and space.

You can see the power and versatility that emerge from this construction!

Real World Application

Suffix trees are ideal anytime we can afford to spend a lot of time pre-processing the text, but we want to answer queries about the text very quickly.

One important use for suffix trees is in bioinformatics, where we often want to search for patterns in DNA or protein sequences (which can be viewed as strings of characters). In these applications, we generally want to find a short string (called the pattern) among a much larger string (simply called the string), either with exact or inexact matching.

Another use of suffix trees is to find the longest common substring among a set of strings, or even longest common subsequence (that requires a bit more finagling). They are also commonly used to identify repetitive structures, as is often the case in human genomes. In fact, suffix trees are used in some data compression algorithms because of their ability to identify repeated data.

Sources