Monday, February 18, 2013

ICU is cool

I have always disliked iconv. Both iconv and icu are designed to convert text in various character encodings into other encodings. iconv is normally installed on your computer (in OSX or Linux etc.), but icu is much easier to use. The problem is that iconv doesn't provide any easy way to compute the length of the destination buffer, whereas in icu this is trivial. For example, if I want to compute how long a buffer to contain text will be when I convert it from utf-16 all I do is pass 0 as the destination length and a NULL buffer and it tells me how long to make it:

Having called that function, I simply allocate a buffer of exactly that length, pass it into the conversion function again, and Bob's your uncle. The way to do this in iconv is to guess how big to make it, then reallocate the destination buffer as often as needed during the chunk by chunk conversion. Then you can read the number of characters converted. What a messy way to do it! I particularly like the fact that icu does NOT require you to specify a locale, as iconv does for some obscure reason. That limits the conversions you can do to those locales installed on your machine, and you have to guess which of them is appropriate for the current conversion. That's just nuts.

Tuesday, February 5, 2013

Ukkonen's suffix tree algorithm

A suffix tree is a data structure that facilitates the finding of any substring of length M in a text of length N. A substring of length 6 (M) can then be found in a text of a million characters (N) in time proportional to M. This is much faster than the best string-searching algorithms, which take time proportional to the length of the text. Building a suffix tree does take time and space proportional to N, but this only needs to be done once. This makes suffix trees a very useful data structure for bioinformatics and various other textual applications.

Section 1 gives an overview of how the algorithm works. The remaining sections describe the various components of the algorithm: the phases, extensions, finding the suffix of the previous phase, suffix links, skipping extensions and completing the tree. The discussion is backed up by working C code that includes a test suite, a tree-printing module, and gnuplot files for precisely documenting cpu and memory usage.

1. Overview

Suffix trees used to be built from right to left. For the string "banana" you would create an empty tree, then insert into it "a", then "na", "ana", "nana", "anana" and finally "banana". Ukkonen's idea was to build the tree left to right by adding all the suffixes of progressively longer prefixes of the string. So first he would insert all the suffixes of "b", then all the suffixes of "ba", "ban", "bana", banan" and finally "banana". This looks a lot less efficient than the right to left method, because it multiplies the number of tree-insertions by a factor of N. However, by using a number of tricks the time complexity reduces from N3 to N.

The algorithm divides the building into N phases, each containing a number of extensions. The phases correspond to building the implicit (unfinished) suffix tree for each prefix of the overall string. The extensions correspond to inserting each of the suffixes of each prefix into the tree.

-[R]-ban*
    |
    -an*
    |
    -n*

Implicit tree of "banana" after phase 2

Listing 1

The first implicit tree called I0 is constructed manually. The empty root node is created and a leaf constructed containing the string "b". The algorithm then proceeds through N additional phases in each of which the tree is expanded to the right by one character. The terminal NULL character (written "$") is added as a unique suffix at the end so we can distinguish the suffix "anana$" from "ana$" (otherwise "ana" would be a prefix of "anana"). The set_e function will be described in Section 7 below.

2. Phases

Listing 2

A phase is just a succession of extensions. The global e variable represents the index of the last character represented in the tree. So if e equals 5 this means that all leaves end in the "a" at the end of "banana", and 6 means that the tree is complete with its terminal $. And if e equals 0 this means that the 0th character ("b") is in the tree. So for each phase we increment e and update the length of every leaf automatically.

The phase algorithm calls the extension algorithm with successively shorter suffixes of the current prefix (j..i). The loop starts with the value of old_j, which is the last value that j had in the previous phase, starting with 0. This optimisation is explained in section 6 below.

If the extension function returns 0 then the rest of the current phase can be skipped. How this works will be explained in the next section.

3. Extensions

To understand the extension algorithm you have to be clear about how the tree's data structure works:

  1. A character range j..i means the character str[j] up to and including the character str[i].
  2. Since there is only one incoming edge per node in a tree we can represent edges as belonging to their nodes. The edges in nodes are defined by their start position in the string and their length. So a start position of 1 in "banana" is the character "a" and if its length was 3 this would represent the string "ana". This way of storing strings has 3 advantages:
    1. Each node has the same size
    2. The text can be of any type: even wide UTF-16 characters.
    3. Since the text is not copied into the nodes the overall storage is proportional to N.
  3. A particular character in the tree is uniquely identified by a pointer to its node and an index into the string. I call this a "pos" or position.

Listing 3

The first step is to find the position of the suffix j..i-1 which was inserted during the previous phase. Each such substring, called β in Gusfield, is now extended by one character, so that the substring j..i is added to the tree. There are three possibilities:

  1. If the substring β is at the end of some leaf we just extend the leaf by one. (This step is automatic when we update via the e global).
  2. β ends at a node or in the middle of an edge and the next, or ith character, is not yet in the tree. If it ends at the start of a node we create a new leaf and attach it to that node as a child. If it ends in the middle of an edge we split the edge by adding a new internal node and attaching the leaf to it.
  3. β ends at a node or in the middle of an edge and the next, or ith character, is already in the tree. Since we were proceeding left to right, it follows that all the remaining suffixes in this phase are also suffixes of that earlier copy of this substring and must already be extended. So there is nothing more to do now and we can abort the rest of the phase.

The update_old_beta function is explained in Section 6, and update_current_link is explained in Section 4.

4. Suffix links

Navigating the tree between branches instead of down from the root converts the basic algorithm from time proportional to N3 to N2. To make this possible suffix links must be created. These record the path followed by the extension algorithm as it moves through the tree during a phase. They are then used as short-cuts when constructing the extensions of the following phase. A suffix link is set to point to the current internal node from the last such node created or found by rules 2 or 3 (see Listing 3). When rule 3 ends the phase prematurely there must already be a path leading back to the root from that point in the tree. The following suffix links are defined for the suffix tree of "banana$":

link: ana -> na
link: na -> a
link: a -> [R]
-[R]-banana$
    |
    -a-na-na$
    | |  |
    | |  -$
    | |
    | -$
    |
    -na-na$
    |  |
    |  -$
    |
    -$

In Listing 4 the update_current_link function sets the link of the last seen internal node "current" to the value of the next internal node.

Listing 4

5. Finding β

For each iteration of the extension algorithm the correct position for the new suffix j..i is at the end of the existing suffix j..i-1, or β. This string can be found naively by walking down from the root starting at character offset j. However, this requires navigating through a maximum of N nodes for each extension. A shortcut that takes constant time is needed, and can be concocted by following the suffix links.

Figure 1: Walking across the tree

The last position of i-1 in each extension can be used to find its position in the next extension by following the nearest suffix link. The node immediately above the last position will not in fact contain a suffix link, because this hasn't been set yet. We must therefore go one node further up the tree (see Figure 1) to the first node with a suffix link, or to the root. In doing so we trace a path called γ. After arriving at the end of the link we then walk down the branch, where we will find an exact copy of γ, to the new position for the next extension. The journey is complicated by the use of indices of characters, not the characters themselves. Also, we may encounter multiple nodes on our journey down the next branch. Since the length of the journey is determined by the local distances between nodes and not the size of the tree, informally it is clear that the time required will be constant with respect to N.

Listing 5

Listing 5 shows an implementation of the algorithm. There are four possibilities:

  1. If this is the first extension in the phase we just use the last value of β, extended by one character, from the previous phase.
  2. A range where j > i indicates the empty string. (Recall that in find_beta the value of i is that of the previous phase). In this case we are trying to extend the root by a single character.
  3. If the suffix is the entire string (starting at 0) this means the longest leaf, which is always pointed to by f.
  4. In all other cases we walk across the tree by first locating the position of the previous j..i substring. Then we walk up at least one node or to the root, follow the link and walk down following the same textual path. (If we do reach the root we must discard γ, because it will be incorrect. In this case we just walk down naively from the root.) Walking up does not require us to make any choices at each node since there is always only one parent, but on the way down we require a path to follow so that the correct children are selected at each node. So we save the "path" (A simple data type containing an offset and a length) during the up-walk, and destroy it once we have walked down the other side.

6. Skipping extensions

We have already established in the previous section that the time taken for each extension is constant. However, the number of extensions per phase is still proportional to N. Linear time complexity is attained by reducing this to a constant also.

A true suffix tree has exactly N leaves for a text of length N. Since the only rule in the extension algorithm that creates leaves is rule 2, and since "once a leaf always a leaf" it follows that on average rule 2 must be applied exactly once per phase. Similarly, rule 3 can at most be applied once per phase. We have already observed that the use of the e global makes all applications of rule 1 redundant. So, informally, each phase will take constant time if we can just skip all the leaf extensions and start with the first necessary application of rule 2.

An examination of the program's execution reveals that the rules are applied in order for each phase: first a number of rule 1s, then rule 2 and finally rule 3 (if at all). The applications of rules 2 and 3 for the string "banana" are:

applying rule 2 at j=1 for phase 1
applying rule 2 at j=2 for phase 2
applying rule 3 at j=3 for phase 3
applying rule 3 at j=3 for phase 4
applying rule 3 at j=3 for phase 5
applying rule 2 at j=3 for phase 6
applying rule 2 at j=4 for phase 6
applying rule 2 at j=5 for phase 6
applying rule 2 at j=6 for phase 6

So we only have to remember the position of the last inserted suffix after each application of rule 2 or 3 and this can then be used instead of β at the start of the next phase. Also the value of j can be the last value it had in the previous phase. This trick allows us to skip most of the extensions and reduce their number per phase to a constant value.

Listing 6

We remember the last position of j..i in the previous phase by extending the position of β by the ith character, as shown in Listing 6.

6. Finalising the tree

Leaf-nodes are extended automatically by setting their length to "infinity", which for practical purposes, can be INT_MAX in C (2147483647). Whenever we request the end of such a node the answer will then be the current value of e. However, this is inconvenient for a finished tree, in which the lengths of all nodes should be correctly set. We can do this simply by recursing down the tree, looking for leaves and setting their lengths to e-node_start(v)+1. The time cost is proportional to N, but in addition to that already incurred, so overall the algorithm remains O(N).

Listing 7

7. Demonstration of linearity

Ukkonen's original algorithm did not specify how to represent the multi-branching nodes of the tree. The choice is linked lists or hashtables. It turns out that hashtables are not much better than lists, even for plain text, and use more memory. Running the test program for either pure list nodes or a mixture of hashtables (for branches > 6) or lists for smaller nodes confirms that the time taken does indeed increase linearly with file size:

Here is the memory usage comparing plain lists and a hashtable when the node size exceeds 6:

If you set MAX_LIST_CHILDREN to INT_MAX in tree.c and recompile you will get the list representation.

References

E. Ukkonen, 1995. Online construction of suffix trees. Algorithmica 14.3, 249–260. http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

D. Gusfield, 1997. Linear-time construction of suffix trees in Algorithms on strings, trees and sequences, Cambridge University Press. http://www.stanford.edu/~mjkay/gusfield.pdf

D. Schmidt, 2013. A C implementation of Ukkonen's suffix tree-building algorithm, with test suite and tree print.

Using hash tables to improve scalability of Ukkonen's algorithm