Wednesday, December 18, 2013

Putting content on a specific part of a Drupal page

I wanted to add some buttons to my page. The best place looked like it was on the right hand side of the menu bar. That way I wouldn't waste any vertical space and the buttons only needed to be small and inconspicuous (they were for adding and deleting annotations). The problem was how to do it. You couldn't do that by just assigning the text to an existing region so I defined a new one. I called it "Annotations". In my theme's .info file I added the line:

regions[annotator] = Annotations

But you could call it anything you like. The next step was to create a module that could generate the content. I created a trivial "annotator" module that had a hook_block_view function. This generated some content when the module was being rendered by the template.

So to get the module called I had to install and enable it. Then in the Structure section of admin I told Drupal to assign the annotator module's output to the "Annotations" region.

Finally, I edited the template (as it turned out the specific template for that page, but it could be the general one) in my theme so that just after it has rendered the menu it then tests for the existence of the "annotator" module and renders the content:

<?php if ($page['annotator']) : ?>
<?php
    if (module_exists('annotator'))
        print drupal_render($page['annotator']);
?>
<?php endif; ?>

All that remains to be done is to style the module output so that it floats to the right and Bob's your uncle. The next step will be to hide them from casual browsers since only logged-in users are supposed to be able to annotate.

Thursday, December 12, 2013

Login/logout to Drupal remotely in PHP

Previously I explained how to login/logout to Drupal using Java. It's far easier in PHP because of the curl module, which can easily be added to any php installation. Most the http header fields are handled for you so the code is pretty simple. A couple of tricks though: you must set CURLOPT_FOLLOWLOCATION on the connection when reading the logout response or you'll just get a redirect. When logging in that doesn't matter as the cookie is returned immediately. And if you want to store the cookie you can use get_variable, set_variable to store it in the Drupal database. There probably isn't much need for the logout function but I included it anyway for completeness:

Then there is the little question of when the cookie crumbles. Since the above code reads the "expires" string from the initial cookie response it is easy to store the expiration string along with the cookie (so I return an array from remote_login), and then compare the current time with this value whenever you retrieve it. If it has expired it needs to be renewed:

Monday, December 9, 2013

Splitting and joining compressed folders in Linux

Github has a file limit of 50MB, and I had a repository with an archive of compressed files that exceeded that. So I thought it might be nice if I could split the archive up into segments and then rejoin and decompress them when they were needed. But how do to this efficiently?

There is plenty of advice available on the Web, and initially I was attracted to tar, which has a means of creating archive sections, originally used to create files for separate tapes. But it doesn't compress and split at the same time, and you need to write a clumsy script to rename each of the sections. So I went back to the method everyone said was bad: split. All you have to do is create a tar archive in the usual way:

tar czf archive.tar.gz myfolder

Now split it into numbered sections:

split -a 1 -n 5 -d archive.tar.gz archive.tar.gz.

And what you end up with is a set of 5 more or less equal-sized files numbered archive.tar.gz.0, archive.tar.gz.1, etc. (StackOverflow suggests using a pipe, but then the segments have to be of a fixed size, since the stream length can't be measured.) If you need more than 10 segments (-n option) increase the -a option, which specifies the length of the suffix. To sew them back up again all you need is:

cat archive.tar.gz.* > archive.tar.gz

The reason this works is that the wildcard orders the files via the suffix, so they will be joined up in the correct order. Otherwise you'd have to specify each file. Now to decompress:

tar xzf archive.tar.gz

The only drawback with this method is that you need to know the number of segments in advance. This is not a problem for me, as I can put it into a script and adjust it when needed, which will be rarely, if ever.

Thursday, November 7, 2013

A simple blog page for Drupal 7

'Don't use the blog module! It turns your whole website into a blogging site!' Or: 'you're nuts if you don't use views' (then install twenty modules and spend the next six months learning how to use them!)

The Web is a great place for advice, isn't it? All I wanted was pretty simple: just a page with news items submitted by various users, a read more link and the capacity to comment on them. I doubted that the designers of Drupal hadn't already cracked this one, but you wouldn't know that from looking at some of the solutions posted on the Web. Though, admittedly, the designers of the blog module could have added a configure button ...

For those of you with similar ambitions, I suggest you at least try half a dozen mouse-clicks first:

  1. Enable the blog module. This turns on a blog entry type of content, which is exactly what you want.
  2. Go to Admin->Structure->Blog Entry->edit and deselect the 'promote to front page' option. Save.
  3. Go to http://<your-host>/?q=blog/ to add your first blog entry. Add this link to your front page and now you have a page of news items.

That's all there is to it. No rocket-science, no phpmyadmin hacking of the database, no fancy modules and courses in how to use Drupal views, no php. Sometimes I wonder why it is that, when people are faced by a problem they have no clue how to solve, that they start a competition to find the most complex answer to the simplest problem.

Tuesday, November 5, 2013

Using setenv.sh on tomcat6

If you want to set java runtime options for tomcat, a Java application, you can't do that in your web application, because by the time it gets called, the JVM has already been created and configured by Tomcat. So the Tomcat way around this is to let the user specify a shell script to be run from catalina.sh, which will be called to set CATALINA_OPTS, which will be passed to java. One of the gotchas associated with this method, however, is that catalina.sh, which is the script used by the OS on Linux to launch Tomcat uses the dash (/bin/sh) not the bash shell. If you, like foolish me, wrote your script in bash then it will not run. If you worse still invoke bash from within dash using a shebang #!/usr/bin/bash line the contents of your script will be ignored. So you have to write it in dash. I had to learn this the hard way, by tracing back uninformative error messages, and I don't want to forget it. So that is why I am recording it here.

Sunday, November 3, 2013

Native height, width of an image in jQuery

jQuery doesn't seem to provide a function for this, I guess because .nativeWidth is not consistently implemented across browsers. The reason I want to know the native width, that is, unscaled width of an image, is so that I can scale an imagemap that overlays it. Without knowing the precise ratio of the current scaling, the imagemap areas will be offset by the difference between the original area coordinates and the current positions and size of the image.

So my solution, which may not work for everyone, was to write the real image dimensions into the HTML in the first place. I did this via getimagesize in php. Then, by attaching an event handler in jQuery when the image first loads you can record the native width and height using custom attributes:

$( 'img' ).load(function() {
        $this.attr('nativew',$this.attr('width'));
        $this.attr('nativeh',$this.attr('height'));
    });
...// now retrieve them later:
w = $that.attr('nativew'),
h = $that.attr('nativeh');

Now I absolutely haven't tested this on more than Firefox and Chrome, but my hunch is that it will work on everything, because at bottom it relies on jQuery. The only problem for some will be writing the true image dimensions to the HTML. But it may not work with all image types. I used .png.

Wednesday, October 30, 2013

Login to or logout from Drupal in Java

I wanted to post and update a lot of biographical data to Drupal, but I could only do so by first logging in – manually. How tedious. So I thought that it would a good general tool to be able to programmatically login and logout of Drupal without having to go through the GUI each time. But how to do it?

The method I settled on was to use Java's HttpURLConnection class. I would open a http connection to Drupal, send it my username and password, get back a cookie and then send my stuff. When I had finished I could then programmatically logoff. Cool. But it turned out to be a bit more difficult than I had imagined. i used POST because I couldn't make GET work properly, and all applications seemed to use POST. Some of the parameters are probably not necessary, but I added them all from the packets I captured in Wireshark, and at least this way it works:

Tuesday, October 22, 2013

Mouseenter, mouseout on tablets

My website looked cool with its mouseenter and mouseleave events being fired as the cursor sped across the screen, until I tried it on an iPad. Mouseenter was simulated by tapping, and a double-tap opened the link like it was supposed to, but the mouseout event wasn't firing at all, so that tapping on another link didn't clear the state of the first link. So the program logic was broken. How to fix this I had no idea. Googling it just came up with opinions saying that basically there was nothing you could do. Come on! There's always something you can do. Mouseenter works fine via tap. Then all you have to do is refactor your code so that it doesn't need mouseout to work. So on mouseenter just test if there is already a highlighted link and if so, do the mouseout stuff. But keep the mouseout function because mice are still numerous in this laptop/tablet/phone world in which we all live.

Also remember that fingers are a lot less sharp than pointy cursors. Things to click on need to be of a good size. Tablet users are a bit like babies: they need big fat bightly coloured things to touch.

Monday, October 21, 2013

More Drupal trickery

CMSes are not rocket science but they are complex systems, and learning your way around a new one can be a bit tiresome. So I decided to record a few fixes that took me hours to get right in case anyone else has the same problems. Also I can go back to this page if I forget.

Getting rid of those pesky node titles

If, like me, you like a simplistic home page that is mostly graphical, you don't want a title taking up vertical space and spoiling your carefully crafted design. Urgh! Although some themes seem to allow you to turn off titles, if you don't have that luxury you can still turn them off by setting the link inside the h1 or h2 element to "display: none" in your theme's .css file. I noticed that the exact element enclosing the node titles changes even within the space of a few weeks, so be careful. On my localhost installation the following worked:

.title { display:none }

But on the server a slightly different Drupal version or configuration forced me to change this to:

h2 a { display:none }

And even that is a bit dodgy, since it hides all links inside h2 elements. But refine it if you will. I used the 'Inspector' in Firefox Dev tools to find out what display properties the element in question had, then I looked for those properties in the theme's .css files, and if they weren't there I wrote a new rule. Easy stuff, but it's satisfying to get it working.

Just the story title appearing, not the content

This trivial problem had me going for a couple of hours. All it was was the 'teaser only' setting in Structure->Content Types->Appearance->ManageDisplay->Custom Display Setttings. Wow, they really bury this stuff in the bowels. I just set it to "Full content" and I got my content back on the home page. Whew!

Drupal on Postgres

I decided to use Postgres because someone else was already using Mysql on the server and I didn't want to be accused of screwing up their database. Also, I didn't know the Mysql password :-). Half the advice on this topic is just rubbish. People suggest tinkering with things that Drupal does for you automatically. Like adding a $db_url line to settings.php. What utter rubbish -- the installer does that for you. All you have to do is ensure that you have pdo_pgsql and pgsql installed. (You can check this by putting a file containing just "<?php phpinfo();" and then accessing it over the Web, to get the server configuration. You should have two sections with exactly those titles). Once everything is installed correctly Postgres comes up as a database option when you install. Easy-peasy.

Tooltips that are not painfully slow

If you use image-maps and areas to highlight portions of an image it is nice to tell the users what's in store if they click on that link. You can add a tooltip using the title attribute on <area> but all the browsers display it only after about 3 dreary seconds, by which time the user's mouse has moved on. No doubt that prevents irritating little popups as the mouse skims over the page looking for something, but in this case it is just plain annoying. So a custom tooltip is called for. I found plenty of advice and implementations on the Web but only a few that worked with <area>s. In the end, not wanting to download a monstrous library that didn't work the way I wanted I just concocted my own. (I should have done that in the first place).

You start with a simple definition of a tooltip. If you feel interested you can add a pointy arrow or whatever, but I was feeling a bit lazy:

div.tooltip
{
    background-color: #FFFFCC;
    width: auto;
    height: 18px;
    position: absolute;
    padding: 5px;
    -moz-border-radius: 6px;
    -webkit-border-radius: 6px;
    border-radius: 6px;
    font: 15px times,serif;
    box-shadow: 4px 4px 4px #202020;
}

That gives you a boring tooltip-like box with a drop-shadow and flexible content. The text of the tooltip can then be specified in the area's alt attribute. The real trick was placing it on the screen in just the right place: centred under the area in question. The problem is that image-maps don't belong to the browser's 'box-model'. Areas can be regions and are placed within the image-map as only imagemaps know how. So browsers ignore them. If you ask for the position of an image-map jQuery will give you the bottom-right corner of the image, and the size of the map as 0, 0. The same goes for the areas themselves. Luckily you can parse the coords attribute of the areas and work out their x and y offsets and their height and width pretty easily. Then all you need is to find the offset (top, left) of the grandparent of the area to find out where to place the tooltip. The meat of it is:

function doNothing()
{
var text = $('.tooltip').text();
}
function initTooltips()
{
$('area').each(function()
{
    $(this).mouseenter(function() 
    {
        var desc = $(this).attr("alt");
        $(this).parent().parent().append( 
            "<div class=\"tooltip\">"+desc+"</div>" );
        var tooltip = $('.tooltip');
        if ( tooltip.size()>1 )
            tooltip.first().remove();
        /* allow new element to get a size */
        setTimeout(doNothing,0);
        var tooltip = $('.tooltip').last();
        var coords = $(this).attr("coords");
        var values = coords.split(",");
        var bot = 0;
        var left = 10000;
        var right = 0;
        for ( var i=0;i<values.length;i+=2 )
        {
            var x = parseInt(values[i]);
            var y = parseInt(values[i+1]);
            if ( bot < y )
                bot = y;
            if ( left > x )
                left = x;
            if ( x > right )
                right = x; 
        }
        var awidth = right-left;
        var offset = new Array();
        var ppoffset = tooltip.parent().offset();
        var ttwidth = tooltip.width();
        offset.top = ppoffset.top+bot+5;
        offset.left = (ppoffset.left + left + (awidth/2))-(ttwidth/2);  
        tooltip.offset( offset );
    });
    $(this).mouseout(function()
    {
        $('.tooltip').remove();
    });
});
}

You need to call the doNothing function to give the renderer a time-slice to calculate the dimensions of your new div. Otherwise it can end up returning a width of 0. Now just call the initTooltips function in your jQuery ready function and it will be awesome.

A different home page

If you want to have your home page use a different template from the rest of your site I'd say you're quite sane. After all, who would be mad enough to want it to be the same? The easiest way to do this is to create a page--front.tpl.php file inside the templates folder of your chosen theme. Then hack it until it looks right. That particular template file will then be used instead of the regular page.tpl.php file just for the home page. A victory for simplistic home pages everywhere.

Ugly "read more" link on home page

If you publish a page to the front page Drupal assumes that it is only a teaser. So it adds a "Read more" link, which takes you back to the home page. This is really a bug, but you can get around it by making that page the default home page in Configuration->System->Site Information. Just type in your node URL for the basic page on the home page and hey presto: no more annoying "Read more" link. (This also gives the node-titles the .title .class, which makes it easier to hide them). Sometimes I get the impression that the authors of Drupal have created something even they don't fully understand.

Saturday, October 19, 2013

Variable number of module configuration parameters in Drupal's hook_form

Drupal lets you define in a module a form for editing its configuration parameters but the documentation doesn't seem to explain how to do a variable number of parameters. For example, let's say you wanted to supply some fixed quotes to cycle through like a news ticker, but you wanted to let the user specify not only which quotes but how many*. But Drupal seems to require you to specify a fixed number of configuration parameters for your module.

There is an easy way around this. You can define a parameter like quote_ticker_numquotes, and set it to 6 by default:

function quote_ticker_form($form, &$form_state)
{
    $n = variable_get('quote_ticker_numquotes',6);
    if ( $n > 32 )
        $n = 32;
    elseif ( $n < 1 )
        $n = 1;
    $form['quote_ticker_numquotes'] = array(
        '#type' => 'textfield',
        '#title' => t('Number of quotes (1 to 32)'),
        '#default_value' => $n,
        '#size' => 2,
        '#maxlength' => 2,
        '#required' => TRUE
    );
    ...
}

Now, having set the number of desired quotes to $n it is a simple matter to create that many fields:

...
    for ( $i=1;$i<=$n;$i++ )
    {
        $form['quote_ticker_'.ordinal($i)] = array(
        '#type' => 'textfield',
        '#title' => t('Quote '.$i),
        '#default_value' => variable_get('quote_ticker_'.ordinal($i),
            ordinal($i).' quote <a href="">more...</a>'),
        '#size' => 64,
        '#maxlength' => 128,
        '#required' => TRUE
        );
    }
    return system_settings_form($form);
}

'ordinal' is just a simple function to turn a number into its ordinal name. Now, whenever the user changes the number of quotes the number of fields for defining quotes will expand or contract accordingly.

* The reason I decided to write a 'quote-ticker' was because news-tickers pull data from stories (nodes) whereas I wanted the quotes to be essentially fixed and to point to arbitrary URLs not just to Drupal pages.

Tuesday, October 1, 2013

Java garbage collection on Linux

Recently we had a problem with our Java web service. It ran a computationally expensive and memory intensive program that often ran out of memory for no apparent reason. Even though the files it was dealing with were around 70K each and it had 1GB of heap it ran and ran for 5 minutes and then failed with an out of memory error. So I ran the same program with exactly the same JVM settings on my MacOSX laptop with the same data and hey presto it ran flawlessly in one minute. Then I ran JProfiler on the two instances of the program - on OSX and Linux (Ubuntu 12.04). The JVM 1.6 settings were -Xmx1024m -Xss8m in both cases. Here is the OSX memory profile:

And here is the Linux memory profile during execution of the same program:

What seems to be going on here is that the garbage collector isn't being called often enough on Linux. And I know that the JVMs are not exactly the same but I did install the Oracle JVM, not the default OpenJDK on Linux but it made no difference. On the Oracle site they suggest trying the -Xincgc setting (incremental garbage collector). I added this to the JVM on Linux and magically the memory profile looked exactly like the one on OSX. It seems that those Apple guys aren't so dumb after all when it comes to tuning Java and their setup simply works better - for me at least. But you can get the same thing on Linux by adding one setting to your java invocation.

Thursday, June 27, 2013

Debugging dynamic libraries in NetBeans

When you are trying to debug a dynamic library in C or C++ in Netbeans it is frustrating not to be able to step into the code. Since I continue to forget how to set this up I have made some notes for future reference. Maybe other people will have the same problem.

  1. Build the dynamic library as a project in Netbeans
  2. Add it as a project in the project Properties->Build->Linker->Libraries dialog of the main application
  3. Set the library project to build Debug code
  4. Add the library's source folder in the main application to Project Properties->General->Source folders
  5. Add the library project to Properties->Related Projects of the main application project and check the "build" box. (optional)

Now rebuild the main project or just debug it and you should be able to step in to the library code.

Sunday, June 16, 2013

Installing Mango HMI on CentOS Tomcat6

I had a bit of trouble installing the Mango HMI on my CentOS box, but now I've solved it. It's a simple permissions problem, exacerbated by poor error messages in the catalina.log file on tomcat6. Googling the problem just turned up lots of other frustrated users, so I thought I'd wade in with the solution.

Here's the beef: You look in /usr/share/tomcat6/logs/localhost*.log. In there you'll see that the default Derby database didn't have write permission to /var/lib/tomcat6 and /etc/share/tomcat6/. You can find out who is running tomcat easily:

ps aux | grep tomcat

That tells me that "tomcat" is running tomcat. OK. So change the ownership of the two directories to "tomcat" and now restart tomcat6 and Bob's your uncle.

Sunday, May 26, 2013

Using hash tables to improve suffix tree performance

Ukkonen's suffix tree algorithm is clever, but what he doesn't explain is how to represent the nodes. Each node has potentially many children. This doesn't matter for genetic data and its four-letter alphabet, but for natural text each node can branch 100 ways or more, and these tend to be the most heavily used. The usual representation is via a linked list, but this can get pretty slow and compromises the linear performance of the algorithm for large file sizes. So over the past few days I have been experimenting with using hashtables for the bigger nodes, that is, those over a certain threshold, since this allows the correct child node to be selected in constant time. The hash function I'm using is pretty simple: I'm just indexing directly into the table using a single character as a key, after modding it over the number of buckets. (Actually this is pretty much optimal.) By experiment I discovered that the best point to switch from lists to hashtables is about 6 child nodes. Unfortunately, overall memory costs are increased by about 15%, and CPU consumption is not that drastically reduced, but more importantly it scales better at larger file-sizes. So for files of several megabytes, CPU cost would likely remain linear, whereas with the list representation it slowly increases. Memory usage is around 44 bytes per input text character on a 64 bit machine. (Remember that each pointer costs 8 bytes.) The reason I want this is so I can merge really big files without the CPU cost going through the roof.

Monday, February 18, 2013

ICU is cool

I have always disliked iconv. Both iconv and icu are designed to convert text in various character encodings into other encodings. iconv is normally installed on your computer (in OSX or Linux etc.), but icu is much easier to use. The problem is that iconv doesn't provide any easy way to compute the length of the destination buffer, whereas in icu this is trivial. For example, if I want to compute how long a buffer to contain text will be when I convert it from utf-16 all I do is pass 0 as the destination length and a NULL buffer and it tells me how long to make it:

Having called that function, I simply allocate a buffer of exactly that length, pass it into the conversion function again, and Bob's your uncle. The way to do this in iconv is to guess how big to make it, then reallocate the destination buffer as often as needed during the chunk by chunk conversion. Then you can read the number of characters converted. What a messy way to do it! I particularly like the fact that icu does NOT require you to specify a locale, as iconv does for some obscure reason. That limits the conversions you can do to those locales installed on your machine, and you have to guess which of them is appropriate for the current conversion. That's just nuts.

Tuesday, February 5, 2013

Ukkonen's suffix tree algorithm

A suffix tree is a data structure that facilitates the finding of any substring of length M in a text of length N. A substring of length 6 (M) can then be found in a text of a million characters (N) in time proportional to M. This is much faster than the best string-searching algorithms, which take time proportional to the length of the text. Building a suffix tree does take time and space proportional to N, but this only needs to be done once. This makes suffix trees a very useful data structure for bioinformatics and various other textual applications.

Section 1 gives an overview of how the algorithm works. The remaining sections describe the various components of the algorithm: the phases, extensions, finding the suffix of the previous phase, suffix links, skipping extensions and completing the tree. The discussion is backed up by working C code that includes a test suite, a tree-printing module, and gnuplot files for precisely documenting cpu and memory usage.

1. Overview

Suffix trees used to be built from right to left. For the string "banana" you would create an empty tree, then insert into it "a", then "na", "ana", "nana", "anana" and finally "banana". Ukkonen's idea was to build the tree left to right by adding all the suffixes of progressively longer prefixes of the string. So first he would insert all the suffixes of "b", then all the suffixes of "ba", "ban", "bana", banan" and finally "banana". This looks a lot less efficient than the right to left method, because it multiplies the number of tree-insertions by a factor of N. However, by using a number of tricks the time complexity reduces from N3 to N.

The algorithm divides the building into N phases, each containing a number of extensions. The phases correspond to building the implicit (unfinished) suffix tree for each prefix of the overall string. The extensions correspond to inserting each of the suffixes of each prefix into the tree.

-[R]-ban*
    |
    -an*
    |
    -n*

Implicit tree of "banana" after phase 2

Listing 1

The first implicit tree called I0 is constructed manually. The empty root node is created and a leaf constructed containing the string "b". The algorithm then proceeds through N additional phases in each of which the tree is expanded to the right by one character. The terminal NULL character (written "$") is added as a unique suffix at the end so we can distinguish the suffix "anana$" from "ana$" (otherwise "ana" would be a prefix of "anana"). The set_e function will be described in Section 7 below.

2. Phases

Listing 2

A phase is just a succession of extensions. The global e variable represents the index of the last character represented in the tree. So if e equals 5 this means that all leaves end in the "a" at the end of "banana", and 6 means that the tree is complete with its terminal $. And if e equals 0 this means that the 0th character ("b") is in the tree. So for each phase we increment e and update the length of every leaf automatically.

The phase algorithm calls the extension algorithm with successively shorter suffixes of the current prefix (j..i). The loop starts with the value of old_j, which is the last value that j had in the previous phase, starting with 0. This optimisation is explained in section 6 below.

If the extension function returns 0 then the rest of the current phase can be skipped. How this works will be explained in the next section.

3. Extensions

To understand the extension algorithm you have to be clear about how the tree's data structure works:

  1. A character range j..i means the character str[j] up to and including the character str[i].
  2. Since there is only one incoming edge per node in a tree we can represent edges as belonging to their nodes. The edges in nodes are defined by their start position in the string and their length. So a start position of 1 in "banana" is the character "a" and if its length was 3 this would represent the string "ana". This way of storing strings has 3 advantages:
    1. Each node has the same size
    2. The text can be of any type: even wide UTF-16 characters.
    3. Since the text is not copied into the nodes the overall storage is proportional to N.
  3. A particular character in the tree is uniquely identified by a pointer to its node and an index into the string. I call this a "pos" or position.

Listing 3

The first step is to find the position of the suffix j..i-1 which was inserted during the previous phase. Each such substring, called β in Gusfield, is now extended by one character, so that the substring j..i is added to the tree. There are three possibilities:

  1. If the substring β is at the end of some leaf we just extend the leaf by one. (This step is automatic when we update via the e global).
  2. β ends at a node or in the middle of an edge and the next, or ith character, is not yet in the tree. If it ends at the start of a node we create a new leaf and attach it to that node as a child. If it ends in the middle of an edge we split the edge by adding a new internal node and attaching the leaf to it.
  3. β ends at a node or in the middle of an edge and the next, or ith character, is already in the tree. Since we were proceeding left to right, it follows that all the remaining suffixes in this phase are also suffixes of that earlier copy of this substring and must already be extended. So there is nothing more to do now and we can abort the rest of the phase.

The update_old_beta function is explained in Section 6, and update_current_link is explained in Section 4.

4. Suffix links

Navigating the tree between branches instead of down from the root converts the basic algorithm from time proportional to N3 to N2. To make this possible suffix links must be created. These record the path followed by the extension algorithm as it moves through the tree during a phase. They are then used as short-cuts when constructing the extensions of the following phase. A suffix link is set to point to the current internal node from the last such node created or found by rules 2 or 3 (see Listing 3). When rule 3 ends the phase prematurely there must already be a path leading back to the root from that point in the tree. The following suffix links are defined for the suffix tree of "banana$":

link: ana -> na
link: na -> a
link: a -> [R]
-[R]-banana$
    |
    -a-na-na$
    | |  |
    | |  -$
    | |
    | -$
    |
    -na-na$
    |  |
    |  -$
    |
    -$

In Listing 4 the update_current_link function sets the link of the last seen internal node "current" to the value of the next internal node.

Listing 4

5. Finding β

For each iteration of the extension algorithm the correct position for the new suffix j..i is at the end of the existing suffix j..i-1, or β. This string can be found naively by walking down from the root starting at character offset j. However, this requires navigating through a maximum of N nodes for each extension. A shortcut that takes constant time is needed, and can be concocted by following the suffix links.

Figure 1: Walking across the tree

The last position of i-1 in each extension can be used to find its position in the next extension by following the nearest suffix link. The node immediately above the last position will not in fact contain a suffix link, because this hasn't been set yet. We must therefore go one node further up the tree (see Figure 1) to the first node with a suffix link, or to the root. In doing so we trace a path called γ. After arriving at the end of the link we then walk down the branch, where we will find an exact copy of γ, to the new position for the next extension. The journey is complicated by the use of indices of characters, not the characters themselves. Also, we may encounter multiple nodes on our journey down the next branch. Since the length of the journey is determined by the local distances between nodes and not the size of the tree, informally it is clear that the time required will be constant with respect to N.

Listing 5

Listing 5 shows an implementation of the algorithm. There are four possibilities:

  1. If this is the first extension in the phase we just use the last value of β, extended by one character, from the previous phase.
  2. A range where j > i indicates the empty string. (Recall that in find_beta the value of i is that of the previous phase). In this case we are trying to extend the root by a single character.
  3. If the suffix is the entire string (starting at 0) this means the longest leaf, which is always pointed to by f.
  4. In all other cases we walk across the tree by first locating the position of the previous j..i substring. Then we walk up at least one node or to the root, follow the link and walk down following the same textual path. (If we do reach the root we must discard γ, because it will be incorrect. In this case we just walk down naively from the root.) Walking up does not require us to make any choices at each node since there is always only one parent, but on the way down we require a path to follow so that the correct children are selected at each node. So we save the "path" (A simple data type containing an offset and a length) during the up-walk, and destroy it once we have walked down the other side.

6. Skipping extensions

We have already established in the previous section that the time taken for each extension is constant. However, the number of extensions per phase is still proportional to N. Linear time complexity is attained by reducing this to a constant also.

A true suffix tree has exactly N leaves for a text of length N. Since the only rule in the extension algorithm that creates leaves is rule 2, and since "once a leaf always a leaf" it follows that on average rule 2 must be applied exactly once per phase. Similarly, rule 3 can at most be applied once per phase. We have already observed that the use of the e global makes all applications of rule 1 redundant. So, informally, each phase will take constant time if we can just skip all the leaf extensions and start with the first necessary application of rule 2.

An examination of the program's execution reveals that the rules are applied in order for each phase: first a number of rule 1s, then rule 2 and finally rule 3 (if at all). The applications of rules 2 and 3 for the string "banana" are:

applying rule 2 at j=1 for phase 1
applying rule 2 at j=2 for phase 2
applying rule 3 at j=3 for phase 3
applying rule 3 at j=3 for phase 4
applying rule 3 at j=3 for phase 5
applying rule 2 at j=3 for phase 6
applying rule 2 at j=4 for phase 6
applying rule 2 at j=5 for phase 6
applying rule 2 at j=6 for phase 6

So we only have to remember the position of the last inserted suffix after each application of rule 2 or 3 and this can then be used instead of β at the start of the next phase. Also the value of j can be the last value it had in the previous phase. This trick allows us to skip most of the extensions and reduce their number per phase to a constant value.

Listing 6

We remember the last position of j..i in the previous phase by extending the position of β by the ith character, as shown in Listing 6.

6. Finalising the tree

Leaf-nodes are extended automatically by setting their length to "infinity", which for practical purposes, can be INT_MAX in C (2147483647). Whenever we request the end of such a node the answer will then be the current value of e. However, this is inconvenient for a finished tree, in which the lengths of all nodes should be correctly set. We can do this simply by recursing down the tree, looking for leaves and setting their lengths to e-node_start(v)+1. The time cost is proportional to N, but in addition to that already incurred, so overall the algorithm remains O(N).

Listing 7

7. Demonstration of linearity

Ukkonen's original algorithm did not specify how to represent the multi-branching nodes of the tree. The choice is linked lists or hashtables. It turns out that hashtables are not much better than lists, even for plain text, and use more memory. Running the test program for either pure list nodes or a mixture of hashtables (for branches > 6) or lists for smaller nodes confirms that the time taken does indeed increase linearly with file size:

Here is the memory usage comparing plain lists and a hashtable when the node size exceeds 6:

If you set MAX_LIST_CHILDREN to INT_MAX in tree.c and recompile you will get the list representation.

References

E. Ukkonen, 1995. Online construction of suffix trees. Algorithmica 14.3, 249–260. http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

D. Gusfield, 1997. Linear-time construction of suffix trees in Algorithms on strings, trees and sequences, Cambridge University Press. http://www.stanford.edu/~mjkay/gusfield.pdf

D. Schmidt, 2013. A C implementation of Ukkonen's suffix tree-building algorithm, with test suite and tree print.

Using hash tables to improve scalability of Ukkonen's algorithm