Sunday, May 26, 2013

Using hash tables to improve suffix tree performance

Ukkonen's suffix tree algorithm is clever, but what he doesn't explain is how to represent the nodes. Each node has potentially many children. This doesn't matter for genetic data with its four-letter alphabet, but for natural text a node can branch 100 ways or more, and these wide nodes tend to be the most heavily used. The usual representation is a linked list of children, but scanning that gets pretty slow and compromises the linear performance of the algorithm at large file sizes.

So over the past few days I have been experimenting with using hash tables for the bigger nodes, that is, those over a certain threshold, since a hash table allows the correct child node to be selected in constant time. The hash function I'm using is pretty simple: I index directly into the table, using a single character as the key and modding it by the number of buckets. (For one-character keys this is pretty much optimal.) By experiment I discovered that the best point to switch from lists to hash tables is about 6 child nodes.

The trade-off is that overall memory cost increases by about 15%, and CPU consumption is not drastically reduced on small inputs; but, more importantly, it scales better at larger file sizes. For files of several megabytes, CPU cost would likely remain linear, whereas with the list representation the per-character cost slowly increases. Memory usage is around 44 bytes per input character on a 64-bit machine. (Remember that each pointer costs 8 bytes.) The reason I want this is so I can merge really big files without the CPU cost going through the roof.
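To make the scheme concrete, here is a minimal sketch in C of what such a hybrid child index might look like. The names, the bucket count of 16 and the promotion logic are all my own illustration rather than my actual code; only the threshold of 6 and the single-character hash come from the description above, and error handling is omitted.

    #include <stdlib.h>

    #define LIST_LIMIT  6    /* switch point found by experiment above */
    #define NUM_BUCKETS 16   /* bucket count assumed for this sketch */

    typedef struct node node;
    struct node
    {
        char first;     /* first character of the edge leading to this child */
        node *next;     /* next sibling in the list, or next in a bucket chain */
        node *kids;     /* head of the child list (used while below the limit) */
        node **table;   /* NUM_BUCKETS buckets once promoted; NULL before that */
        int nkids;
        /* edge offsets, suffix link etc. omitted for brevity */
    };

    /* the hash function: a single character modded over the bucket count */
    static int hash( char c )
    {
        return (unsigned char)c % NUM_BUCKETS;
    }

    /* find the child whose edge starts with c, or return NULL */
    node *find_child( node *p, char c )
    {
        node *n = (p->table != NULL) ? p->table[hash(c)] : p->kids;
        while ( n != NULL && n->first != c )
            n = n->next;
        return n;
    }

    /* prepend a child onto its bucket chain */
    static void bucket_add( node **table, node *c )
    {
        int h = hash( c->first );
        c->next = table[h];
        table[h] = c;
    }

    /* add a child, promoting the node to a hash table past the threshold */
    void add_child( node *p, node *c )
    {
        if ( p->table == NULL && p->nkids + 1 > LIST_LIMIT )
        {
            /* promote: rehash the existing child list into buckets */
            node **table = calloc( NUM_BUCKETS, sizeof(node*) );
            node *n = p->kids;
            while ( n != NULL )
            {
                node *next = n->next;
                bucket_add( table, n );
                n = next;
            }
            p->table = table;
            p->kids = NULL;
        }
        if ( p->table != NULL )
            bucket_add( p->table, c );
        else
        {
            c->next = p->kids;
            p->kids = c;
        }
        p->nkids++;
    }

Below the threshold a lookup walks a list of at most 6 nodes; above it, the walk is confined to one bucket chain, which is what keeps child selection effectively constant-time on the wide, heavily used nodes.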