Tuesday, January 19, 2016

Passing parameters to javascript via drupal_add_js

Nowadays most Web programs contain a significant JavaScript (or jQuery) component, but content management systems are still written in PHP, Java, Python and the like. In the case of the popular CMS Drupal that language is PHP, and it would be rather nice if we had a reliable mechanism for passing arguments to a JavaScript file, so that the information would be available as parameters to customise its functionality. Unfortunately there is no such mechanism. Various hacks have been proposed and used within Drupal itself, so one can embed something like <script src="myscript.js?arg1=foo&arg2=bar"></script> and expect it to work, with the aid of a script that locates the script element, strips out the arguments, and then passes them to a function within myscript.js.
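Such an extraction function might look like the following sketch (the function name is made up; in a browser you would feed it the src attribute of the relevant script element):

```javascript
// Parse the query portion of a script URL into a plain object.
// In a browser you would pass in the src of the <script> element for
// myscript.js, e.g. document.currentScript.src.
function parseScriptArgs(src) {
  var params = {};
  var q = src.indexOf('?');
  if (q === -1) return params;
  var pairs = src.slice(q + 1).split('&');
  for (var i = 0; i < pairs.length; i++) {
    var kv = pairs[i].split('=');
    params[decodeURIComponent(kv[0])] = decodeURIComponent(kv[1] || '');
  }
  return params;
}

// parseScriptArgs('myscript.js?arg1=foo&arg2=bar')
// → { arg1: 'foo', arg2: 'bar' }
```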

Using drupal_add_js

Another technique is to install a JavaScript file before the page is fully loaded, so that it gets executed when loading is complete. For this purpose the function drupal_add_js comes into play. It has to be called within a hook_init or hook_preprocess_page function. But since parameters to JavaScript files are not really allowed, this function escapes characters like '&', '=' and '?' into %26, %3D and %3F, turning a request for a parameterised JavaScript file into a request for a file that simply isn't there.

A workaround I used successfully is to precede the call to drupal_add_js('myscript.js','file') with another call to drupal_add_js(...,'inline');. In this way I managed to store the parameters in the browser's local storage and retrieve them from there. For example, for the inline call I used: drupal_add_js("localStorage.setItem('module_params', 'docid=english/harpur/h080&target=tabs-content')",'inline'); And to retrieve these values, a small parsing function works just fine.
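A sketch of such a retrieval function (the name getModuleParams is invented, and the storage object is passed in so the parsing can be exercised outside a browser; in a page you would pass window.localStorage):

```javascript
// Read back the string stored by the inline snippet and split it into
// key/value pairs. The 'module_params' key matches the inline
// drupal_add_js call shown above.
function getModuleParams(storage) {
  var raw = storage.getItem('module_params') || '';
  var params = {};
  var pairs = raw.split('&');
  for (var i = 0; i < pairs.length; i++) {
    if (!pairs[i]) continue;
    var kv = pairs[i].split('=');
    params[kv[0]] = kv[1] || '';
  }
  return params;
}
```

In the browser, myscript.js would simply call getModuleParams(window.localStorage) once the page has loaded.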

Sunday, January 10, 2016

Compressing an unsorted list of integers

You see a lot on offer for compressing sorted lists of integers, but not so much on the rather uncomfortable question of how to compress a list of unsorted numbers. So why would you want to do that? Surely all one needs to do is sort the list then compress it? The problem is that in many cases the order of the integers represents information that sorting would destroy. Here is my case in point.

Actual hit-positions in a search engine

In a search engine you have to make a list of the documents in which each term is found, with each document assigned an integer identifier. So say we had a sorted list of identifiers for documents in which the word 'dog' occurs: 0, 1, 1, 2, 2, 3, 4, 6, 11, 21, 21. We can use an algorithm like FastPFOR because the sorted list can be converted into a list of deltas, or the differences between successive entries. These will typically be much smaller numbers than the actual values. In my case an array of 100 document identifiers compressed down to just 13 integers. Cool.

But what if you wanted to store the locations in those documents where the word 'dog' occurred as well? This would bloat the index considerably, since 'dog' might occur 100 times in a single document. I could sort and compress each list of positions the same way, but quite often such a list would be really short, maybe only one entry. Then the compression algorithm would actually increase the list size three- or four-fold, since the overhead for a compressed list adds a number of ints to the start, and – this is the real killer – one compressed list would be needed for each document the word was found in. So trying to compress positions the conventional way would, first, probably increase the overall index size, and second, require maintaining a great many small compressed lists.
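The delta step that a codec like FastPFOR relies on can be sketched as follows (JavaScript here purely for illustration, and the function names are invented; the real codec would then bit-pack the resulting gaps):

```javascript
// Turn a sorted list of document ids into the gaps between successive
// entries; duplicate ids simply become zero gaps.
function toDeltas(sorted) {
  var out = new Array(sorted.length);
  var prev = 0;
  for (var i = 0; i < sorted.length; i++) {
    out[i] = sorted[i] - prev;
    prev = sorted[i];
  }
  return out;
}

// Invert the transform: a running sum restores the original ids.
function fromDeltas(deltas) {
  var out = new Array(deltas.length);
  var prev = 0;
  for (var i = 0; i < deltas.length; i++) {
    prev += deltas[i];
    out[i] = prev;
  }
  return out;
}

// toDeltas([0,1,1,2,2,3,4,6,11,21,21])
// → [0,1,0,1,0,1,1,2,5,10,0]
```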

The solution

Ideally, we would like just one list of word-positions per word, just as we have one list of document identifiers per word. But such a list would have to be unsorted, because sorting it would destroy the information about which documents the positions refer to. Fortunately, most positions are quite small: a position can't be greater than the length of the longest document the word is found in, and if the word occurs in only a few documents, or always near the start, the values will be smaller still. Let's say all the values are less than some amount like 127. (Convenient, huh?) Then the list could be stored as 8-bit integers, with one 32-bit integer at the start to say how many bytes there are per value. If 16 bits were needed we could use 2-byte ints, or 3 bytes for 24 bits, and so on. The worst case is when the documents are bigger than 4MB or so, when we will need full 32-bit integers. But that's rare.
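The width calculation described above can be sketched like this (JavaScript for illustration only; negative values are sized via their complement so that, for instance, -128 still fits a single signed byte):

```javascript
// How many bytes per value do we need? Scan for the largest magnitude
// and round up to a whole number of bytes, as described above.
function bytesPerValue(values) {
  var max = 0;
  for (var i = 0; i < values.length; i++) {
    // For negative v, ~v is |v| - 1, which is exactly the extra range
    // a two's-complement store must cover.
    var v = values[i] < 0 ? ~values[i] : values[i];
    if (v > max) max = v;
  }
  if (max < 0x80) return 1;     // fits a signed byte (-128..127)
  if (max < 0x8000) return 2;   // 16 bits
  if (max < 0x800000) return 3; // 24 bits
  return 4;                     // fall back to full 32-bit ints
}
```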

So the strategy is pretty simple: scan all the numbers first to find the biggest (or the smallest negative number) and work out how many bits we will need. I cheat a little by rounding this up to the nearest 8 bits; if you're interested you can refine it to use an arbitrary number of bits, but you won't gain much in compression and you will lose something in speed. My Java code just uses ByteBuffer to build arbitrary-sized ints of up to 32 bits, in 8-bit hops, and then stores the list as an array of 32-bit ints – compressed, of course. So typically what you'd expect is a 25%-50% reduction in the array size, and in some cases 75%. Compressing it further by any significant amount seems to be impossible, given the near-random nature of the data.
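The Java source itself isn't reproduced in this post, but the scheme can be sketched in JavaScript, with DataView standing in for ByteBuffer. This version handles non-negative positions only, stored unsigned, to keep the sketch short; the function names are invented:

```javascript
// Pack a list of non-negative word positions: one 32-bit header word
// records the bytes-per-value, then each value is stored in that many
// bytes, big-endian.
function pack(values) {
  var width = 1, max = 0;
  for (var i = 0; i < values.length; i++) if (values[i] > max) max = values[i];
  while (width < 4 && max >= Math.pow(2, 8 * width)) width++;
  var buf = new DataView(new ArrayBuffer(4 + width * values.length));
  buf.setUint32(0, width);
  for (var j = 0; j < values.length; j++) {
    var v = values[j];
    for (var b = width - 1; b >= 0; b--) { // emit low byte last
      buf.setUint8(4 + j * width + b, v & 0xff);
      v >>>= 8;
    }
  }
  return buf;
}

// Read the header, then rebuild each value byte by byte.
function unpack(buf) {
  var width = buf.getUint32(0);
  var n = (buf.byteLength - 4) / width;
  var out = new Array(n);
  for (var j = 0; j < n; j++) {
    var v = 0;
    for (var b = 0; b < width; b++) v = v * 256 + buf.getUint8(4 + j * width + b);
    out[j] = v;
  }
  return out;
}
```

For a list whose largest position fits in two bytes, each value costs 2 bytes instead of 4, plus the single 4-byte header for the whole list: roughly the 25%-50% saving mentioned above.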

I offer no guarantees that this works for all cases etc. But it is freeware. Do what you like with it.