Blog Experiment: More SQL and Text Matching

In Viewer2, there is a feature that allows fast text searches on huge databases with tens of thousands of filenames (and associated photo captions and other meta data). Here’s how I did it.

The simple approach

The easiest way to find all filenames (the following article talks about filenames but you can use this to search for any text) matching $query (the string we are looking for) can be done a bit like this:

SELECT `filename` FROM `files` WHERE `filename` LIKE '%$query%'

Similarly, you could perform for multi-word searches by adding one test for each word you have extracted from the query string. And, you can very easily extend the syntax to include exclusions. Below is the equivalent SQL query for the query string [kitten -jpg] (filenames with the text ‘kitten’ but not the string ‘jpg’):

SELECT `filename` 
  FROM `files` 
  WHERE `filename` LIKE '%kitten%' 
    AND `filename` NOT LIKE '%jpg%'

The above works well but it can be very, very slow because it has to go through all entries in the table files to find matching rows. Speed can be improved by having queries like LIKE 'word%' but this limits the searches and in the end doesn’t give too much extra speed.

And if you are performing fuzzy searches…

Using a token-based approach

A lot of speed can be gained by splitting the filenames and the searched text into tokens and then performing the search using the tokens. I ended up using the following structure (SQLite syntax):

CREATE TABLE `files` (
  `file_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `filename` TEXT
);

CREATE TABLE `tokens` (
  `token_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `word` TEXT UNIQUE
);

CREATE TABLE `link_file_token` (
  `token_id` INT, -- Foreign key to the table `tokens`
  `file_id` INT   -- Foreign key to the table `files`
);

The idea is that each word in the filename can be represented with a single index to the `tokens` table. Thus, you can also search by a single integer instead of text comparison. You have to compare text only when generating the tokens (which is extremely helpful if you are using a comparatively expensive comparison).

First, when you insert the a file in the files table, you split the filename into words and link the file and the tokens (which may already exist in the `tokens` table). Later, when you need to perform a search, you do the same but use the generated tokens to search matching links.

The filename ‘Cute kitten picture.jpg‘ can be split in four tokens which results to four links between the token table and the file table. I also split words when the case between characters changes (splits filenames like ‘CuteKittenPicture.jpg‘) but this is up to you. In my implementation the tokens are also case-insensitive.

Searching using tokens

Here is a simple example query that finds matching files using the approach described above:

SELECT `filename` 
  FROM `files` 
  JOIN `link_file_token` USING `file_id` 
  WHERE `token_id` IN (
    SELECT `token_id` 
    FROM `tokens`
    WHERE `word` LIKE '%$query%'
  )

The subquery first finds all `token_id`‘s that match the query and then the main SELECT finds all filenames that have that `token_id` linked to them. Obviously, this restricts the query to simple OR queries. This is because the text matching is done in one word scope instead of a longer string (using the above filename as an example, a single word can’t match both ‘cat’ and ‘cute’).

Here is how you would do a query like [cat OR dog]:

SELECT `filename` 
  FROM `files` 
  JOIN `link_file_token` USING `file_id` 
  WHERE `token_id` IN (
    SELECT `token_id` 
    FROM `tokens`
    WHERE `word` LIKE '%cat%' OR `word` LIKE '%dog%'
  )

To perform an AND in the search query you need a query like this:

SELECT `filename` 
  FROM `files` 
  JOIN `link_file_token` ON `files`.`file_id` = `link_file_token`.`file_id`
    AND `token_id` IN (
      SELECT `token_id` 
      FROM `tokens`
      WHERE `word` LIKE '%kitten%'
    )
  JOIN `link_file_token` ON `files`.`file_id` = `link_file_token`.`file_id`
    AND `token_id` IN (
      SELECT `token_id` 
      FROM `tokens`
      WHERE `word` LIKE '%cute%'
    )

In the above example, using a natural JOIN handles the removal of all rows that only match one of the two searched words. Don’t forget that you can do the same in many ways. For example, you can use INTERSECT (do mind though that if you intersect two result lists of strings, it might be much more resource demanding than intersecting two result lists of integers).

Evaluation order

Keep in mind that if you combine the above AND and OR queries (by adding a OR `word` LIKE… for each OR’d word), it will not work as expected. Boolean algebra states the correct order to evaluate the two operators is AND first and then OR (the relation is similar to multiplication and addition, which may help you remember that). Above, it will work the opposite, the OR is evaluated before the AND.

If you restructure the query so that it, for example, returns indexes to the files table, you can implement correct evaluation order by adding a UNION for each OR term. In effect, you first use the above query to find each pair of terms AND’d and then combine all those queries with UNION.

For the query [cat cute OR dog drooly], the resulting SQL query will look something like this:

SELECT `filename` FROM `files` JOIN ... -- cat AND cute
UNION
SELECT `filename` FROM `files` JOIN ... -- dog AND drooly

Further ideas

To implement a NEAR operator, or when it is important that the searched words are in a certain order (in short strings like filenames it’s usually not at all necessary), you could add a field in the file-token links that tell the index of the word in the original string. When you have searched for the results using the above method, you then score the results based on the distance between the tokens.

E.g. if you are looking for [cute kitten], and a result has the text ‘Here is a cute dog. It is not a kitten’, the indexes for the words ‘cute’ and ‘kitten’ will be 4 and 10. Thus, the result should have a worse score than some other piece of text with the exact word order.

Notes

In Viewer2, the exact implementation doesn’t use subqueries like described above. Instead, I first generate a longer query that has the needed `token_id`’s and other stuff filled in (a comma separated list of integers in place of the SELECT subquery). You might consider this to gain some more speed but do experiment before you optimize.

Take note that even SQLite has a similar feature called FTS2. In theory, it should provide most if not all of the above. My example is still useful if you need more control over how the text is tokenized and compared (and I think in the case of FTS2 enabling full text searching requires double the storage, if you don’t want to store all the text data in the full text search table).

As always, research before you start doing something you think someone probably already did.

kometbomb

Programming and stuff.