25 Apr

Blog Experiment: More SQL and Text Matching

In Viewer2, there is a feature that allows fast text searches on huge databases with tens of thousands of filenames (and associated photo captions and other metadata). Here’s how I did it.

The simple approach

The easiest way to find all filenames matching $query (the string we are looking for) looks a bit like this. The following article talks about filenames, but you can use the same approach to search for any text:

SELECT `filename` FROM `files` WHERE `filename` LIKE '%$query%'

Similarly, you can perform multi-word searches by adding one test for each word you have extracted from the query string. You can also very easily extend the syntax to include exclusions. Below is the equivalent SQL query for the query string [kitten -jpg] (filenames with the text ‘kitten’ but not the string ‘jpg’):

SELECT `filename` 
  FROM `files` 
  WHERE `filename` LIKE '%kitten%' 
    AND `filename` NOT LIKE '%jpg%'

The above works well, but it can be very, very slow because it has to go through every row in the `files` table to find matching ones. Speed can be improved with queries like LIKE 'word%', but this limits the searches and in the end doesn’t give much extra speed.

And if you are performing fuzzy searches…

Using a token-based approach

A lot of speed can be gained by splitting the filenames and the searched text into tokens and then performing the search using the tokens. I ended up using the following structure (SQLite syntax):

CREATE TABLE `files` (
  `file_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `filename` TEXT
);

CREATE TABLE `tokens` (
  `token_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `word` TEXT UNIQUE
);

CREATE TABLE `link_file_token` (
  `token_id` INT, -- Foreign key to the table `tokens`
  `file_id` INT   -- Foreign key to the table `files`
);

The idea is that each word in the filename can be represented by a single id pointing to the `tokens` table. Thus, you can search by comparing integers instead of comparing text. Text comparison is needed only when generating the tokens (which is extremely helpful if you are using a comparatively expensive comparison).
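One thing the schema above leaves out is indexes. For the joins and token lookups to actually be fast, you will almost certainly want to index the link table; this is my assumption rather than part of the original schema (SQLite syntax):

CREATE INDEX `idx_link_token` ON `link_file_token` (`token_id`);
CREATE INDEX `idx_link_file`  ON `link_file_token` (`file_id`);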

First, when you insert a file into the `files` table, you split the filename into words and link the file to the tokens (which may already exist in the `tokens` table). Later, when you need to perform a search, you do the same but use the generated tokens to search for matching links.

The filename ‘Cute kitten picture.jpg’ can be split into four tokens, which results in four links between the token table and the file table. I also split words when the case between characters changes (this splits filenames like ‘CuteKittenPicture.jpg’), but this is up to you. In my implementation the tokens are also case-insensitive.
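As a rough sketch, inserting that file could look something like this (SQLite syntax; the splitting and lower-casing happen in the application, and the statements below assume the filename is unique in `files`):

INSERT INTO `files` (`filename`) VALUES ('Cute kitten picture.jpg');

-- The UNIQUE constraint on `word` makes these no-ops for words that already exist
INSERT OR IGNORE INTO `tokens` (`word`) VALUES ('cute');
INSERT OR IGNORE INTO `tokens` (`word`) VALUES ('kitten');
INSERT OR IGNORE INTO `tokens` (`word`) VALUES ('picture');
INSERT OR IGNORE INTO `tokens` (`word`) VALUES ('jpg');

-- Link the file to each of its tokens
INSERT INTO `link_file_token` (`token_id`, `file_id`)
  SELECT `token_id`,
         (SELECT `file_id` FROM `files` WHERE `filename` = 'Cute kitten picture.jpg')
    FROM `tokens`
   WHERE `word` IN ('cute', 'kitten', 'picture', 'jpg');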

Searching using tokens

Here is a simple example query that finds matching files using the approach described above:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (
    SELECT `token_id`
    FROM `tokens`
    WHERE `word` LIKE '%$query%'
  )

The subquery first finds all `token_id`‘s that match the query, and then the main SELECT finds all filenames that have one of those `token_id`‘s linked to them. Obviously, this restricts the query to simple OR queries. This is because the text matching is done within the scope of one word instead of a longer string (using the above filename as an example, no single token can match both ‘cute’ and ‘kitten’).

Here is how you would do a query like [cat OR dog]:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (
    SELECT `token_id`
    FROM `tokens`
    WHERE `word` LIKE '%cat%' OR `word` LIKE '%dog%'
  )

To perform an AND search, you need a query like this:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` AS t1 ON `files`.`file_id` = t1.`file_id`
    AND t1.`token_id` IN (
      SELECT `token_id`
      FROM `tokens`
      WHERE `word` LIKE '%kitten%'
    )
  JOIN `link_file_token` AS t2 ON `files`.`file_id` = t2.`file_id`
    AND t2.`token_id` IN (
      SELECT `token_id`
      FROM `tokens`
      WHERE `word` LIKE '%cute%'
    )

In the above example, the two JOINs handle the removal of all rows that only match one of the two searched words (note that the link table has to be joined twice, under two different aliases). Don’t forget that you can do the same in many ways. For example, you can use INTERSECT (do mind, though, that intersecting two result lists of strings can be much more resource demanding than intersecting two result lists of integers).
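For the record, an INTERSECT version working on integer ids could look roughly like this (my sketch; you would then look the filenames up from `files`):

SELECT `file_id` FROM `link_file_token`
  WHERE `token_id` IN (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%kitten%')
INTERSECT
SELECT `file_id` FROM `link_file_token`
  WHERE `token_id` IN (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%cute%')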

Evaluation order

Keep in mind that if you combine the above AND and OR queries (by adding an OR `word` LIKE… for each OR’d word), it will not work as expected. Boolean algebra states that the correct order is to evaluate AND first and then OR (the relation is similar to multiplication and addition, which may help you remember it). Above, it works the opposite way: the OR is evaluated before the AND.

If you restructure the query so that it, for example, returns ids into the `files` table, you can implement the correct evaluation order by adding a UNION for each OR term. In effect, you first use the above query to find each group of AND’d terms and then combine all those queries with UNION.

For the query [cat cute OR dog drooly], the resulting SQL query will look something like this:

SELECT `filename` FROM `files` JOIN ... -- cat AND cute
UNION
SELECT `filename` FROM `files` JOIN ... -- dog AND drooly

Further ideas

To implement a NEAR operator, or when it is important that the searched words are in a certain order (in short strings like filenames this is usually not at all necessary), you could add a field to the file-token links that tells the position of the word in the original string. Once you have found the results using the above method, you then score them based on the distance between the tokens.

E.g. if you are looking for [cute kitten], and a result has the text ‘Here is a cute dog. It is not a kitten’, the positions of the words ‘cute’ and ‘kitten’ will be 4 and 10. Thus, that result should get a worse score than a piece of text where the words appear right next to each other and in the exact order.
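A minimal sketch of the scoring query, assuming the link table has gained an extra `position` column (the column name and the ABS() scoring are my own choices, not the Viewer2 implementation):

-- Smaller distance = the words are closer together = better score
SELECT a.`file_id`, ABS(a.`position` - b.`position`) AS `distance`
  FROM `link_file_token` AS a
  JOIN `link_file_token` AS b ON a.`file_id` = b.`file_id`
  WHERE a.`token_id` = (SELECT `token_id` FROM `tokens` WHERE `word` = 'cute')
    AND b.`token_id` = (SELECT `token_id` FROM `tokens` WHERE `word` = 'kitten')
  ORDER BY `distance`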

Notes

In Viewer2, the exact implementation doesn’t use subqueries as described above. Instead, I first generate a longer query that has the needed `token_id`’s and other details filled in (a comma-separated list of integers in place of the SELECT subquery). You might consider this to gain some more speed, but do experiment before you optimize.
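So the query that actually gets executed ends up looking something like this (the ids here are made up for the example):

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (3, 17, 42) -- resolved beforehand with a separate query on `tokens`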

Take note that even SQLite has a similar feature, called FTS2. In theory, it should provide most if not all of the above. My example is still useful if you need more control over how the text is tokenized and compared (and I think FTS2 roughly doubles the storage requirements, unless you are willing to keep all the text data only in the full text search table).
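For comparison, the FTS route looks roughly like this, if I have the FTS2 syntax right:

CREATE VIRTUAL TABLE files_fts USING fts2(filename);
INSERT INTO files_fts (filename) VALUES ('Cute kitten picture.jpg');
SELECT filename FROM files_fts WHERE filename MATCH 'kitten';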

As always, research before you start doing something you think someone probably already did.

20 Apr

Blog Experiment: Comparing text and SQL

A lot of visitors come from search engines to my page about a little hack I made that acts as a very thin layer between text files and SQLite. Most of the search queries are something akin to “compare text files in SQL“, “SQL text compare” or other SQL and text comparison related searches. So, I’m going to write what I know about the subject.

Comparing text in SQL

Most implementations of SQL can do text comparison in a variety of ways. It is safe to assume any two text strings can be compared with the equal operator and the LIKE operator. The difference between the two basic operators is that the equal operator is usually case-sensitive by default (see below before assuming this is always true) while LIKE is case-insensitive. LIKE also has a nice extra feature, wildcards, with which you can check whether a string contains a shorter substring.

1   SELECT `name` FROM `table` WHERE `name`='Joe Doe';
2   SELECT `name` FROM `table` WHERE `name` LIKE 'Joe Doe';
3   SELECT `name` FROM `table` WHERE `name` LIKE '% Doe';

All the above queries return the rows in which the field `name` (1) is exactly ‘Joe Doe’, not e.g. ‘joe DoE’, (2) the same as (1) but case-insensitive and finally (3) ends with the string ‘ Doe’ (case-insensitive).

As you can see, the percent sign acts as a wildcard that matches any sequence of characters – much like the asterisk (*) when talking about e.g. file paths. The underscore (_) matches any single character, which is comparable to what the question mark usually does with filenames and paths. The wildcard characters can differ between implementations.

The case-sensitivity can be changed. For example, the equal operator can be made case-insensitive if the collation is set to a case-insensitive one. Similarly, the LIKE operator can be made case-sensitive. See your SQL implementation’s manual for how to do this.
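To give one concrete example, SQLite has a switch for each direction (other databases use their own mechanisms, usually collations):

-- Make the equal operator case-insensitive for this comparison
SELECT `name` FROM `table` WHERE `name` = 'joe doe' COLLATE NOCASE;

-- Make LIKE case-sensitive for the rest of the connection
PRAGMA case_sensitive_like = ON;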

When it comes to speed, you should always prefer the equal operator over LIKE. In cases like the query below it is impossible to benefit from an index:

SELECT `name` FROM `table` WHERE `name` LIKE '%Doe%'

Why? Because the substring ‘Doe’ can be anywhere in the compared string. However, most SQL implementations can gain speed from indexes when you have a query something like this:

SELECT `name` FROM `table` WHERE `name` LIKE 'Joe%'

You can think of the speed-up as something similar to finding a word in a dictionary. Because the entries are in a known order, you don’t need to look through every page: quickly flipping to roughly the right page (words starting with ‘Joe’) and then finding the word on that page is a lot faster. Again, I’m sure different databases have different limitations.

Advanced comparison

One very nice feature of MySQL is that it has a special kind of comparison built in: regular expressions. The REGEXP operator is used like the LIKE operator, but the string after the operator is interpreted as a regular expression.

SELECT `name` FROM `table` WHERE `name` REGEXP '^[JM]oe Doe$'

The above example matches names ‘Joe Doe’ and ‘Moe Doe’. Regular expressions of course can be much more complex than that but they are useful when you need just a little bit more functionality than what LIKE can offer.

Fuzzy comparison

Another feature that comes to mind is PostgreSQL’s pg_trgm module, which allows fuzzy text comparison based on trigrams (i.e. a word is divided into sets of three letters; ‘car’ would result in the trigrams ‘  c’, ‘ ca’, ‘car’ and ‘ar ’, padding spaces included). As far as I remember, it comes with the basic installation so it should be simple to set up. Basically, you can then compare strings that are not exactly alike but still similar to within a certain threshold.

SELECT "name" FROM "table" WHERE similarity('Joe Doe', "name") > 0.5

The above would match all names that are relatively similar (more than 50 percent) to ‘Joe Doe’. For example, ‘John Doe’ would most likely match.

And yet again, MySQL and other DBs may have the same kind of feature. I also intentionally left out the phonetic (“sounds like”) comparison features that are very often built in (AFAIK MySQL and PHP have those included by default). MySQL’s “sounds like” operator, quite logically, is SOUNDS LIKE.
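If you want to try the MySQL one, it really is that simple (same hypothetical table as above):

SELECT `name` FROM `table` WHERE `name` SOUNDS LIKE 'Joe Doe';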

Text files and SQL

A lot of search queries are looking for something that combines text files and SQL. As far as I know, there is no SQL engine that writes and reads raw text files as the tables (other than my hack mentioned in the first paragraph).

However, at least based on my own toying with the idea, it can be quite useful. Bundle that with the fuzzy and regexp comparison from the bigger databases and it’s a killer idea right there.

Dumping tables into text files

Most databases come with a tool that can dump (or export) the database tables into SQL statements that can then be run elsewhere to duplicate the structure and the data. It is quite safe to say most bigger databases have their own command line tools for this; there are also visual database tools such as phpMyAdmin for MySQL and of course numerous desktop applications.
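With MySQL, for example, the bundled command line tools handle both directions (the user, database and table names below are placeholders):

mysqldump -u username -p mydatabase mytable > mytable.sql
mysql -u username -p mydatabase < mytable.sql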

I use SQLyog for MySQL tinkering. In SQLyog it is as simple as selecting the table (or database) and then clicking DB/Table > Export as SQL Statements, and soon you have a text file (or XML file or…). Doing it backwards is as simple as copy-pasting the text file contents and running the statements.

Checking if two tables are identical

Here’s a method for comparing two tables for identical data: inner join both tables (which returns only the matching rows) and check whether the number of rows equals the total number of rows in one of the tables. The idea is that every field to be checked is used as a join key against the other table.

SELECT COUNT(*)/(SELECT COUNT(*) FROM `myFriends`) as `equality` 
  FROM `myFriends` as a, `yourFriends` as b
  WHERE a.`name`=b.`name` AND a.`phoneNumber`=b.`phoneNumber`

Column `equality` in the query result is the ratio of matching rows to the total row count (1 meaning every row in `myFriends` has a match in `yourFriends`). The above doesn’t care about the order of the rows, which may or may not be what you want (unless you include something like an auto-incrementing key in the comparison).

There, another almost useful page added to the Internets. I hope the above will be helpful at least to some random visitors.

11 Nov

Compare text files in SQL

Here’s a quick hack that allows loading text files as tables in an SQLite database. Why? It’s pretty nice to compare files in SQL.

txtsql can be run as follows: txtsql.exe file.txt file2.txt ...

Every text file is loaded as a table with two columns: row (line number) and data (the actual text). When a table is saved with save and its first column is named row, that column is skipped. Thus, every loaded table and every table created with new will be saved as a normal text file. Any other table will be saved with tab characters separating the cells.

The program is just a test to see if this is a viable concept (post a comment if you think it is). If it seems to be something worth developing further, I’ll add things like transparent loading and saving of text files as tables when they are referenced (for now you have to load them explicitly with the load command), common text operations (fuzzy searching etc.) and such.

Examples

Count lines:

load `file.txt`
select count(*) from `file.txt`

Save unique lines in a new file:

load `file.txt`
new `uniq.txt`
insert into `uniq.txt` (data) select distinct data from `file.txt`
save `uniq.txt`

Find similar lines:

load `1.txt`
load `2.txt`
select data from `1.txt` where data in (select data from `2.txt`)

txtsql.zip – Binaries (Win32) and source code