25 Apr

Blog Experiment: More SQL and Text Matching

In Viewer2, there is a feature that allows fast text searches on huge databases with tens of thousands of filenames (and associated photo captions and other metadata). Here’s how I did it.

The simple approach

The easiest way to find all filenames matching $query (the string we are looking for) goes a bit like this (the article talks about filenames, but you can use the same approach to search any text):

SELECT `filename` FROM `files` WHERE `filename` LIKE '%$query%'

Similarly, you could perform multi-word searches by adding one test for each word you have extracted from the query string. And you can very easily extend the syntax to include exclusions. Below is the equivalent SQL query for the query string [kitten -jpg] (filenames with the text ‘kitten’ but not the string ‘jpg’):

SELECT `filename` 
  FROM `files` 
  WHERE `filename` LIKE '%kitten%' 
    AND `filename` NOT LIKE '%jpg%'

The above works well but it can be very, very slow, because it has to go through all entries in the table `files` to find matching rows. Speed can be improved with prefix patterns like LIKE 'word%' (which can use an index), but this limits the searches and in the end doesn’t give too much extra speed.

And if you are performing fuzzy searches…

Using a token-based approach

A lot of speed can be gained by splitting the filenames and the searched text into tokens and then performing the search using the tokens. I ended up using the following structure (SQLite syntax):

CREATE TABLE `files` (
  `file_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `filename` TEXT
);

CREATE TABLE `tokens` (
  `token_id` INTEGER PRIMARY KEY AUTOINCREMENT,
  `word` TEXT UNIQUE
);

CREATE TABLE `link_file_token` (
  `token_id` INT, -- Foreign key to the table `tokens`
  `file_id` INT   -- Foreign key to the table `files`
);

The idea is that each word in the filename can be represented by a single index into the `tokens` table. Thus, you can search by comparing a single integer instead of comparing text. You have to compare text only when generating the tokens (which is extremely helpful if you are using a comparatively expensive comparison).

First, when you insert a file into the `files` table, you split the filename into words and link the file to the tokens (which may already exist in the `tokens` table). Later, when you need to perform a search, you tokenize the query string the same way and use the generated tokens to find matching links.

The filename ‘Cute kitten picture.jpg’ can be split into four tokens, which results in four links between the token table and the file table. I also split words when the case between characters changes (this splits filenames like ‘CuteKittenPicture.jpg’) but this is up to you. In my implementation the tokens are also case-insensitive.
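To make the bookkeeping concrete, here is a minimal sketch of inserting that example file (SQLite syntax, assuming the tokenizer has already produced the lowercase tokens; INSERT OR IGNORE leans on the UNIQUE constraint on `word`):

INSERT INTO `files` (`filename`) VALUES ('Cute kitten picture.jpg');

-- Add the tokens that aren't already in the table.
INSERT OR IGNORE INTO `tokens` (`word`)
  VALUES ('cute'), ('kitten'), ('picture'), ('jpg');

-- Link the file to each of its tokens.
INSERT INTO `link_file_token` (`token_id`, `file_id`)
  SELECT `token_id`,
         (SELECT `file_id` FROM `files`
          WHERE `filename` = 'Cute kitten picture.jpg')
    FROM `tokens`
   WHERE `word` IN ('cute', 'kitten', 'picture', 'jpg');

In a real program you would probably capture last_insert_rowid() after the first INSERT instead of re-querying by filename.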

Searching using tokens

Here is a simple example query that finds matching files using the approach described above:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (
    SELECT `token_id`
    FROM `tokens`
    WHERE `word` LIKE '%$query%'
  )

The subquery first finds all `token_id`s that match the query, and then the main SELECT finds all filenames that have one of those `token_id`s linked to them. Obviously, this restricts the query to simple OR queries, because the text matching is done within the scope of a single word instead of a longer string (using the above filename as an example, a single word can’t match both ‘cat’ and ‘cute’).

Here is how you would do a query like [cat OR dog]:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (
    SELECT `token_id`
    FROM `tokens`
    WHERE `word` LIKE '%cat%' OR `word` LIKE '%dog%'
  )

To perform an AND in the search query you need a query like this:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` AS `lt1` ON `files`.`file_id` = `lt1`.`file_id`
    AND `lt1`.`token_id` IN (
      SELECT `token_id`
      FROM `tokens`
      WHERE `word` LIKE '%kitten%'
    )
  JOIN `link_file_token` AS `lt2` ON `files`.`file_id` = `lt2`.`file_id`
    AND `lt2`.`token_id` IN (
      SELECT `token_id`
      FROM `tokens`
      WHERE `word` LIKE '%cute%'
    )

In the above example, the doubled JOIN handles the removal of all rows that match only one of the two searched words. Don’t forget that you can do the same in many ways. For example, you can use INTERSECT (do mind, though, that intersecting two result lists of strings can be much more resource demanding than intersecting two result lists of integers).
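For example, the same [cute kitten] AND query can be written by intersecting the integer `file_id` lists; a minimal sketch:

SELECT `filename`
  FROM `files`
  WHERE `file_id` IN (
    SELECT `file_id`
      FROM `link_file_token`
      WHERE `token_id` IN
        (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%kitten%')
    INTERSECT
    SELECT `file_id`
      FROM `link_file_token`
      WHERE `token_id` IN
        (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%cute%')
  )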

Evaluation order

Keep in mind that if you combine the above AND and OR queries (by adding an OR `word` LIKE… for each OR’d word), it will not work as expected. By convention, AND is evaluated before OR (the relation is similar to multiplication and addition, which may help you remember it). The query above works the opposite way: the OR is evaluated before the AND.

If you restructure the query so that it, for example, returns indexes into the `files` table, you can implement the correct evaluation order by adding a UNION for each OR term. In effect, you first use the above query to find each group of AND’d terms and then combine all those queries with UNION.

For the query [cat cute OR dog drooly], the resulting SQL query will look something like this:

SELECT `filename` FROM `files` JOIN ... -- cat AND cute
UNION
SELECT `filename` FROM `files` JOIN ... -- dog AND drooly
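Written out in full, using the aliased joins from the AND example, it would look something like this:

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` AS `lt1` ON `files`.`file_id` = `lt1`.`file_id`
    AND `lt1`.`token_id` IN
      (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%cat%')
  JOIN `link_file_token` AS `lt2` ON `files`.`file_id` = `lt2`.`file_id`
    AND `lt2`.`token_id` IN
      (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%cute%')
UNION
SELECT `filename`
  FROM `files`
  JOIN `link_file_token` AS `lt1` ON `files`.`file_id` = `lt1`.`file_id`
    AND `lt1`.`token_id` IN
      (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%dog%')
  JOIN `link_file_token` AS `lt2` ON `files`.`file_id` = `lt2`.`file_id`
    AND `lt2`.`token_id` IN
      (SELECT `token_id` FROM `tokens` WHERE `word` LIKE '%drooly%')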

Further ideas

To implement a NEAR operator, or when it is important that the searched words appear in a certain order (in short strings like filenames this is usually not at all necessary), you could add a field to the file-token links that tells the index of the word within the original string. When you have found the results using the above method, you then score them based on the distance between the tokens.

E.g. if you are looking for [cute kitten] and a result has the text ‘Here is a cute dog. It is not a kitten’, the indexes for the words ‘cute’ and ‘kitten’ will be 4 and 10. Thus, the result should get a worse score than a text where the two words are adjacent and in the exact order.
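A minimal sketch of the extended link table (the `position` column is my hypothetical name; the distance scoring itself is easiest to do in the application after fetching the matches):

CREATE TABLE `link_file_token` (
  `token_id` INT, -- Foreign key to the table `tokens`
  `file_id`  INT, -- Foreign key to the table `files`
  `position` INT  -- 1-based index of the word within the original string
);

The score for a pair of terms could then be, for example, the absolute difference of their two positions.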

Notes

In Viewer2, the exact implementation doesn’t use subqueries as described above. Instead, I first generate a longer query that has the needed `token_id`s and other stuff filled in (a comma-separated list of integers in place of the SELECT subquery). You might consider this to gain some more speed, but do experiment before you optimize.
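For illustration, the pre-resolved variant might look like this (the integer ids are made-up values):

SELECT `filename`
  FROM `files`
  JOIN `link_file_token` USING (`file_id`)
  WHERE `token_id` IN (3, 17, 42)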

Take note that even SQLite has a similar feature, called FTS2. In theory, it should provide most if not all of the above. My example is still useful if you need more control over how the text is tokenized and compared (and, as far as I remember, FTS2 roughly doubles the storage requirements unless you are willing to keep all the text data only in the full-text search table).

As always, research before you start doing something you think someone probably already did.

22 Apr

Blog Experiment: Desaturating and resizing images

More from the muse called Referer:.

Resizing images

Resizing an image is an interesting subject. There are many ways to do it, and very different methods can produce the same result. Quickly summarized, I’d say resizing an image can be categorized like this:

  • Enlarging
    • Resampling
      • Nearest neighbor (blocky)
      • Linear interpolation (blurry)
      • Spline interpolation (e.g. cubic, Lanczos etc.; blurry, sometimes exaggerates contrast and makes edges stand out too much)
    • Super-resolution (“CSI” style zooming)
  • Shrinking
    • Resampling
      • Nearest neighbor (blocky)
      • Linear interpolation (average)
      • Spline interpolation (sometimes better than averaging, especially when the resolution isn’t an integer fraction of the original, e.g. halved or divided by three)
    • Re-targeting (“content aware resizing”, keeps important details unchanged)

Now, for usual resizing there are many libraries that do it for you. There is no point recreating the same algorithms over and over, considering you will most likely use either bi-linear interpolation, a spline algorithm or some special algorithm. When using SDL, I have used SDL_gfx to resize images both larger and smaller. It does the most basic interpolation; for slight resizing it should be good enough.
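With SDL_gfx the whole job is a single zoomSurface() call; a minimal sketch (the half-size factor is just an example):

#include <SDL/SDL.h>
#include <SDL/SDL_rotozoom.h> /* from the SDL_gfx package */

/* Return a new surface scaled to half size with interpolation enabled.
   The caller owns the result and must SDL_FreeSurface() it. */
SDL_Surface *shrink_to_half(SDL_Surface *src)
{
    return zoomSurface(src, 0.5, 0.5, SMOOTHING_ON);
}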

[Image: demonstration of the 2xSaI interpolation, with nearest-neighbor scaling on the left for comparison.]

Another free piece of code for resizing is the Super Eagle algorithm, which was specifically created for scaling emulator (Snes9x) graphics up: it doesn’t blur edges and works quite well at removing “jaggy” lines. It should be available for [your preferred graphics library]; it has been adapted to so many projects that it is very probable there is an implementation floating around for you to steal.

As for commercial implementations of super-resolution, this looks pretty awesome. As does this online bitmap-to-vector image converter (vector images do not suffer from the loss of sharpness when enlarged).

Do keep in mind that you can also use the video card of your computer to scale images: most (pretty much all) computers have some 3D features, which means they can be used to quickly draw an image on the screen in any dimensions you like, even if you don’t need interactivity. I use this method in Viewer2 and I found it easy to implement. You can shrink images the same way: usually a graphics library will have a function that generates mipmaps, which give very nice results when the drawn image is smaller than the original.
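As a sketch of the mipmap route (classic OpenGL 1.x-era API; this assumes a texture is already created and bound, and the usual filtering parameters are omitted):

#include <GL/glu.h>

/* Upload an RGBA image with a full mipmap chain so the card filters
   nicely when the image is drawn smaller than the original. */
void upload_with_mipmaps(const unsigned char *rgba, int width, int height)
{
    gluBuild2DMipmaps(GL_TEXTURE_2D, GL_RGBA, width, height,
                      GL_RGBA, GL_UNSIGNED_BYTE, rgba);
}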

And now, the reason why I started writing these (lame) blog entries: image re-targeting, or seam carving. My implementation works after a fashion (mainly with cartoon-style images with as few color gradients as possible) but there are many other implementations that are good.

Desaturating (greyscaling) images

Desaturating (i.e. making color images black and white) is a subject that interests a portion of the people who find their way to my site. While this is a trivial problem to solve in any programming language, there actually are some quirks to it. Here goes.

This is the easiest (and probably fastest) method:

Grey = (R + G + B) / 3;

However, when speed is not important or if you simply are more pedantic than you are lazy, a more accurate mixing ratio is as follows:

Grey = 0.3 * R + 0.59 * G + 0.11 * B;

This is because the human eye isn’t equally sensitive to all three color components. While those three ratios are the most often quoted, each eye probably needs a slightly tweaked mix. But the main idea is that we are able to see green much better than red, and red a bit better than blue.

Lossy image compression uses this fact to dedicate more bits to the colors the eye is more sensitive to. Also, back when 24-bit modes weren’t quite yet available, the 15 and 16-bit modes often dedicated fewer bits to the blue component.

Worth noting is that if you want the correct algorithm to be fast, you will probably need to ditch the floating point operations and do the conversion with integers a bit like this:

Grey = (77 * R + 151 * G + 28 * B) / 256;

In any proper language and with a decent compiler, the above should compile to code that uses a bit shift instead of a division (2^8 = 256; shifting right by 8 bits is the same as dividing by 256). Though, if you do that, make sure your variables are wide enough so the multiplications won’t overflow.
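A minimal sketch of the integer version over a pixel buffer (the packed RGBA byte layout is an assumption; adjust the channel offsets to your library):

#include <stddef.h>
#include <stdint.h>

/* Desaturate a packed RGBA8888 buffer in place using the integer
   weights above. The weighted sum reaches 256 * 255 = 65280, which
   would overflow a signed 16-bit type, so use a 32-bit accumulator. */
void desaturate_rgba(uint8_t *p, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++, p += 4) {
        uint32_t grey = (77u * p[0] + 151u * p[1] + 28u * p[2]) >> 8;
        p[0] = p[1] = p[2] = (uint8_t)grey;
    }
}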

Do keep in mind that any modern processor has vector processing capabilities, i.e. you can desaturate multiple pixels in one operation. In that case it could even be possible to do the conversion quite fast with the first formula that uses floating point. Then again, if you are able to use such features, you probably didn’t need to read this.

20 Apr

Blog Experiment: Comparing text and SQL

A lot of visitors come from search engines to my page about a little hack I made that acts as a very thin layer between text files and SQLite. Most of the search queries are something akin to “compare text files in SQL”, “SQL text compare” or other SQL and text comparison related searches. So, I’m going to write what I know about the subject.

Comparing text in SQL

Most implementations of SQL can do text comparison in a variety of ways. It is safe to assume any two text strings can be compared with the equal operator and the LIKE operator. The difference between the two is that the equal operator is usually case-sensitive by default (see below before assuming this is always true) while LIKE is case-insensitive. LIKE also has a nice feature, wildcards: you can check if a string contains a shorter substring.

1   SELECT `name` FROM `table` WHERE `name`='Joe Doe';
2   SELECT `name` FROM `table` WHERE `name` LIKE 'Joe Doe';
3   SELECT `name` FROM `table` WHERE `name` LIKE '% Doe';

All the above queries return the rows in which the field `name` (1) is exactly ‘Joe Doe’, not e.g. ‘joe DoE’, (2) is the same as (1) but case-insensitive and finally (3) ends with the string ‘ Doe’ (case-insensitive).

As you can see, the percent sign acts as a wildcard which can be substituted with any sequence of characters, much like the asterisk (*) when talking about e.g. file paths. The underscore (_) character can be replaced by any one character; this is comparable to what the question mark usually does with filenames and paths. The wildcard characters can be different between implementations.

The case-sensitivity can be changed. For example, the equal operator can be made case-insensitive by setting a case-insensitive collation. Similarly, the LIKE operator can be set to be case-sensitive. See your SQL implementation’s manual for how to do this.
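For example, in SQLite a single comparison can be made case-insensitive with a COLLATE clause (a minimal sketch; other databases spell this differently):

SELECT `name` FROM `table` WHERE `name` = 'joe doe' COLLATE NOCASE;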

When it comes to speed, you should always prefer the equal operator over LIKE. And in cases like the query below, it is impossible to benefit from an index:

SELECT `name` FROM `table` WHERE `name` LIKE '%Doe%'

Why? Because the substring ‘Doe’ can be anywhere in the compared string. However, most SQL implementations can gain speed from indexes when you have a query something like this:

SELECT `name` FROM `table` WHERE `name` LIKE 'Joe%'

You can think of the speed-up as something similar to finding a word in a dictionary. Because the words are in a known order, you don’t need to look through every page: quickly flipping to roughly the right page (words starting with ‘Joe’) and then finding the word on that page is a lot faster. Again, I’m sure different databases have different limitations.

Advanced comparison

One very nice feature of MySQL is that it has a special kind of comparison built in: regular expressions. The REGEXP operator is used like the LIKE operator, but the string after the operator is, as the name implies, a regular expression.

SELECT `name` FROM `table` WHERE `name` REGEXP '^[JM]oe Doe$'

The above example matches names ‘Joe Doe’ and ‘Moe Doe’. Regular expressions of course can be much more complex than that but they are useful when you need just a little bit more functionality than what LIKE can offer.

Fuzzy comparison

Another feature that comes to my mind is PostgreSQL’s pg_trgm module, which allows fuzzy text comparison based on trigrams (the word is divided into sets of three letters; ‘car’ results in the trigrams ‘  c’, ‘ ca’, ‘car’ and ‘ar ’, with the padding spaces included). As far as I remember, it comes with the basic installation, so it should be simple to install. Basically, you can then compare strings that are not exactly like each other but within a certain similarity.

SELECT "name" FROM "table" WHERE similarity('Joe Doe', "name") > 0.5

The above would match all names that are relatively similar (more than 50 percent) to ‘Joe Doe’. For example, ‘John Doe’ would most likely match.

And yet again, MySQL and other databases could have the same feature. I intentionally left out the phonetic (“sounds like”) comparison features that are very often built in (AFAIK MySQL and PHP include those by default). This article says MySQL’s “sounds like” operator is, quite logically, SOUNDS LIKE.
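A minimal sketch of that MySQL operator (it is equivalent to comparing the two SOUNDEX() values):

SELECT `name` FROM `table` WHERE `name` SOUNDS LIKE 'Joe Doe';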

Text files and SQL

A lot of search queries are looking for something that combines text files and SQL. As far as I know, there is no SQL engine that writes and reads raw text files as the tables (other than my hack mentioned in the first paragraph).

However, at least from my own toying with the idea, it can be quite useful. Bundle that with the fuzzy and regexp comparisons from the bigger databases and it’s a killer idea right there.

Dumping tables into text files

Most databases come with a tool that can dump (or export) the database tables into SQL statements that can then be run elsewhere to duplicate the structure and the data. It is quite safe to say most bigger databases have their own command-line tools that can do this, plus visual database tools such as phpMyAdmin for MySQL and of course numerous desktop applications.

I use SQLyog for MySQL tinkering. In SQLyog it is as simple as selecting the table (or database) and then clicking DB/Table > Export as SQL Statements, and soon you have a text file (or XML file or…). Doing it backwards is as simple as copy-pasting the text file contents and running the statements.

Checking if two tables are identical

Here’s a method for comparing two tables for identical data: inner join both tables (which returns only the matching rows) and check whether the number of joined rows equals the total number of rows in one table. The idea is that every compared field acts as a foreign key into the other table.

SELECT COUNT(*)/(SELECT COUNT(*) FROM `myFriends`) as `equality` 
  FROM `myFriends` as a, `yourFriends` as b
  WHERE a.`name`=b.`name` AND a.`phoneNumber`=b.`phoneNumber`

The column `equality` in the query result is the fraction of matching rows compared to the total row count. The above won’t care about the order of the rows, which may or may not be what you want (unless you include something like an auto-incrementing key in the comparison).

There, another almost useful page added to the Internets. I hope the above will be helpful at least to some random visitors.

18 Apr

A Blogging Experiment

I’m going to try something that is not entirely original but should still be useful. I have found that people often come across my page when looking for something with a search engine, yet the page they land on isn’t exactly what they were looking for. In other words, I’ll try to help random visitors find what they are looking for (or, you could argue, this is some kind of search engine optimization).

At least Syd Lexia frequently does this on his page, albeit humorously (even though he does often answer a simple question as well as it can be answered). My idea is to either write about the subjects as much as I know about them or simply point out how to find the needed information.

For example, people are often looking for image re-targeting when they click their way to my post about a crap example of the seam carving algorithm. There are much nicer examples out there. I’d like the random visitor to know that, so their time is not wasted skimming through my writings.

So, expect more useless information about things that interest the random visitor (not you).