20 Apr

Blog Experiment: Comparing text and SQL

A lot of visitors come from search engines to my page about a little hack I made that acts as a very thin layer between text files and SQLite. Most of the search queries are something akin to “compare text files in SQL“, “SQL text compare” or other SQL and text comparison related searches. So, I’m going to write what I know about the subject.

Comparing text in SQL

Most implementations of SQL can do text comparison in a variety of ways. It is safe to assume any two text strings can be compared with the equal operator and the LIKE operator. The difference between the two basic operators is that the equal operator usually is case-sensitive by default (see below before assuming this is always true) while LIKE is case-insensitive. LIKE also has a nice feature: wildcards. You can check if a string contains a shorter substring.

1   SELECT `name` FROM `table` WHERE `name`='Joe Doe';
2   SELECT `name` FROM `table` WHERE `name` LIKE 'Joe Doe';
3   SELECT `name` FROM `table` WHERE `name`='% Doe';

All the above queries return the rows in which the field `name` (1) is exactly ‘Joe Doe’, not e.g. ‘joe DoE’, (2) the same as (1) but case-insensitive and finally (3) ends with the string ‘% Doe’ (case-insensitive).

As you can see, the percent string acts as a wildcard which can be substituted with any sequence of characters – much like the asterisk (*) when talking about e.g. file paths. The underscore (_) character can be replaced by any one character, this is comparable to what question mark usually does with filenames and paths. The wildcard characters can be different between implementations.

The case-sensitivity can be changed. For example, the equal operator can be case-insensitive if the collation is set to a case-insensitive method. Similarly, the LIKE operator can be set to be-case sensitive. See your SQL implementation manual how to do this.

When it comes to speed, you should always use the equal operator before LIKE. In cases like the below query it is impossible to benefit from an index:

SELECT `name` FROM `table` WHERE `name` LIKE '%Doe%'

Why? Because the substring ‘Doe’ can be anywhere in the compared string. However, most SQL implementations can gain speed from indexes when you have a query something like this:

SELECT `name` FROM `table` WHERE `name` LIKE 'Joe%'

You can think of the speed-up as something similar to trying to find a word in a dictionary. If you have a table of contents, you don’t need to look through every page: quickly flipping pages to a known page (words starting with ‘Joe’) and then finding the word on the page is a lot faster. Again, I’m sure different databases have different limitations.

Advanced comparison

One very nice feature of MySQL is that it has a special kind of comparison built in: regular expressions. The REGEXP operator is used like the LIKE operator but the string after the operator is — a regular expression.

SELECT `name` FROM `table` WHERE `name` REGEXP '^[JM]oe Doe$'

The above example matches names ‘Joe Doe’ and ‘Moe Doe’. Regular expressions of course can be much more complex than that but they are useful when you need just a little bit more functionality than what LIKE can offer.

Fuzzy comparison

Another feature that comes in my mind is PostgreSQL’s pg_trgm module that allows fuzzy text comparison based on trigrams (i.e. word is divided into sets of three letters, ‘car’ would result in trigrams ‘ ca’, ‘car’ and ‘ar ‘ — with the spaces). As far as I remember, it comes with the basic installation so it should be simple to install. Basically, you can then compare strings that are not exactly like each other but within a certain percentage.

SELECT `name` FROM `table` WHERE similarity('Joe Doe',`name`)>0.5

The above would match all names that are relatively similar (more than 50 percent) to ‘Joe Doe’. For example, ‘John Doe’ would most likely match.

And yet again, MySQL and other DB’s could have the same feature. And I intentionally left out the phonetic (“sounds like”) comparison features that are very often built-in (AFAIK MySQL and PHP should have those included by default). This article knows MySQL’s “sounds like” operator is quite logically SOUNDS LIKE.

Text files and SQL

A lot of search queries are looking for something that combines text files and SQL. As far as I know, there is no SQL engine that writes and reads raw text files as the tables (other than my hack mentioned in the first paragraph).

However, at least what I have tooled with the idea, it can be quite useful. Bundle that with the fuzzy and regexp comparison from the bigger databases and it’s a killer idea right there.

Dumping tables into text files

Most databases come with a tool that can dump (or export) the database tables into SQL statements that can then be used on another table to duplicate the structure and the data. It is quite safe to say most bigger databases have their own command line tools that can do this, a visual database tool such as phpMyAdmin for MySQL and of course numerous desktop applications for that.

I use SQLyog for MySQL tinkering. In SQLyog it is as simple as selecting the table (or database) and then clicking DB/Table > Export as SQL Statements and soon you have a text file (or XML file or…). And doing it backwards is as simple as copypasting the text file contents and running the statements.

Checking if two tables are identical

Here’s a method for comparing two tables for identical data is to inner join both tables (returns only matching rows) and checking if the number of rows is equal to the total number of rows in a table. The idea is that we use every field to be checked as a foreign key to the other table.

SELECT COUNT(*)/(SELECT COUNT(*) FROM `myFriends`) as `equality` 
  FROM `myFriends` as a, `yourFriends` as b
  WHERE a.`name`=b.`name` AND a.`phoneNumber`=b.`phoneNumber`

Column `equality` in the query result is a percentage of how many matching lines there are compared to the total row count. The above won’t care about the order of the rows which may or may not be useful (unless you include something like an auto incrementing key in the comparison).

There, another almost useful page added to the Internets. I hope the above will be helpful at least to some random visitors.