Data and Technology: Regular Expressions

MySQL regular expressions (RE) are a powerful tools that can be very useful in SQL string searches. They enable software engineer to build statements that are very concise and to handle complex string operations that otherwise wouldn't be possible.

If you are new to regular expressions or would like to know more about them, http://www.regular-expressions.info/ is a good site to visit. You can also get a RE tutorial at net|tuts+. Following is quick list of meta characters that can get you started in using them.

. => A dot matches single character

* => An asterisk matches zero or more of previous matched tokens

? => A question mark matches zero or one time the previous matched token

$ => A dollar at the end anchors the search to the end of string

^ => A caret symbol anchors the search to the beginning of the string

| => A pipe matches either of the two. Example: abc|xyz => either 'abc' or 'xyz'
{m,n} => A quantifier matching between 'm' and 'n' times. m & n are integers.

Different computer languages have some variations when it comes to more advanced searches and how they handle given character sets. MySQL uses REGEXP string function to implement and matches the string in case "in-sensitive" mode and to match otherwise see this blog.

It is important to note that MySQL REGEXP returns 1 or 0 depending on whether it finds the match in the pattern or not. If a match that your looking for in the pattern or the pattern itself is null, then it returns NULL. Other application languages may return an array or list or true/false. Thus use of REGEXP in where clause is more appropriate than in either SELECT or others.

For example:

SELECT 'AdWords' REGEXP 'A*W*';

returns 1

Trying the following on MySQL Sakila database:

SELECT * FROM country  WHERE country REGEXP '^U.*' ;

and it returns 'Ukraine, United Arab Emirates, United Kingdom & United States'; all the country names that begin with 'U' and followed by 0 (zero) or more characters.

For demonstration purpose, let us assume that we would like to get a name of the European explorer whose first name starts with 'M' and ends with 'O'. Also, first and last name are not normalized and stored in single column.

SELECT name FROM (
    SELECT  substring_index(name, ' ', 1) REGEXP '^M.*O' YN, name 
    FROM (
        SELECT 'Christopher Columbus' name 
        UNION 
        SELECT 'Marco Polo'  name 
        UNION 
        SELECT 'Vasco Da Gama'  name 
        UNION 
        SELECT  'Francisco Pizarro' name
    ) x
) y
WHERE YN = 1

'^M.*O' pattern matches any expression that begins (^) with 'F' and may have any character (.) zero or more times (*) and ending with 'D'. The result is Marco Polo.

There are also many special constructs one can use to handle group of specific characters or positions in the string, etc. Below are the ones I end up using more often than others.

[[:<:]]token[[:>:]]

Searches the token in the string that is neither preceded nor followed by any alpha-numerals or "_" (underscore). For example,

SELECT 'Here is a token in a string' REGEXP '[[:<:]]token[[:>:]]'

Returns 1 (true), where as the following ones return 0 (false)

SELECT 'Here is a _token in a string' REGEXP '[[:<:]]token[[:>:]]'
SELECT 'Here is a tokens in a string' REGEXP '[[:<:]]token[[:>:]]'

[:character_class:]

This would match any group of characters that belong to predefined character class. Few examples of classes are "alpha" - alpha characters, "digit" - digit characters, "space" - any space, tab, newline, etc.

SELECT column1, length(column1) from table
WHERE  column1 REGEXP '[[:space:]]' ;

will return only those column1 and its length if they have any space embedded in them - quite helpful when you want to identify that any space character has crept in when it is not suppose to.

REGEXP's complex searches come with a caveat of the way the search is done at the low levels - bits and bytes. Both RLIKE and REGEXP work in byte-wise fashion and can result in unexpected results when used on multi-byte strings. For any LATIN character set one will be safe to use them but for UNICODE need to be careful.

Hope that helps,

Shiva

Regular Expressions

No comments:

Post a Comment