Data and Technology: December 2012

Last few months I started woking in Greenplum/ postgres environment and was pleasantly surprised to see the regular expression capabilities. Compared to Mysql (5.1) it has quite advanced RE features. For example I ended up using the following query to strip the decimal numbers from another string.

Input strings has the format "BucketN_N[ -N | + ]" where N is decimal number. For example, "Bucket1_1" or "Bucket7_21-50" or "Bucket5_100+" are all valid and output should extract bucket number (first N), minium value (second N) and maximum value of the bucket (third optional N).

The query

SELECT
  bucket_string
  ,substring(qty_bucket, 7, 1)::integer as bucket_num
  ,regexp_replace(bucket_string, '^Bucket\\d{1,}_(\\d{1,})([-+].*)?', E'\\1')::integer as minimum_qty
  ,regexp_replace(regexp_replace(bucket_string, '.*-(\\d{1,})', E'\\1'),  '.*_(\\d{1,})([+]?)$', E'\\1')::integer
  as maximum_qty
FROM a_table
ORDER BY bucket_string

Though I could have used other functions like substring and other, it was fun using regular expression in smallest code. regexp_replace is the workhorse taking a string and replacing it with matched string (\\1 or $1). "^ and $" are anchors to beginning and end of string, \\d matches with decimal numbers and {1,} means 1 to many quantifier.

And more with regexp_matches and array selection. I wanted to select offset number of days from 2012 December 01 in this simplified example. The offset is a text in another external table of Greenplum and data is not always clean and could have some text before the offset.

SELECT d
  regexp_matches(d, '(?:.*_)?(\\d+)$')    re_ary
 , '2012-12-01'::date+((regexp_matches(d, '(?:.*_)?(\\d+)$'))[1] || 'days')::interval   new_d
FROM
(
    SELECT '90' d
    union
    SELECT 'new_90' d
    union
    SELECT 'old_90_10' d
) tt

returns

d         re_ary      new_d
-------------------------------------
90         {90}      2013-03-01 00:00:00
new_90     {90}      2013-03-01 00:00:00
new_90_10  {10}      2012-12-11 00:00:00

With "?:" is non-capturing group and this matched text is optional meaning any text before string ending number is not-captured for later use and original string may have it but not all the time. '$' anchors it to the end of string with regexp_matches returning an array of single element and subscript [1] selecting it.

Github weekend issue - Dec.23, 2012

Greenplum Postgres Regexp fun