Email parsing with regular expressions

Emails always pose interesting challenges to load and handle for doing any kind of analysis with them - analysing number of users from email providers like Yahoo, Google, Hotmail, etc. or finding top country specific domains or users with specific email pattern.

RFC 822, 2822 (April, 2001) and 5322 (October,2008) define the specification and syntax for text messages exchanged between computers.  The RFCs cover not just the email address specifications but also the envelop and contents.  Most of applications adhere to subset of these specs with more stringent email addresses.  For example, though address specs allow for ALPHA (a-z, A-Z), DIGITS (0-9) and special characters like "!", "#", "$", "%", etc. and even more special characters with quoted strings.  But typically email service providers won't allow you to create an email address with special characters like "$" or "!", etc.  For more spec details see section 3.4 and 3.4.1 in rfc5322RFC 3696 explains in much easier terms email addresses, URI and HTTP URL.

One can load email addresses in to staging area of data warehouse (DW) and then perform validation and cleaning through ETL's in-built utilities or components.  But to handle many of different possibilities and ability to quickly update or modify any new format encountered, it is better to validate and clean it through application programs before loading into staging or final tables.  In this way the load can also be distributed through file split and multi-threading or processes running on different systems.

Following regular expression pulled from regular-expressions.info handles all most all email addresses (more than 99.9%) and you can tweak to it suit your needs for performance or handling specific emails only. For example, you may not want to allow any email that has special characters like #, %, $, {, }, /, *, etc.  I have explained below in detail how this RE parses an email.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b

Same as above with color coding for further explanation:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b

Address mainly has two parts - one before (part 1) "@" and one after it (part 2).

Part 1:
  1.  [a-z0-9!#$%&'*+/=?^_`{|}~-]  -- A set of characters i.e., between square brackets [ ], and in this case all letters "a to z" (either upper case or lower case) and numerals "0 to 9" and special characters "!#$%&'*+/=?^_`{|}~-" are allowed
  2. --  One or more times of the above character set. Meaning "joe" or "!j!" or "abc" are all valid i.e.,  Alphabets and ! are repeated one or more times. But Part1.1 will not match "(abc)" or "joe<>" because "(", ")", "<", ">" are not in the above character set.
  3.  (  -- Allow grouping also start remembering the matched string.  To avoid storing use "?:" right after "(".
  4.  ?: -- Don't need to remember or store what is matched.  When grouping is done through "(", it is also remembered in variables $1, $2, $3, etc. With "?:" following "(" indicates that not to store it. This will help with the performance but if you want the matched string to use for further processing, then you can remove "?:" from above regular expression.
  5. \. --  It indicates "." can appear in email after bullet 1 above but "." has significance for the regular expression (RE) itself. In RE "." means any single character so we need to escape it and pass it "." down to parsing engine.
  6. [a-z0-9!#$%&'*+/=?^_`{|}~-]  -- See 1 above.
  7.  +  -- See 2 above.  Also, if a "." appears then there must be at least one of characters in 6.
  8. *  -- Means zero or more times. In this case all characters in step 5, 6 & 7 can be optional. That is "joe.a" is valid but not "joe."
  9.  @ -- "@" sign is a must in the email address.  It should appear once.
Part 2:

     Part 2A: (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
  1. + -- All with the outer parenthesis must appear at least once.  It enforcing rule that after "@" sign in email, there must be at least one charcter.
  2. ?: --  Don't need to remember or store what is matched.  Helps in performance when dealing with millions of email parsing.
  3.  [a-z0-9] -- First character after the "@" must be a alpha-numeral (a,b,c...z or 1,2,...9)
  4.  (?:[a-z0-9-]*[a-z0-9])-- "?" at the end indicates 0 or 1 time of alpha-numeral characters.  That is, after a first character in step 3 next characters must be alpha-numerals only but they are optional.
  5. \. -- See step 5 in part 1.
     

     Part 2B:  (?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)
  1.   [A-Z]{2} -- Any two characters after the final ".". This covers country top level domains (TLD) like "uk" (for United Kingdom), "in" (India), "ca"(Canada), etc.  See this list of standard country codes.
  2.  | -- Is a "OR" operator.  It matches "com" or "org" or "museum", etc.
And final "\b" at the end is a word boundary anchor. 

During this validation and clean up you can also add length check on the email part 1 and 2.  RFCs specify of length 64 characters (octet) for part 1 (before @ character) and 255 characters (octet) after @ symbol.  These implementations can effectively handle more than 99.999% of emails.

Let me know if you have different regular expression or different way of parsing emails.

HTH,
Shiva

No comments:

Post a Comment