Mapping URL to top level domain

When parsing web logs as part of web analytics, one might be interested in the top level domain analysis before drilling down further.  Following query maps an url (referer) to top level domain.  Top level domain include not only generic ones like '.com', '.edu', '.org', etc but also country level top levels like '.uk', '.de', etc.  You can download the list from IANA (Internet Assigned Numbers Authority) .

After downloading it import into a table dim_int_mktg_tld which you can join with the staging data.
CREATE TABLE `dim_int_mktg_tld` (
  `tld` varchar(16) NOT NULL,
  `type` varchar(32) default 'country-code',  #-- Can be country-code or generic
  `info` varchar(500) default NULL,
  PRIMARY KEY  (`tld`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

#--  Join with yesterdays staging data and get TLD
SELECT WL.url, TLD.tld
FROM stg_weblog_date WL
LEFT OUTER JOIN dim_int_mktg_tld TLD
ON substring_index(
     substring_index(
       substring_index(url, '://', -1),
       '/',1), 
    '.',-1) = TLD.tld
WHERE WL.date = DATE_ADD(CURRENT_DATE, interval -1 day)

#-- Example: http://www.abc.co.au/in/test
#-- Innermost substring_index returns everything to the right of '://' 
#--   i.e, www.abc.co.au/in/test,
#-- then the middle one gets everything to the left of first '/'
#--   i.e, www.abc.co.au
#-- then the outer most substring_index returns everything to the right of last '.'
#--   i.e, au
#-- which is then joined with TLD table

HTH,
Shiva

No comments:

Post a Comment