URL parsing in Python, Perl, MySQL

There are many situations when one needs to parse the URL. It could be in web log parsing or Google Keyword or other performance data that is downloaded through APIs. Application languages like Python, Perl, PHP etc provide many modules to parse or you can use raw regular expression to parse the URLs. Below I have shown few ways of doing it with a simple example - either in applications before loading into staging area of data warehouse or parsing it in SQL (MySQL) after the load.

In the example url http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get sub-domain name and file name parts i.e; www.hiregion.com and sem-google-aswords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions - An intro
2. 99-bottles of bear in many languages


import time
import re




t2 = time.time()
for ii in range(1000000):  #-- Parse it million times...
    m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)', 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
    #- Saved to m.groups
t3 = time.time()
 
print 'regex took:', t3 - t2, 'seconds'





2. Using Python urlparse module

import re
from urlparse import urlparse

t2 = time.time()
for ii in range(1000000):   #-- Parse a million times...
   parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
   file = re.sub('/\d{4}/\d{2}/', '',parsed.path)  
   #- Saved to parsed.netloc and file

t3 = time.time()
print 'urlparse took:', t2 - t1, 'seconds'

3. Using Perl Regular Expression

Regular expression is simple and same as the one used above in Python.
#!/cygdrive/c/perl/bin/perl

use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.htm';
my $th1 = [gettimeofday()];

for my $ii (1..1000000) {   #-- Parse a million times
    $str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
    #- Saved to $1 and $2
}

my $th2 = [gettimeofday()];

print "tv_interval", tv_interval($th1), "\n";


If you would like to transform url after loading the data into staging tables (say, after the bulk load), the following one can be used.

4. Using MySQL (5.1 version)

SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html' ;

PREPARE select_q FROM
    'SELECT  CASE 
                WHEN ? REGEXP ''^http://'' then substring_index(?, ''/'', -1)  
             ELSE  NULL end file,
             CASE 
                WHEN ? REGEXP  ''^http://'' then substring_index(substring_index(?, "/", 3), ''/'', -1)
                ELSE NULL end sub_domain
     ' ;
EXECUTE select_q USING @str, @str, @str, @str;

# Output...
# file                                            sub_domain
# sem-google-adwords-api-data-retriever.html      www.hiregion.com

With Perl (5.10.1) and Python (2.6) running on the same system, perl took 0.999 seconds, Python regular expressions took 2.641 seconds and urlparse took 6.5 seconds.


Cheers,
Shiva

~
~

3 comments:

  1. There are lots of information about latest technology, like Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. This information seems to be more unique and interesting. Thanks for sharing.
    Big Data Training in Chennai | Hadoop Course in Chennai | Big Data Training in Chennai

    ReplyDelete
  2. Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.
    Hadoop Training in Chennai | Hadoop Training Chennai

    ReplyDelete
  3. Thanks for sharing Our team, who never fails to give the best of all time in best digital marketing company in Chennai. We creating, developing optimizing, analyzing, ranking, monitoring the traffic, hosting web services, etc. We care about the whole things, to be done for you.

    ReplyDelete