In the example url http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get sub-domain name and file name parts i.e; www.hiregion.com and sem-google-aswords-api-data-retriever.html.
Other posts that may be of interest:
1. Regular expressions - An intro
2. 99-bottles of bear in many languages
import time import re t2 = time.time() for ii in range(1000000): #-- Parse it million times... m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)', 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html') #- Saved to m.groups t3 = time.time() print 'regex took:', t3 - t2, 'seconds'
2. Using Python urlparse module
import re from urlparse import urlparse t2 = time.time() for ii in range(1000000): #-- Parse a million times... parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html') file = re.sub('/\d{4}/\d{2}/', '',parsed.path) #- Saved to parsed.netloc and file t3 = time.time() print 'urlparse took:', t2 - t1, 'seconds'
3. Using Perl Regular Expression
Regular expression is simple and same as the one used above in Python.
#!/cygdrive/c/perl/bin/perl use strict; use Time::HiRes qw(gettimeofday tv_interval); my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.htm'; my $th1 = [gettimeofday()]; for my $ii (1..1000000) { #-- Parse a million times $str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#; #- Saved to $1 and $2 } my $th2 = [gettimeofday()]; print "tv_interval", tv_interval($th1), "\n";
If you would like to transform url after loading the data into staging tables (say, after the bulk load), the following one can be used.
4. Using MySQL (5.1 version)
SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html' ; PREPARE select_q FROM 'SELECT CASE WHEN ? REGEXP ''^http://'' then substring_index(?, ''/'', -1) ELSE NULL end file, CASE WHEN ? REGEXP ''^http://'' then substring_index(substring_index(?, "/", 3), ''/'', -1) ELSE NULL end sub_domain ' ; EXECUTE select_q USING @str, @str, @str, @str; # Output... # file sub_domain # sem-google-adwords-api-data-retriever.html www.hiregion.com
With Perl (5.10.1) and Python (2.6) running on the same system, perl took 0.999 seconds, Python regular expressions took 2.641 seconds and urlparse took 6.5 seconds.
Cheers,
Shiva
~
~
There are lots of information about latest technology, like Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. This information seems to be more unique and interesting. Thanks for sharing.
ReplyDeleteBig Data Training in Chennai | Hadoop Course in Chennai | Big Data Training in Chennai
Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.
ReplyDeleteHadoop Training in Chennai | Hadoop Training Chennai
Thanks for sharing Our team, who never fails to give the best of all time in best digital marketing company in Chennai. We creating, developing optimizing, analyzing, ranking, monitoring the traffic, hosting web services, etc. We care about the whole things, to be done for you.
ReplyDelete