Data loading through Talend ETL Studio

In a couple of my earlier posts, I talked about loading data pulled from search engines into staging tables of a data warehouse.  You can load data in a couple of ways, including database bulk-load tools - for example, MySQL's "LOAD DATA INFILE" statement - or ETL tools like Talend Open Studio and similar products (Pentaho, Informatica, etc.)
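For comparison, a plain bulk load of the same kind of tab-separated file with MySQL alone might look like this (the file path and table name here are illustrative, not from the actual job):

```sql
-- Bulk load a tab-separated file into a staging table.
-- Path and table name are examples only.
LOAD DATA INFILE '/tmp/keyword_performance.txt'
INTO TABLE stg_keyword_performance
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```

This is fast, but everything else in the pipeline (parsing, sorting, filtering) then has to be handled in SQL or in scripts, which is where an ETL tool earns its keep.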

In this article, I will be loading a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes.  The setup used was:
  MySQL 5.1 (Community version)
  Talend Open Studio (3.2.2)
  Perl 5.10.1


Other posts that may be of interest:
1. URL parsing in Python, Perl, MySQL - an example
2. Google Adwords Data Retriever
3. Google Analytics Data Retriever


The data flows through 6 components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput), starting with reading the input data file, which is transformed into another output file and then loaded into the database.  Component #3 (the sorting component) is not strictly needed, but is shown here to stress the ease with which a developer can quickly put together an ETL process.  The Palette tab on the right-hand side houses these components and many more.  If you use Java instead of Perl, you will likely have more components to choose from; for a list of all available components, take a look at Talend Forge.

The input record format is tab-separated fields with one record per line, and so is the output file written by component 4.
# Typical fields in the file are date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url, etc.
# 2009-12-12   9876543210   34987650   1147382910   1234567890   0.07   0.226632   247.98   15444   http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html
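Outside Talend, splitting such a record into fields is a one-liner.  A minimal Python sketch (the sample line is shortened from the record above):

```python
# Split one tab-separated record into its fields.
line = '2009-12-12\t9876543210\t34987650\t1147382910\t1234567890\t0.07'
fields = line.rstrip('\n').split('\t')
print(fields[0])  # first field is the date: 2009-12-12
```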

I provide the following regular expression to parse each URL and extract the domain name in component 2.
     /(https?:\/\/)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?/ig

This parses any correctly formatted URL, like http://www.amazon.com or https://amazon.com or http://www.amazon.co.uk or amazon.com, and more.
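A quick way to sanity-check the pattern outside Talend is to run an equivalent regex in Python against a few sample URLs (the test URLs are the examples above; group 3 carries the domain name):

```python
import re

# Equivalent of the Perl pattern used in component 2 (tPerlRow);
# group 3 captures the domain name without the TLD.
pattern = re.compile(
    r'(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?',
    re.IGNORECASE)

for url in ('http://www.amazon.com', 'https://amazon.com',
            'http://www.amazon.co.uk', 'amazon.com'):
    m = pattern.search(url)
    print(m.group(3))  # prints "amazon" for each of the four URLs
```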

Component 3 sorts the input stream coming from the regexp output, and component 4 dumps it into an output file, which is then read by the filter (component 5) to filter out any record that doesn't have the primary key before loading into the staging database.   You can also load into the database without the primary key constraint to further speed things up, and clean up the duplicates before pushing into either dimension or fact tables.
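The filter step boils down to a simple predicate on each record.  A Python sketch under assumed field positions (here, date and keyword_id in columns 0 and 4 are treated as the primary key; the sample records are illustrative):

```python
# Keep only records that carry the primary-key fields
# (assumed here to be date and keyword_id, columns 0 and 4).
def has_primary_key(fields):
    return len(fields) > 4 and fields[0] != '' and fields[4] != ''

records = [
    ['2009-12-12', '9876543210', '34987650', '1147382910', '1234567890'],
    ['2009-12-13', '9876543210', '34987650', '1147382910', ''],  # no keyword_id
]
clean = [r for r in records if has_primary_key(r)]
print(len(clean))  # 1 - the record missing its keyword_id is dropped
```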

The whole run took a little over a minute (~63 seconds) on a test system running all three (MySQL, Talend, Perl).  For daily loads, you can make use of a scheduler to kick off the job and automate the loading process.
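For example, after exporting the Talend job as a script, a crontab entry along these lines would run it every night (the paths and script name are illustrative):

```
# Run the exported Talend job at 02:00 every day
0 2 * * * /opt/talend/jobs/keyword_load/keyword_load_run.sh >> /var/log/keyword_load.log 2>&1
```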






 Hope that helps,
Shiva

1 comment:

  1. Regarding your regex: underscores are not allowed in host names in DNS (RFC 2782 explains why). Also, by looking at it, it seems that expression will not match a domain like example.uk.com.