MySQL Connection over SSH

When the MySQL server is set up to accept only localhost connections through bind_address in /etc/mysql/mysql.cnf (configuration)
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address          = 127.0.0.1

or when you want to connect to MySQL as "root" and use GUI clients like Workbench or Toad, you can use SSH tunneling.

Operating systems that provide built-in ssh support make this a little easier. Mac OS is one such OS, and I will use it as the example below.

Assuming you have already set up an SSH connection, run the following command from a local terminal. If you have not set up SSH, see this brief introductory ssh article, or run > man ssh for more details.
prompt>  ssh -L 3307:localhost:3306  user@remote.com
and then leave this terminal running.  Here I am using the ssh -L [bind_address:]port:host:hostport option and skipping bind_address, which is needed only when you have more than one IP attached (multi-IP) to your local system, which is typically not the case.

    3307 - The port that local clients (Workbench, for example) will use to connect to the remote MySQL.

    localhost:3306 - The remote end of the tunnel, as if you had ssh-ed to the remote host and connected to the MySQL instance running there on port 3306.  Had you ssh-ed to the remote host, at its prompt you would have run > mysql -h localhost -u user_name -P 3306

    user@remote.com - Remote user name and remote host

Now start Workbench and set up a new connection by providing the details as if you were connecting to a MySQL instance on your local machine running on port 3307. See figure 1.  ssh automatically tunnels your connection to the remote host's MySQL! See figure 2.  To end the tunnel, close the terminal where the "ssh -L" command is running (or kill the process); doing so also disconnects Workbench.

Figure 1.
Figure 2.
Note: The same technique can be used for other client applications to connect to any remote application/database over SSH.
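The same tunnel works for programmatic clients too. Below is a minimal sketch using Python with the mysql-connector-python package (an assumption - any MySQL client library will do), pointed at the local end of the tunnel started above; the credentials and database name are placeholders.

import mysql.connector

# Connect to the local end of the SSH tunnel (ssh -L 3307:localhost:3306 user@remote.com).
conn = mysql.connector.connect(
    host="127.0.0.1",    # local side of the tunnel
    port=3307,           # the forwarded port chosen in the ssh -L command
    user="user_name",    # placeholder credentials
    password="your_pw",
    database="test_db",
)
cur = conn.cursor()
cur.execute("SELECT VERSION()")
print(cur.fetchone()[0])     # prints the remote server's version
cur.close()
conn.close()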



Have fun,

Shiva

DW solution - IP mapping to Geo Locations

In my previous few posts on the blog, I covered how to convert IP addresses to corresponding integers and back. When joining Data Warehouse (DW) tables on IP addresses for location analysis, joining on an integer IP column is much faster than joining on a string column.  As noted in other posts, this article covers IPv4, but the technique can be extended to IPv6 addresses by using two BIGINTs.

For geo location info, you can use MaxMind's Geo-City data, which is provided as two entities - one mapping an IP block to a location ID and another mapping the location ID to city/state/country.  It is a one-to-many (1:M) relationship, and to locate an IP address one needs to find the matching IP block and then map it to a city or country.  There are posts that describe in detail how to do this efficiently for a given IP.  These techniques work well in an OLTP environment where you pass a known IP address from the UI or application layer to the database and get back its city/country.
  1. Fast MySQL Range Queries on MaxMind GeoIP Tables by Andy Skelton
  2. MaxMind GeoIP Install Setup Tutorial by Bartomedia
  3. Finding a user's country/region based on IP by Shlomi Noach
  4. On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole
Geo-City data has the following fields, separated by commas, with rows terminated by a newline (\n):
  • Location ID
  • Country Code
  • State/Region
  • City Name
  • Postal Code (US only)
  • Latitude
  • Longitude
  • Metro Code
  • Area Code
IP Block has:
  • Start IP Address
  • End IP Address
  • Location ID
While loading this dimensional data during the ETL process, you should also transform the IP addresses to integers as explained in "Converting an IP address to an integer".  The Geo-City dimension table will also have a partition range column (the reason for this post) and a surrogate key (id).  You should also perform other data validation such as checking for non-overlapping ranges, valid country codes, etc., which are all standard activities of a DW ETL process.  The time, resources and effort put into cleaning and validating the data before pushing it into the final fact tables pay off within a short period; fixing a data issue and reconciling the reports later in a DW/BI environment always leads to lower ROI.

You can also denormalize the above tables and create a single table.  In either case, joining on a range of IP addresses for a given set of IPs results in the same slow performance.

dim_geo_ip_city_location holds around 5.9 million rows with the following attributes:

CREATE TABLE dim_geo_ip_city_location (
    location_id int NOT NULL,
    start_ip varchar(16) NOT NULL,
    end_ip  varchar(16) NOT NULL,
    country char(2),
    region varchar(64), 
    city varchar(128),
    postal_code varchar(32),
    latitude decimal(8,4),
    longitude decimal(8,4),
    metro_code varchar(8),
    area_code varchar(8),
    ...
)ENGINE=...

CREATE TABLE dim_ip_block (
    start_ip_num bigint NOT NULL,
    end_ip_num bigint NOT NULL,
    location_id int NOT NULL,
        ...
)ENGINE=...

The MySQL (5.0) optimizer performs a range operation efficiently with a constant - that is, when you give the query a literal IP address, for example "WHERE ip_addr BETWEEN 1230 AND 1234" - but not with range conditions that have no constants.  For more info on range optimization see the MySQL documentation, and for indexing see "Mastering the Art of (MySQL) Indexing".

I did not find any article providing insights into doing DW/BI analysis on IP addresses.  In a DW environment the business user is interested in slicing and dicing hundreds of thousands to millions of IPs.  Just finding the locations of a few thousand IP addresses (see query 1 below) was taking more than an hour when joining a lookup table (a table similar to dim_geo_ip_city_location but without the partition column (ip_range_partition) and id) with the IP address table, even with appropriate indexes such as a composite index on (start_ip_num, end_ip_num) or separate indexes on start_ip_num and end_ip_num.

Query 1:
-- Simple query to select IP block
SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num
FROM ip_addresses IA use index(...)
JOIN dim_ip_block B use index(...)
WHERE IA.ip_num between B.start_ip_num and B.end_ip_num

This leads to MySQL performing full scans, and in the case of the ip_block table (B) it doesn't use any index at all. To avoid this, I partitioned the dimension (lookup) table into 1,000 parts; since this table has ~5.9 million rows, each partition ends up with ~5,900 rows.  For more on data set partitioning see its wiki.  The partitions should be numbered in ascending order with respect to the ascending range of IP blocks (start_ip_num).  Then create another table (dim_ip_range_partition) with start_ip_num and end_ip_num for each of these ranges, essentially creating super IP blocks.

CREATE TABLE dim_ip_range_partition (
     partition_id int NOT NULL AUTO_INCREMENT,
     start_ip_num bigint NOT NULL,
     end_ip_num bigint NOT NULL,
     ....
) ENGINE=InnoDB ;

Also create a primary key on partition_id and an index on (start_ip_num, end_ip_num) on this table.
The solution relies on two facts: the IP blocks do not overlap, and they are ordered in ascending order.

To populate the above table (a scripted sketch of steps 3-5 follows the SQL below),
  1. Add id (row_id) column to dim_ip_block
  2. Add partition_id integer column to dim_ip_block
  3. Update id from 1 to n in ascending order of start_ip_num (so the highest id corresponds to the highest start_ip_num)
  4. Partition "n" ids into P parts (n/P) where n > 1 and 1 < P < n
  5. Assign each part/partition ascending integer value
  6. Now update the above table dim_ip_range_partition P_TABLE from dim_ip_block B_TABLE, where P_TABLE partition_id ranges over 1 to P of B_TABLE, P_TABLE.start_ip_num is the lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num is the highest of the corresponding partition.  In other words...
         INSERT INTO dim_ip_range_partition
            (partition_id, start_ip_num, end_ip_num, ...)
         SELECT partition_id, min(start_ip_num), max(end_ip_num), ...
         FROM dim_ip_block
         GROUP BY partition_id
         ORDER BY partition_id asc
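Here is a rough scripted sketch (not the original ETL code) of steps 3-5: number the blocks in ascending start_ip_num order and assign each one a partition_id. The names are illustrative; 'blocks' stands in for rows of dim_ip_block as (start_ip_num, end_ip_num, location_id) tuples.

import math

def assign_partitions(blocks, num_partitions):
    # Step 3: order blocks by ascending start_ip_num and number them 1..n
    blocks = sorted(blocks, key=lambda b: b[0])
    # Steps 4 and 5: split the n ids into num_partitions parts (~n/P blocks each)
    per_part = math.ceil(len(blocks) / num_partitions)
    assigned = []
    for row_id, (start_ip_num, end_ip_num, location_id) in enumerate(blocks, start=1):
        partition_id = (row_id - 1) // per_part + 1   # 1..P, ascending with start_ip_num
        assigned.append((row_id, partition_id, start_ip_num, end_ip_num, location_id))
    return assigned

# e.g. assign_partitions(rows_from_dim_ip_block, 1000); update dim_ip_block with the
# returned (row_id, partition_id) pairs, then run the INSERT ... GROUP BY shown above.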

By doing so, we have created super IP blocks, each containing n/P blocks from the original IP blocks.   The following query is one way to extract city/country info for a given set of IP addresses.

SELECT t.ip_num, t.ip, B.start_ip_num, B.start_ip, B.end_ip_num, B.end_ip
       ,L.city, L.country
FROM 
(
   -- Get the super IP block (partition)
   SELECT C.ip_num, max(partition_id) partition_id, max(P.start_ip_num) start_ip_num
   FROM ip_addresses C use index(...)
   JOIN dim_ip_range_partition P 
   WHERE C.ip_num >= P.start_ip_num
   GROUP BY C.ip_num
) t
LEFT OUTER JOIN mktg_ip_block B use index(...)  
  ON  t.partition_id = B.partition_id     -- And join on this partition
  AND t.ip_num >= B.start_ip_num 
  AND t.ip_num <= B.end_ip_num
JOIN dim_geo_ip_city_location L          -- Get city/country
  ON B.location_id = L.location_id 

The query that would have taken more than an hour now returns within 30 seconds!  You can further optimize the above design with, say, more denormalized relations, or by tuning the number of partitions on both the lookup table and the looked-up table (a trade-off - at the extreme P = n, the partition table is as large as the original table, which makes performance worse again).  Or, by extending the partitioning concept to more levels (super super blocks, ...), you can reduce the number of joining rows further with a slightly more complex query.

The solution is a typical computer-science approach of handling a complex problem by dividing it into smaller ones and solving those.  Here it is applied specifically to MySQL for large data sets with range operations.

Cheers,
Shiva

Tools to convert IP address

The following tool converts an IP address to other formats. For more on IP address conversion, check out parts 1 and 2.


Converting an IP address to an integer and reverse (Part 1)
IP Address conversions (Part 2)


IP address converter (embedded widget, powered by WebToolHub.com): enter an IP such as 192.168.100.200 or its integer form 3232261320.



You can also check out the 2's complement of a number here. This comes in handy when one needs to check signed numbers (2's complement).

IP Address conversions

This is part 2 of converting an IP address to an integer and vice versa. For more details and a little introduction, see Part 1.

Converting an IP address to an integer and back is a little more straightforward in MySQL than in SQL Server, since SQL Server (2008) doesn't provide bit shift operators.

In short you can't do,
SELECT 1<<2 
nor
SELECT 3232261320 -- IP '192.168.100.200'
* (256*256*256*256*256*256) -- Left shift 6 bytes
/ (256*256*256*256*256*256*256) -- Right shift 7 bytes to get 2nd octet

One will encounter "Arithmetic overflow error converting expression to data type int". Even casting each integer to bigint will not solve the problem, as the multiplication overflows a signed bigint, and there is no unsigned bigint in SQL Server.

Another way to solve this in SQL Server is to use bitwise AND: retain only the byte that is of interest and then right shift the bits appropriately.
#-- IP '192.168.100.200'
SELECT (cast(3232261320 as bigint) & cast(4278190080 as bigint))/(256*256*256);
SELECT (cast(3232261320 as bigint) & 16711680)/(256*256);
SELECT (cast(3232261320 as bigint) & 65280)/(256);
SELECT (cast(3232261320 as bigint) & 255);

#- And the output will be
192
168
100
200

On systems with an unsigned BIGINT (MySQL, for example) you can get the first octet with a multiply-and-divide shift:

SELECT CAST(3232261320 AS UNSIGNED) * 256*256*256*256
       DIV (256*256*256*256*256*256*256) AS 1st_octet;

Shifting far enough left to isolate the last octet would overflow even an unsigned BIGINT, so the simplest way to extract it is a modulo:

SELECT CAST(3232261320 AS UNSIGNED) MOD 256 AS 4th_octet;

They result in the outputs 192 and 200 respectively.

With a given IP, you can simply run the following to get its integer value:
mysql toad>  
SELECT ( ((substring_index(ip, '.',1))<<24) 
       + ((substring_index(substring_index(ip, '.', 2), '.', -1))<<16)
       + ((substring_index(substring_index(ip, '.', 3), '.', -1))<<8)
       + ((substring_index(substring_index(ip, '.', 4), '.', -1)<<0))
)
FROM
( SELECT '192.168.100.200' as ip
  UNION ALL
  SELECT ip FROM som_ip_table
) t;
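For comparison, the same octet extraction outside the database - a small Python sketch; Python integers do not overflow, so plain shifts and masks work for any IPv4 value.

ip_num = 3232261320                            # 192.168.100.200

octets = [(ip_num >> shift) & 255 for shift in (24, 16, 8, 0)]
print(octets)                                  # [192, 168, 100, 200]
print(".".join(str(o) for o in octets))        # 192.168.100.200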

Cheers,
Shiva

Converting an IP address to an integer and reverse

In IP address analysis, one may be required to map IPs to geo locations. There are third-party applications and services that perform IP to city, state, country mapping, and businesses can make use of them, for example, to see where their web traffic is coming from.

This article covers IPv4 addresses, which use 32 bits; a similar technique can be applied to IPv6 addresses, which use 128 bits, by storing them in two BIGINTs (64 bits each).

It is easy to handle a reasonable load when querying for a few IPs or a few hundred IPs, either from a simple user interface where one or more IP addresses are entered, or through APIs. In these cases IP addresses can simply be matched in string format without much of a performance issue, but it becomes a bottleneck when dealing with millions of rows loaded through an ETL process into a Data Warehouse environment, where reports run against fact tables with hundreds of millions of rows. Under these circumstances, converting IPs to integers boosts performance substantially.

MySQL provides functions to convert an IP string to an integer and back: inet_aton converts a string to a number (integer) and inet_ntoa converts a number back to the dotted-quad string.

inet_aton and inet_ntoa:

mysql> select inet_aton('192.168.100.200');
+------------------------------+
| inet_aton('192.168.100.200') |
+------------------------------+
|                   3232261320 |
+------------------------------+

mysql> select inet_ntoa(3232261320);
+-----------------------+
| inet_ntoa(3232261320) |
+-----------------------+
| 192.168.100.200       |
+-----------------------+
1 row in set (0.00 sec)
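The same conversions can be done outside MySQL as well; here is a small sketch using Python's standard library.

import socket
import struct

def ip_to_int(ip):
    # '192.168.100.200' -> 3232261320 (equivalent of INET_ATON)
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def int_to_ip(num):
    # 3232261320 -> '192.168.100.200' (equivalent of INET_NTOA)
    return socket.inet_ntoa(struct.pack("!I", num))

print(ip_to_int("192.168.100.200"))   # 3232261320
print(int_to_ip(3232261320))          # 192.168.100.200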

NFS Cache - Invisible file issue

I ran into the following situation; it took some time to diagnose the issue, and help from a couple of folks on the DBA and operations teams to resolve it.  Here is what happened.

In an application, I exported a large data set from MySQL to a file, say /dir_1/dir_2/exported_file.txt.  After the file was exported, the application went on to consume the file by reading its content.  Since MySQL OUTFILE (exporting data) doesn't overwrite a file if the file name already exists, the code would rename any existing file to *.bak.   See below for pseudo code.

If OUTFILE exists
    Move or Rename OUTFILE to OUTFILE.bak    /* Step 1 */
Run MySQL export to OUTFILE    /*  Step 2 */
Check the error code
Read OUTFILE and parse               /* Step 3 */
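A bare-bones Python sketch of the same flow (paths, database and table names are illustrative only, and error handling is omitted):

import os
import subprocess

OUTFILE = "/dir_1/dir_2/exported_file.txt"

if os.path.exists(OUTFILE):                    # Step 1: move old export aside
    os.rename(OUTFILE, OUTFILE + ".bak")

# Step 2: export; note that INTO OUTFILE writes the file on the MySQL server host
sql = "SELECT * FROM some_table INTO OUTFILE '%s'" % OUTFILE
if subprocess.call(["mysql", "-e", sql, "some_db"]) != 0:
    raise RuntimeError("export failed")

with open(OUTFILE) as fh:                      # Step 3: read and parse
    for line in fh:
        pass                                   # parse each exported row here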

When I ran the application, it would sometimes create the output file and go on to parse it correctly, but many times it would fail in step 1, throwing an error like "file already exists" when in fact the file was not there - I had removed it with 'rm -f' before rerunning the program.  Other times it would fail in step 3, indicating that the file did not exist even though SQL had exported it successfully in step 2.  I even added sleep time between each step, ranging from 5 to 60 seconds, but continued to see the same random behavior.

After spending some time trying to diagnose what might be going on, I ended up debugging NFS caching.  The directory /dir_1 was a mounted file system with NFS attribute caching set to a few hundred seconds.  When the application wrote to the NFS directory, the write cache was updated but not the OS directory structure (inode). Reducing the attribute cache timeout (actimeo) to a lower number, say 30 seconds, alleviates the delay. If sys admins are reluctant to change the settings of the existing mount, you can get a new mount point with actimeo set (to 30, say).   Once these changes were made, the application ran smoothly with its sleep set a little higher than the actimeo timings.  Note that using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value; there is no default value. See the man pages for more details.




Cheers,
Shiva

Hourly web log analysis through Hadoop

Many a time one may want to parse the web logs to do some quick analysis of AB tests, security/fraud alerts, or recent advertisement or campaign performance. There are many applications and utilities that perform web log analysis, but more often than not regular expressions provide a powerful and elegant way to analyse these logs, and they are especially handy when one is dealing with massive, quickly rotating web logs.   Check out this wiki for more general info on web analytics.

When each web log is multiple gigabytes and moved to archive every couple of hours, and there is a farm of web servers (hundreds or thousands of them), many of the vendor or 3rd party applications don't scale either. Hadoop streaming with simple utilities can provide insights into what otherwise would have been a costly experiment.

Here is the kind of regular expression that can be used to extract the HOUR and the string of interest from Apache web logs (a sketch follows the sample lines below). Each entry in the log has a format similar to the ones below.

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test1AB ....

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test-2CD ....
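Below is a sketch of the kind of expression that pulls the hour of day and the ab= test label out of lines like the two above, wrapped as a Python script usable as a Hadoop streaming mapper (the pattern and field choices are illustrative, not the original expression).

import re
import sys

LINE_RE = re.compile(r'^\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2}.*?\bab=([\w-]+)')

def main():
    for line in sys.stdin:
        m = LINE_RE.search(line)
        if m:
            hour, ab_test = m.groups()
            print("%s\t%s" % (hour, ab_test))   # e.g. "07<TAB>test1AB"

if __name__ == "__main__":
    main()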


For more details on Apache log configuration, take a look at its specs and custom logs.  Along with the time of day of the site hit, user IP address, HTTP request (GET or POST), page requested, and protocol used, one can configure the web server to log many more details including referer, user agent (browser), environment variables, etc.

Conversation Prism - An Image

As social media, social networking, advertising and Internet marketing continue to evolve with new technologies, and many companies create their own social groups, it all gets more complex and confusing.  Many a time a picture or an image explains something more elegantly than a thousand words, and in some cases an image is the best-suited tool to explain it.  Here is one image created by Brian Solis & Jesse Thomas that I like in this conversation!



Email parsing with regular expressions

Emails always pose interesting challenges when loading and handling them for any kind of analysis - analysing the number of users from email providers like Yahoo, Google, Hotmail, etc., finding top country-specific domains, or finding users with a specific email pattern.

RFC 822, RFC 2822 (April 2001) and RFC 5322 (October 2008) define the specification and syntax for text messages exchanged between computers.  The RFCs cover not just the email address specification but also the envelope and contents.  Most applications adhere to a subset of these specs with more stringent email addresses.  For example, the address spec allows ALPHA (a-z, A-Z), DIGITS (0-9) and special characters like "!", "#", "$", "%", etc., and even more special characters within quoted strings, but typically email service providers won't let you create an email address with special characters like "$" or "!".  For more spec details see sections 3.4 and 3.4.1 in RFC 5322; RFC 3696 explains email addresses, URIs and HTTP URLs in much easier terms.

One can load email addresses into the staging area of a data warehouse (DW) and then perform validation and cleaning through the ETL tool's built-in utilities or components.  But to handle the many different possibilities, and to be able to quickly update or modify handling for any new format encountered, it is better to validate and clean them through application programs before loading into staging or final tables.  That way the load can also be distributed through file splits and multiple threads or processes running on different systems.

The following regular expression, pulled from regular-expressions.info, handles almost all email addresses (more than 99.9%), and you can tweak it to suit your needs for performance or for handling specific emails only. For example, you may not want to allow any email that has special characters like #, %, $, {, }, /, *, etc.  I have explained below in detail how this RE parses an email.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b


The address has two main parts - one before the "@" (part 1) and one after it (part 2).

Part 1:
  1.  [a-z0-9!#$%&'*+/=?^_`{|}~-]  -- A set of characters i.e., between square brackets [ ], and in this case all letters "a to z" (either upper case or lower case) and numerals "0 to 9" and special characters "!#$%&'*+/=?^_`{|}~-" are allowed
  2. +  -- The above character set, one or more times. Meaning "joe" or "!j!" or "abc" are all valid, i.e., letters and "!" repeated one or more times. But part 1.1 will not match "(abc)" or "joe<>" because "(", ")", "<", ">" are not in the above character set.
  3.  (  -- Allow grouping also start remembering the matched string.  To avoid storing use "?:" right after "(".
  4.  ?: -- No need to remember or store what is matched.  When grouping is done with "(", the match is also remembered in variables $1, $2, $3, etc. A "?:" following "(" indicates that it should not be stored. This helps performance, but if you want to use the matched string for further processing, you can remove "?:" from the regular expression.
  5. \. --  Indicates that a "." can appear in the email after item 1 above, but "." has significance for the regular expression (RE) itself - in an RE, "." means any single character - so we need to escape it to pass a literal "." down to the parsing engine.
  6. [a-z0-9!#$%&'*+/=?^_`{|}~-]  -- See 1 above.
  7.  +  -- See 2 above.  Also, if a "." appears, then there must be at least one of the characters in 6 after it.
  8. *  -- Means zero or more times. In this case the whole group in steps 5, 6 & 7 is optional. That is, "joe.a" is valid but not "joe."
  9.  @ -- "@" sign is a must in the email address.  It should appear once.
Part 2:

     Part 2A: (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
  1. + -- Everything within the outer parentheses must appear at least once, enforcing the rule that after the "@" sign there must be at least one character.
  2. ?: --  Don't need to remember or store what is matched.  Helps in performance when dealing with millions of email parsing.
  3.  [a-z0-9] -- The first character after the "@" must be alphanumeric (a, b, c ... z or 0, 1, ... 9).
  4.  (?:[a-z0-9-]*[a-z0-9])? -- The "?" at the end indicates 0 or 1 occurrence of this group.  That is, after the first character in step 3, the next characters must be alphanumerics or hyphens ending in an alphanumeric, but they are optional.
  5. \. -- See step 5 in part 1.
     

     Part 2B:  (?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)
  1.   [A-Z]{2} -- Any two letters after the final "." (the expression is meant to be applied case-insensitively). This covers country top level domains (TLDs) like "uk" (United Kingdom), "in" (India), "ca" (Canada), etc.  See this list of standard country codes.
  2.  | -- The "OR" operator.  It matches "com" or "org" or "museum", etc.
And final "\b" at the end is a word boundary anchor. 

During this validation and cleanup you can also add length checks on email parts 1 and 2.  The RFCs specify a maximum length of 64 characters (octets) for part 1 (before the @ character) and 255 characters (octets) after the @ symbol.  These implementations can effectively handle more than 99.999% of emails.
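Here is a sketch of applying the expression above in Python, compiled case-insensitively, with the RFC length checks on the two parts added (the helper name is mine, not from any library):

import re

EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
    r"(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b",
    re.IGNORECASE,
)

def is_valid(email):
    if not EMAIL_RE.fullmatch(email):
        return False
    local, _, domain = email.partition("@")
    return len(local) <= 64 and len(domain) <= 255   # length limits from the RFCs

for addr in ("joe.user@example.com", "bad..address@example.com", "no-at-sign"):
    print(addr, is_valid(addr))   # True, False, False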

Let me know if you have different regular expression or different way of parsing emails.

HTH,
Shiva

Data Profiling - An example in Talend Profiler

Data is loaded into a Data Warehouse (DW) from disparate systems and sometimes from external partners who have their own coding standards.  In any case, the quality of the data loaded into the data warehouse is often variable, and especially while discovering the data one may need to load some sample data and do some analysis, including initial data profiling.  During this process one may discover differences which, when resolved, result in a much smoother data flow along the process.   Or it may be at a later stage, say after summarization is completed, that one needs to do some analysis on the type of data.  In all these cases data profiling helps, and Talend provides a tool (Talend Open Profiler - TOP) to perform profiling quickly and efficiently.

Data profiling - the process of examining available data in different data sources, including databases, applications, files, data transfers from external systems, etc., and collecting statistics and information about that data - leads to improved data quality and better reporting.

In our date dimension, we have nearly 220,000 rows covering Jan. 01, 1900 to Dec. 31, 2500 (about six hundred years of dates), and one of the columns is 'day_of_week_name' (cardinality 7 - MONDAY, TUESDAY, ...).  This table has 70 columns including date, weeks, months, names, etc. For testing purposes, I wanted to check the nulls and the pattern frequency (distribution) for the 'day_of_week_name' column.

To do so, select the column to profile (day_of_week_name), then drag and drop it into "Analyzed columns" of the "Analysis Settings" tab.  Then pick the indicators, i.e., how you want the column measured (count, range, stats, etc.); I picked row count and NULL count along with "Pattern Frequency Table".  Pattern frequency counts the different patterns. The results in the "Analysis Results" tab are shown below.

Logging queries for performance monitoring

Before making any performance improvements, one needs to measure first, whether improving computer applications or DW queries.  In one Data Warehouse subject area where I run more than 120K (120,000) queries every day in an ETL environment, I log all queries to a table and track them over time to see which queries are showing performance issues. With the following table, I have sometimes been able to improve performance by more than 50%.  For a query that runs repeatedly with different parameters in the WHERE clause, a small improvement adds up quickly; other times, due to data skew, you can clearly see what changes are needed.   And if there are mistakes in indexes (the DBA dropped one :) or an application user modified their query, etc., you will have a reference point to check against to see why a particular query or queries are slow.


Query Performance Monitoring table definition

Column           Null?   Type                  Comment
ID               No      int(11)               Auto Increment
RUN_NUMBER       No      tinyint               Useful when job is recovered or re-run
APPLICATION      No      varchar(128)          Application name to track
DOMAIN           Yes     varchar(128)          Application domain or subject area
QUERY            No      varchar(4096)         Actual query
DURATION         No      decimal(10,5)         How long did the query run?
ROW_COUNT        Yes     int                   Number of rows affected
RESULT           No      varchar(32)           Query result type - succeeded or failed?
COMMENT          Yes     varchar(128)          User comment that can be logged with each query
CREATED_DT       No      date                  Query run date
CREATED_DT_TM    No      timestamp/datetime    Query run date and time
CREATED_BY       Yes     varchar(64)           Query user name
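Here is a sketch of how such logging can be wired into application code. The table name (query_perf_log), connection handling and library (mysql-connector-python) are assumptions; the columns follow the definition above.

import time
import getpass
import mysql.connector

def run_and_log(conn, query, application, domain=None, run_number=1, comment=None):
    cur = conn.cursor()
    start = time.time()
    try:
        cur.execute(query)
        result, row_count = "succeeded", cur.rowcount   # rowcount may be -1 for unbuffered SELECTs
    except mysql.connector.Error:
        result, row_count = "failed", None
    duration = time.time() - start

    log = conn.cursor()
    log.execute(
        "INSERT INTO query_perf_log "
        "(run_number, application, domain, query, duration, row_count, "
        " result, comment, created_dt, created_dt_tm, created_by) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, CURRENT_DATE, NOW(), %s)",
        (run_number, application, domain, query[:4096], duration,
         row_count, result, comment, getpass.getuser()),
    )
    conn.commit()
    return cur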


Google Analytics Plugin - Very useful one

Google Analytics (GA) is invaluable when you need to slice and dice web traffic data along numerous dimensions - for example, by date, time or location, to name a few - measuring visits, visitors, pageviews, etc.  For the full list of possible combinations take a look at this documentation.

While many business users and SEO/SEM analysts use GA on a regular basis to manage their Internet Marketing spend and improve site traffic for better ROI, they can chart a single measure for different segments and analyse only one metric at a time.  Many a time users download their data from AdWords or get GA data through their in-house API setup.  With almost all business users having used Excel as one of their main analytical tools, they prefer it over other tools (either in-house or 3rd party).  Also, as quickly as things change in the Internet Marketing (IM) world, quick analysis that business users can do on their own allows them to respond rapidly to market changes.  You may find bullets 3 and 4 in this article from Avinash Kaushik relevant.

Special characters in table names

In our DW environment, we allow business users to upload data into new tables and run their own ad hoc queries when needed. The user interface allows a user to upload up to 10,000 (10K) rows in either comma- or tab-delimited format to a temporary database/table. The user can specify the table name they want the data uploaded to. Many savvy business users can thus run their own SQL queries, joining with the data in DW tables.

All this works fine, but many a time a user provides table names with non-alphanumeric characters including spaces, forward or backward slashes, dollar signs, etc. At some point the DBAs noted that managing these tables was becoming an issue, especially the table names with special characters. The following monitoring check helped alleviate or eliminate the problem, and the UI was also modified to check for any special characters and remove them before creating the users' temporary tables.

Since MySQL doesn't support regular expressions (RLIKE or REGEXP) in the SHOW TABLES syntax, one is forced to use UNIX tools or another language (awk, perl, python, etc.) for RE features.

The following pulls all table names with special characters including $, \, space, ?, ! and @. You can add other characters of interest to this set.
linux> mysql --host=hostname_or_ip  --database=db_name   --user=root  --password=your_pw  --execute="SHOW TABLES" | egrep '[$\\ ?!@]'

# This shows tables like (not including double quotes)
"non_alphanum_table_name test it 2"
"non_alphanum_table_name; test it"
"non_alphanum_table_name; test it $"
"non_alphanum_table_name; test it $ \"
"non_alphanum_table_name_test_$_!_2"

Getting MySQL DB size

Many a time you want to find the size of a database (schema) for performance monitoring, to allocate more disk space, or to take some proactive action on DB size.  You can pull the relevant information from MySQL INFORMATION_SCHEMA with the following query.  It sums up the data and index size of each table for all tables in a single schema.

SELECT s.schema_name,
  CONCAT(IFNULL(ROUND((SUM(t.data_length) +
       SUM(t.index_length))/1024/1024,2),0.00),"Mb") total_size,
  CONCAT(IFNULL(ROUND((SUM(t.data_length))/1024/1024,2),0.00),"Mb") data_size,
  CONCAT(IFNULL(ROUND((SUM(t.index_length))/1024/1024,2),0.00),"Mb") index_size,
  CONCAT(IFNULL(ROUND(((SUM(t.data_length)+SUM(t.index_length)) -
         SUM(t.data_free))/1024/1024,2),0.00),"Mb") data_used,
  CONCAT(IFNULL(ROUND(SUM(data_free)/1024/1024,2),0.00),"Mb") data_free,
  IFNULL(ROUND((((SUM(t.data_length)+SUM(t.index_length)) -
       SUM(t.data_free))/((SUM(t.data_length) + 
       SUM(t.index_length)))*100),2),0) pct_used,
  SUM(t.table_rows) total_rows,
  COUNT(t.table_name) total_tables,
  CONCAT(ROUND(MAX(t.data_length+t.index_length)/1024/1024, 0.00), 
      "Mb") biggest_table_size,
  GROUP_CONCAT(distinct t.engine) engines_used,
  GROUP_CONCAT(distinct t.table_collation) collations_used
FROM INFORMATION_SCHEMA.SCHEMATA s
  LEFT JOIN INFORMATION_SCHEMA.TABLES t 
ON s.schema_name = t.table_schema
WHERE s.schema_name = "test_dw_advertising"
GROUP BY s.schema_name\G


*************************** 1. row ***************************
       schema_name: test_dw_advertising
        total_size: 212151.14Mb
         data_size: 154660.83Mb
        index_size: 57490.31Mb
         data_used: 210039.45Mb
         data_free: 2111.69Mb
          pct_used: 99.00
        total_rows: 309940227
      total_tables: 118
biggest_table_size: 74360Mb
      engines_used: InnoDB,MyISAM
   collations_used: latin1_swedish_ci
1 row in set (1 min 38.26 sec)


The total allocated size (including free data size) is nearly 212G, out of which data uses ~155G and indexes use ~57G across 118 tables, leaving 2G free (99% of the allocated space is used). The biggest table uses about 74G of disk space, and this schema has both InnoDB and MyISAM tables using the single collation latin1_swedish_ci.

HTH,
Shiva

Talend - Oracle - Java Path error

A few weeks ago I installed Oracle 10g Express on one of our QA systems for testing; the installation was smooth and all went well.  But recently I needed to do some data profiling on a specific table in the MySQL DW environment, due to the fact that some queries were taking more than twice their normal run time.  Specifically, one query that used to take 5-6 minutes churning through a couple of million rows in a staging table started taking more than 15 minutes.  I suspected a data issue but needed to quickly run some tests to see how the recent data profiled against earlier weeks' data.

I downloaded the Talend Data Profiler to this new test machine (where I had Oracle 10g) and tried to install it.   Soon after double-clicking the exe file, I got an error indicating something like "org.talend.rcp.branding.top.product could not be found." in configuration\12725000123.log.  The log also has some more information on which Java version it is trying to use and the command line options passed to it.  The error itself doesn't indicate much, and this wasted quite a bit of time.

Going through the Talend manual, I found out that the profiler needs Java version 1.5 or later.  But I was pretty sure that I had the latest Java environment on this box, and I checked twice by running java -version in cygwin and checking the path, and also by visiting http://www.java.com and clicking on the "Verify Installation" button.  Both indicated the latest version.

Yahoo SEM isBooksClosed

As part of Internet Marketing and SEM, you will need to pull daily summaries of keyword performance reports from search engines like Google, Yahoo and MSN.

In the case of Yahoo, you should check "isBooksClosed" to get correct data for your report dates; meaning the data you pulled will not change if isBooksClosed is true, because Yahoo analytics has collected all the data for the date you requested the report.

This is especially important if you are pulling the previous day's data early in the morning, as Yahoo may not have had time to collect and complete the analytics. We have run into the books-not-closed condition many times for a few accounts out of many tens of accounts. We then need to re-pull the data later in the day and re-run the daily summaries.

I have reproduced the part of Yahoo's explanation related to this from http://developer.searchmarketing.yahoo.com/docs/V6/gsg/BasicReportService.php#books_closed

Google's extended searches

Internet search users are very well aware of the Google search engine; typically they enter their query or keywords in the search box and click the resulting links of interest/relevance to them (typically first page results, and especially among the top 3 or 4 results).  A large percentage of users seem to be unaware of Google's extended searches, which can provide the exact result they are looking for without having to hop to one more site.

For example, weather for a particular city, current time at a location, dictionary definitions, area codes, sports scores and many more.   Following are a few examples.

For weather: To find the current weather in Boston, USA, just enter the keywords "weather Boston, USA" and you will see today's and the next few days' weather in the first result!

For time: To find current time at Bangalore, India (or Bengaluru, India) enter "time Bengaluru, India" and first result is the time.

For dictionary: use "define" keyword first before entering your query like "define avatar"

Yahoo SEM daily report - Currency and Dups handling

If you are pulling daily keyword performance data from the major search engines to store in your Data Warehouse, you need to perform many transformations before daily, weekly or monthly summaries are available to business (end) users.  In the case of Yahoo, the required transformations I needed to make were, one, to handle currency conversion and, two, to handle Yahoo sending more than one keyword row when only one is expected. This seems to happen when the query or landing page URL is modified for the same bid unit, for high-impression queries, and also for disabled queries/keywords.

The following SQL handles both, with data read from a staging table and loaded into the final table after some transformation and cleaning. It does currency conversion for all countries except the US and Mexico (whose units are in US dollars and do not need conversion).

The sub-query below uses MySQL GROUP_CONCAT to find any duplicate keywords for the bid unit, and dim_curreny has the daily currency exchange rates pulled from an external third-party service early in the morning. The se_query_group table has the grouping of the keywords based on company requirements and search engine. The query also filters out any keyword that has zero impressions.

The query can be further modified for better performance by using query_id - joining with the data warehouse query tables and using query_id in the joins - and also by using the search engine bid unit as the surrogate key.

Ruby libmysql.dll or mysql_api.so error

Installing Ruby MySQL module is pretty straight forward either on *nix or Windows by running
> gem  install  mysql
But on Windows, when you try to load the module with "require 'mysql'" you may get an error like
C:/Ruby/lib/ruby/gems/1.8/gems/mysql-2.8.1-x86-mswin32/lib/1.8/mysql_api.so: 126: The specified module could not be found.   - C:/Ruby/lib/ruby/gems/1.8/gems/mysql-2.8.1-x86-mswin32/lib/1.8/mysql_api.so (LoadError)
#  from C:/Ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
...
...

In that scenario, you are missing "libmysql.dll" in the system32 directory.  In my case I had a MySQL server installation with the DLL in the folder C:/Program Files/MySQL/MySQL Server 5.4/lib/opt; copying it into the Windows system32 directory resolves the error.

Mapping URL to top level domain

When parsing web logs as part of web analytics, one might be interested in top-level-domain analysis before drilling down further.  The following query maps a URL (referer) to its top level domain.  Top level domains include not only the generic ones like '.com', '.edu', '.org', etc., but also country-code top level domains like '.uk', '.de', etc.  You can download the list from IANA (Internet Assigned Numbers Authority).

After downloading it, import it into a table dim_int_mktg_tld which you can join with the staging data.
CREATE TABLE `dim_int_mktg_tld` (
  `tld` varchar(16) NOT NULL,
  `type` varchar(32) default 'country-code',  #-- Can be country-code or generic
  `info` varchar(500) default NULL,
  PRIMARY KEY  (`tld`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

#--  Join with yesterday's staging data and get the TLD
SELECT WL.url, TLD.tld
FROM stg_weblog_date WL
LEFT OUTER JOIN dim_int_mktg_tld TLD
ON substring_index(
     substring_index(
       substring_index(url, '://', -1),
       '/',1), 
    '.',-1) = TLD.tld
WHERE WL.date = DATE_ADD(CURRENT_DATE, interval -1 day)

#-- Example: http://www.abc.co.au/in/test
#-- Innermost substring_index returns everything to the right of '://' 
#--   i.e, www.abc.co.au/in/test,
#-- then the middle one gets everything to the left of first '/'
#--   i.e, www.abc.co.au
#-- then the outer most substring_index returns everything to the right of last '.'
#--   i.e, au
#-- which is then joined with TLD table

HTH,
Shiva

Download Google keywords through SUDS

There are situations where you may want to download all the keywords (criteria) uploaded to Google AdWords, and if you have a substantial keyword basket (say, more than a million), then either downloading manually or checking through the tools provided by Google becomes prohibitive.  Also, the Google daily keyword performance report provides metrics only for the keywords that have had at least one impression.

The following application is a simple tool that downloads all keywords and dumps the data to different files.    I have removed the error handling code and hardcoded many values to simplify it; you can modify this to your requirements, and don't forget to add your account and API details.  I have been able to download a few million keywords using multiple processes.

Environment:
Ubuntu karmic OS running Python 2.6.4 for Adwords API version 13 (V13) using Python SUDS SOAP module.

fizzbuzz in Ruby, Python, Perl, more

Recently, I ran into a couple of articles on codinghorror.com that covered interviews with programmers/software engineers.  Many of the commenters answered the programming (fizzbuzz) question using their favorite language - VBA, Ruby, Perl, etc.  Since the post was no longer accepting new comments, I thought of posting mine here on my blog instead.

fizzbuzz question: For all numbers between 1 and 100, print 'fizz' for any number divisible by 3, 'buzz' for any number divisible by 5 and 'fizzbuzz' for any number divisible by both 3 & 5.

Article 1:  Why Can't Programmers.. Program?
Article 2: The Non-Programming Programmer
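Since the original comment thread is not reproduced here, one minimal Python take on the question, as a sketch:

for n in range(1, 101):
    if n % 15 == 0:        # divisible by both 3 and 5
        print("fizzbuzz")
    elif n % 3 == 0:
        print("fizz")
    elif n % 5 == 0:
        print("buzz")
    else:
        print(n)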

Adding disk space to Ubuntu (guest OS)

Recently, on one of my VMware systems (with Windows XP as the host OS and Ubuntu as the guest OS), I ran into a disk space shortage while installing a package (Thrift - a cross-language service framework).  Though there are many articles on the Internet about extending guest OS partitions, they were for older versions, no longer applicable, or had many complex steps.  After spending some time digging into the issue, I found pretty straightforward steps to achieve what I wanted - doubling the disk space for Ubuntu.

Initial setup:
  Host OS: Windows OS
  Guest OS: Ubuntu Karmic  (had 8G pre-allocated)
  Disk: 148G Total and 60G Free out of which 8G to be allocated to Ubuntu

Needed:
   Double Ubuntu disk space to 16G by adding a new partition

Step 1:
  Run "disk analyzer" and get current disk space usage data that provides total, used, available and usred.  See below.  The system has less than 10% free disk space and needs more.



MySQL SP parameter for IN clause

Recently, I came across a question on stackoverflow.com where the poster indicated that he was not able to use a parameter string as part of a stored procedure's IN clause to get the right results - the result set always returned one row only.

Test table used for procedure:
CREATE TABLE cities (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(100) NOT NULL,
  PRIMARY KEY  (`id`)
);
insert into cities (name) values 
 ('London'), ('Manchester'), ('Bristol'), 
 ('Birmingham'), ('Brighton');

Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull the user access date and time along with other dimensions like user, IP, geo data, etc. Here I will use Hadoop/Hive/Python to pull date and IP data from an access log into Hadoop and run some queries. The example illustrates Hadoop (version 0.20.1) streaming, SERDE, and Hive's (version 0.40) plugin custom mapper (get_access_log_ip).

The steps below load a few thousand rows into "access_log_2010_01_25" and then populate a target table (dw_log_ip_test - data warehouse access log) by extracting the date from a format like DD/Mon/YYYY:HH:MM:SS -0800 into 'DD/Mon/YYYY', along with the remote IP address, through a Python streaming script (a sketch of such a mapper follows step 1 below).

Step 1: First create a table for the access log (access_log_2010_01_25) and then load data into it.

hive> 
CREATE TABLE access_log_2010_01_25 (
  request_date STRING,
  remote_ip STRING,
  method STRING,
  request STRING,
  protocol STRING,
  user STRING,
  status STRING,
  size STRING,
  time STRING,
  remote_host STRING,
  ts STRING,
  perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s  time=%9$s  host=%10$s timestamp=%11$s  perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'   
    >   OVERWRITE INTO TABLE access_log_2010_01_25;
#- After load the data in one of the record would look like:
#- 25/Jan/2010:13:14:05 -0800      123.123.123.123   GET     /xmls/public/thumbnail.xml   HTTP/1.1        -       302     250     0  abcd.com   1264454045    -
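The custom mapper (get_access_log_ip) itself is not shown above; a minimal sketch of such a Python streaming script, assuming request_date and remote_ip are the first two tab-separated columns as in the table definition, might look like this:

#!/usr/bin/env python
# Streaming mapper sketch: read rows of access_log_2010_01_25 from stdin and
# emit 'DD/Mon/YYYY<TAB>remote_ip'.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    request_date, remote_ip = fields[0], fields[1]
    day = request_date.split(":", 1)[0]      # '25/Jan/2010:13:14:05 -0800' -> '25/Jan/2010'
    print("%s\t%s" % (day, remote_ip))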

Hive Metastore Derby DB

If you are using Hive in default mode, you may see the following behavior - you start the hive client from different directories and see different results when you run a query like "show tables".  For example, say you have Hive installed in /usr/local/hive, you are currently in your home directory, and you run

~> /usr/local/hive/bin/hive    #-- get to hive
hive> create table new_table (c1 string);
hive> show tables;

Now you will see "new_table" in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #-- get to hive
hive> show tables;

Now you don't see "new_table" in your list of tables.  Those who come from a typical SQL background may find this a little weird at first, since the results seem to differ depending on where you started the hive client.  The reason is that Hive uses an "embedded Derby" database to store metadata, and one of the default configuration properties is to use the current directory to store metastore_db.

On starting hive from two different directories as above, one would see two "metastore_db" directories created, one in the home (~) and one in the /tmp directory.  You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in "/usr/local/hive/conf/hive-default.xml" as below.

Default setting:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

Update it to:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

"/home/hadoop" is an example and one can appropriately change it to suitable host and directory.  Say,
<value>jdbc:derby:;databaseName=//localhost:1234/metastore_db;create=true</value>

Cheers,
Shiva

Installing MySQL Python Package on Windows

I work in a *nix environment, but sometimes you may want to test some features on your own Windows desktop, either due to access constraints on test or development systems (for example, the company not allowing installation of the latest packages or non-standard tools), or because you have set up your own little world on a desktop to do your tests.  Anyway, I needed to install the Python MySQL package, and installing it is not as straightforward as on Linux systems.

> python setup.py install

The above typically takes care of it in a *nix environment. But on Windows you may see the following error:
"Unable to find vcvarsall.bat". If so, you should download and install Microsoft Visual C++ Express Edition; installing the whole Visual Studio SDK seems like overkill for getting one .BAT file.

Now to install the MySQL Python package, download it from SourceForge, gunzip it, untar it, and make the following change before running the install.

Hadoop Cheat Sheet

Cheat sheet for Hadoop version 0.20.1 (Jan.2010).





Data loading through Talend ETL Studio

In a couple of my earlier posts I talked about pulling data from search engines and loading it into staging tables of a data warehouse.  You can load data in a couple of ways, including the database's bulk load tools - for example, MySQL's "LOAD DATA INFILE" statement - or using Talend Open Studio or similar ETL tools (Pentaho, Informatica, etc.).

In this article, I will load a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes. The setup used was:
  MySQL 5.1 (Community version)
  Talend Open Studio (3.2.2)
  Perl 5.10.1


Other posts that may be of interest:
1. URL parsing in Python, Perl, MySQL - an example
2. Google Adwords Data Retriever
3. Google Analytics Data Retriever


The data flows through 6 components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput), starting with reading the input data file, transforming it into another output file and then loading it into the database.  Component #3 (the sorting component) is not needed, but is shown here to stress the ease with which a developer can quickly put together an ETL process.  The Palette tab on the right hand side houses these components and many more.  If you use Java instead of Perl, you will likely have more components to choose from; for the list of all available components take a look at Talend Forge.

The input record format is tab-separated fields with one record per line, and so is the output file in component 4.
# Typical fields in the file are date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url etc.
# 2009-12-12   9876543210   34987650   1147382910   1234567890  0.07  
# 0.226632  247.98   15444  http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html

I provide the following regular expression to parse each URL and extract the domain name in component 2.
     /(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?/ig

And this parses URLs in formats like http://www.amazon.com or https://amazon.com or http://www.amazon.co.uk or amazon.com and more.

Component 4 sorts the input stream coming from the regexp output and dumps it into an output file, which is then read into a filter that drops any record missing the primary key before loading into the staging database.   You can load into the database without the primary key constraint to further speed things up, and clean up the duplicates before pushing into either dimension or fact tables.

The whole run took a little over a minute (~63 seconds) on a test system running all three (MySQL, Talend, Perl).  For the daily load, you can use a scheduler to kick off the job and automate the loading process.






 Hope that helps,
Shiva

URL parsing in Python, Perl, MySQL

There are many situations when one needs to parse URLs. It could be in web log parsing, or for Google keyword or other performance data that is downloaded through APIs. Application languages like Python, Perl, PHP, etc. provide modules to parse URLs, or you can use raw regular expressions. Below I have shown a few ways of doing it with a simple example - either in the application before loading into the staging area of the data warehouse, or in SQL (MySQL) after the load.

In the example URL http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get the sub-domain name and the file name parts, i.e., www.hiregion.com and sem-google-adwords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions - An intro
2. 99 bottles of beer in many languages


import time
import re
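# (The original listing is truncated here; the lines below are a sketch of how it
#  might continue, not the author's original code. Python 3 is assumed.)
from urllib.parse import urlparse

url = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html'

# Using the standard library:
parsed = urlparse(url)
print(parsed.netloc)                     # www.hiregion.com
print(parsed.path.rsplit('/', 1)[-1])    # sem-google-adwords-api-data-retriever.html

# Using a raw regular expression ('re' is imported above):
match = re.match(r'https?://([^/]+).*/([^/?#]+)$', url)
if match:
    print(match.group(1))                # www.hiregion.com
    print(match.group(2))                # sem-google-adwords-api-data-retriever.html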