Hive Metastore Derby DB

If you are using Hive in its default mode, you may see the following behavior: you start the Hive client from different directories and get different results when you run a query like "show tables". For example, say Hive is installed in /usr/local/hive, you are currently in your home directory, and you run

~> /usr/local/hive/bin/hive    #-- get to hive
hive> create table new_table (c1 string);
hive> show tables;

Now you will see "new_table" in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #-- get to hive
hive> show tables;

Now you don't see "new_table" in your list of tables. Those who come from a typical SQL background may find this a little odd at first, since the results differ depending on where you started the Hive client. The reason is that Hive uses an embedded Derby database to store its metadata, and one of the default configuration properties tells it to create the metastore_db under the current working directory.
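
A quick way to confirm where the embedded database landed (a minimal sketch; it just checks for the metastore_db directories the two sessions above would have created):

import os

# Each directory you start hive from gets its own embedded Derby store.
for d in (os.path.expanduser('~'), '/tmp'):
    p = os.path.join(d, 'metastore_db')
    print(p + ' exists: ' + str(os.path.isdir(p)))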

Starting Hive from two different directories as above leaves two "metastore_db" directories, one in home (~) and one in /tmp. You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in "/usr/local/hive/conf/hive-default.xml" as below.

Default setting:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

Update it to:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

"/home/hadoop" is an example; change it to any suitable directory. To share one metastore across hosts, you would instead run Derby in network server mode and use the client URL form, say:
<value>jdbc:derby://localhost:1234/metastore_db;create=true</value>
(The network form requires the Derby network server to be listening on that host and port, and the Derby client JAR on Hive's classpath.)

Cheers,
Shiva

Installing MySQL Python Package on Windows

I work in a *nix environment, but sometimes you may want to test features on your own Windows desktop, either due to access constraints on the test or development systems (for example, the company not allowing the latest packages or non-standard tools) or because you have set up your own little world on a desktop to run your tests. Anyway, I needed to install the Python MySQL package, and installing it on Windows is not as straightforward as on Linux systems.

> python setup.py install

The above typically takes care of it in a *nix environment. But on Windows you may see the following error:
"Unable to find vcvarsall.bat"
If so, you should download and install Microsoft Visual C++ Express Edition. Installing the whole Visual Studio SDK seems like overkill just to get one .BAT file.

Now to install the MySQL Python package, download it from SourceForge, gunzip it, untar it, and make the following change before running the install.
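
Once the build and install go through, a quick smoke test of the package (the module is named MySQLdb; the host, user, and password below are placeholders):

import MySQLdb

# Placeholder credentials; substitute your own.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='test')
cur = conn.cursor()
cur.execute('SELECT VERSION()')
print(cur.fetchone()[0])
conn.close()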

Hadoop Cheat Sheet

Cheat sheet for Hadoop version 0.20.1 (Jan. 2010).

Data loading through Talend ETL Studio

In a couple of my earlier posts I talked about pulling data from search engines and then loading it into staging tables of a data warehouse. You can load the data in a couple of ways, including a database's bulk-load tools - for example, MySQL's "LOAD DATA INFILE" statement - or an ETL tool like Talend Open Studio or similar (Pentaho, Informatica, etc.).
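
As a rough sketch of the bulk-load route (the table name, file name, and credentials below are made up for illustration, and LOCAL INFILE must be enabled on the server):

import MySQLdb

# Hypothetical connection and staging table; adjust to your environment.
conn = MySQLdb.connect(host='localhost', user='etl', passwd='secret',
                       db='staging', local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/keyword_perf.tsv'
    INTO TABLE stg_keyword_performance
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
""")
conn.commit()
conn.close()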

In this article, I will load a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes. The setup used was:
  MySQL 5.1 (Community version)
  Talend Open Studio (3.2.2)
  Perl 5.10.1


Other posts that may be of interest:
1. URL parsing in Python, Perl, MySQL - an example
2. Google Adwords Data Retriever
3. Google Analytics Data Retriever


The data flows through 6 components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput): the input data file is read, transformed into another output file, and then loaded into the database. Component #3 (the sorting component) is not strictly needed, but is shown here to stress the ease with which a developer can quickly put together an ETL process. The Palette tab on the right-hand side houses these components and many more. If you use Java instead of Perl, you will likely have even more components to choose from; for a list of all available components, take a look at Talend Forge.

The input record format is tab-separated fields with one record per line, and so is the output file in component 4.
# Typical fields in the file: date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url, etc.
# 2009-12-12   9876543210   34987650   1147382910   1234567890  0.07  0.226632  247.98   15444  http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html

In component 2, I provide the following regular expression (Perl syntax, with the slashes escaped) to parse each URL and extract the domain name.
     /(https?:\/\/)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?/ig

This parses correctly formatted URLs like http://www.amazon.com, https://amazon.com, http://www.amazon.co.uk, or plain amazon.com, and more.
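
If you want to sanity-check the pattern outside of Talend, here is a rough Python equivalent of the same expression (the i flag becomes re.IGNORECASE, and g becomes a search per URL):

import re

# Same pattern as the Perl version above, for quick testing.
domain_re = re.compile(
    r'(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?',
    re.IGNORECASE)

for url in ('http://www.amazon.com', 'https://amazon.com',
            'http://www.amazon.co.uk', 'amazon.com'):
    m = domain_re.search(url)
    if m:
        print(url + ' -> ' + m.group(0))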

Component 3 sorts the stream coming out of the regexp component, and component 4 dumps it into an output file; that file is then read by the filter (component 5), which drops any record missing the primary key before the load into the staging database. To speed things up further, you can load into the database without the primary-key constraint and clean up duplicates before pushing into the dimension or fact tables.

The whole run took a little over a minute (~63 seconds) on a test system running all three (MySQL, Talend, Perl). For daily loads, you can use a scheduler to kick off the job and automate the loading process.

Hope that helps,
Shiva

URL parsing in Python, Perl, MySQL

There are many situations in which one needs to parse URLs: web log parsing, or handling Google keyword or other performance data downloaded through APIs. Application languages like Python, Perl, and PHP provide modules for this, or you can parse URLs with raw regular expressions. Below I show a few ways of doing it with a simple example - either in the application before loading into the staging area of the data warehouse, or in SQL (MySQL) after the load.

In the example URL http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get the sub-domain name and the file name parts, i.e., www.hiregion.com and sem-google-adwords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions - An intro
2. 99 bottles of beer in many languages


import time
import re
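
# A minimal sketch of the regular-expression route, using the example URL
# from above (the pattern is illustrative, not exhaustive):
url = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html'

# Group 1: the sub-domain/host part; group 2: the trailing file name.
m = re.match(r'(?:https?://)?([^/]+)/(?:.*/)?([^/?#]+)', url)
if m:
    print(m.group(1))   # www.hiregion.com
    print(m.group(2))   # sem-google-adwords-api-data-retriever.html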