fizzbuzz in Ruby, Python, Perl, and more

Recently, I ran into a couple of articles on codinghorror.com that covered interviewing programmers/software engineers. Many of the commenters answered the programming (fizzbuzz) question in their favorite language, like VBA, Ruby, Perl, etc. Since the posts no longer allow new comments, I thought of posting the solutions here on my blog instead.

fizzbuzz question: For all numbers from 1 to 100, print 'fizz' for any number divisible by 3, 'buzz' for any number divisible by 5, 'fizzbuzz' for any number divisible by both 3 and 5, and the number itself otherwise.

Article 1:  Why Can't Programmers.. Program?
Article 2: The Non-Programming Programmer
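
To add one more solution to the pile, here is a simple take in Python:

# fizzbuzz: multiples of 3 -> 'fizz', of 5 -> 'buzz', of both -> 'fizzbuzz',
# and the number itself for everything else
for n in range(1, 101):
    if n % 15 == 0:
        print('fizzbuzz')
    elif n % 3 == 0:
        print('fizz')
    elif n % 5 == 0:
        print('buzz')
    else:
        print(n)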

Adding disk space to Ubuntu (guest OS)

Recently, on one of my VMware systems (with Windows XP as the host OS and Ubuntu as the guest OS), I ran into a disk space shortage while installing a package (Thrift, a cross-language services framework). Though there were many articles on the Internet about extending guest OS partitions, they were written for older versions, no longer applicable, or involved many complex steps. After spending some time digging into the issue, I found pretty straightforward steps to achieve what I wanted: doubling the disk space available to Ubuntu.

Initial setup:
  Host OS: Windows XP
  Guest OS: Ubuntu Karmic (8G pre-allocated)
  Disk: 148G total and 60G free, out of which 8G more is to be allocated to Ubuntu

Needed:
   Double Ubuntu disk space to 16G by adding a new partition

Step 1:
  Run "disk analyzer" and get current disk space usage data that provides total, used, available and usred.  See below.  The system has less than 10% free disk space and needs more.



MySQL SP parameter for IN clause

Recently, I came across a question on stackoverflow.com where the poster indicated that he was not able to use a parameter string as part of a stored procedure's IN clause to get the right results; the result set always returned only one row.

Test table used for procedure:
CREATE TABLE cities (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(100) NOT NULL,
  PRIMARY KEY  (`id`)
);
insert into cities (name) values 
 ('London'), ('Manchester'), ('Bristol'), 
 ('Birmingham'), ('Brighton');
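
The usual cause is that MySQL treats the whole comma-separated parameter as a single value, so WHERE id IN (param) only ever matches the leading number in the string. Below is a minimal sketch from a client script, assuming a local test database, the mysql-connector-python driver, and that the parameter is a comma-separated list of ids; the same FIND_IN_SET condition can be used inside the stored procedure body:

# demonstrates the one-row problem and the FIND_IN_SET workaround
import mysql.connector

conn = mysql.connector.connect(user='test', password='test', database='test')
cur = conn.cursor()

ids = '1,3,5'   # the kind of comma-separated parameter passed to the procedure

# the whole string is one value here, coerced to the number 1,
# so only the row with id = 1 comes back
cur.execute("SELECT id, name FROM cities WHERE id IN (%s)", (ids,))
print(cur.fetchall())   # [(1, 'London')]

# FIND_IN_SET treats the parameter as a comma-separated list,
# so all three requested rows come back
cur.execute("SELECT id, name FROM cities WHERE FIND_IN_SET(id, %s)", (ids,))
print(cur.fetchall())   # [(1, 'London'), (3, 'Bristol'), (5, 'Brighton')]

cur.close()
conn.close()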

Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull the user access date/time along with other dimensions like user, IP, geo data, etc. Here I will be using Hadoop, Hive, and Python to pull date and IP data from an access log into Hadoop and run some queries. The example illustrates Hadoop (version 0.20.1) streaming, a SerDe, and a custom mapper (get_access_log_ip) plugged into Hive (version 0.4.0).

The steps below load a few thousand rows into a staging table (access_log_2010_01_25), then extract the date, reducing a format like DD/Mon/YYYY:HH:MM:SS -0800 to 'DD/Mon/YYYY', along with the remote IP address, through a Python streaming script into the target table (dw_log_ip_test, the data warehouse access log).

Step 1: First create a table for the access log (access_log_2010_01_25) and then load data into it.

hive> 
CREATE TABLE access_log_2010_01_25 (
  request_date STRING,
  remote_ip STRING,
  method STRING,
  request STRING,
  protocol STRING,
  user STRING,
  status STRING,
  size STRING,
  time STRING,
  remote_host STRING,
  ts STRING,
  perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s  time=%9$s  host=%10$s timestamp=%11$s  perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'   
    >   OVERWRITE INTO TABLE access_log_2010_01_25;
#- After the load, one of the records would look like this:
#- 25/Jan/2010:13:14:05 -0800      123.123.123.123   GET     /xmls/public/thumbnail.xml   HTTP/1.1        -       302     250     0  abcd.com   1264454045    -
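
The Python streaming script itself (get_access_log_ip) is used in a later step; a minimal sketch of what it might look like is below, assuming Hive's TRANSFORM feeds it tab-separated rows with request_date and remote_ip as the first two columns:

#!/usr/bin/env python
# get_access_log_ip.py (sketch): read tab-separated rows from stdin, keep only
# the 'DD/Mon/YYYY' part of the request date plus the remote ip, and write them
# back out tab-separated for Hive to load into dw_log_ip_test
import sys

for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    if len(cols) < 2:
        continue                                  # skip malformed rows
    request_date, remote_ip = cols[0], cols[1]
    access_date = request_date.split(':', 1)[0]   # '25/Jan/2010:13:14:05 -0800' -> '25/Jan/2010'
    print('\t'.join([access_date, remote_ip]))

In a later step the script would be registered with ADD FILE and invoked through an INSERT ... SELECT TRANSFORM (request_date, remote_ip) USING 'get_access_log_ip.py' query against access_log_2010_01_25 to populate dw_log_ip_test.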