Nice chart usage for dynamic pricing

Real-time analytics lets you track and monitor users and their activities, and then adjust what is presented to them.  That could be a relevant advertisement, such as showing a Nikon camera ad to a user who has been searching for a new camera or comparing camera prices.

Taking it one step further, the price of an item can be adjusted depending on whether the user is a loyal customer and/or there is a higher chance that he or she may buy other accessories.  Real-time predictive analytics makes this possible.

Below is a nice graph presented by the Wall Street Journal, also available at http://www.ritholtz.com/blog/2012/09/lucky-us-toilet-paper-priced-like-airline-tickets/

The graph shows three companies' (Sears, Best Buy and Amazon) price variation over a day for a microwave.  Amazon increased prices by more than 10% during the peak hours (~8am to 12:30pm and then again 3pm to ~9pm EDT). All times shown in the graph are in PDT (Pacific Daylight Time).

Even more interesting would be to observe whether prices varied based on the user's location or on where Amazon's servers were located.  It is simple to geo-map the IP address of a user's computer/device and vary the prices accordingly!  Different users from different cities at different times would see different prices, and the price points and user experience could be optimized for improved sales.





Using d3 visualization for fraud detection and trending

d3 (Data-Driven Documents) is a great data visualization tool, and I recently used it to track possible fraud and to see how some metrics behaved over periods ranging from a few hours to a few weeks.  You can filter a group of interest out of, say, a million or more users and then use d3 to work through it manually for a more unique set.

The graph is dynamic (unlike in this blog): you can select a range of lines by moving the cursor over any of the Y-axes and brushing over them.  It is much more interesting in action than the snapshot images below.

Below is users' "A" metric over the last 5 hours, 5 to 12 hours, 12 hours to 1 day, and so on up to the last 14 to 28 days.  The user group was selected from a few different segments and each major color corresponds to a segment; d3 automatically varies the color slightly for each line.  Though it is not very clear in the image below, the main colors were blue, green, orange and purple.



The input to d3 is a simple CSV file and all the presentation is handled by d3, unlike many packages I had used previously, where I ended up creating an output file in HTML or some XML for Flash.  The big advantage of d3 over these is that you attach an HTML element to each data point in code, and the built-in data visualization functions do the rest of the magic.

In the above scenario, for example, you can move the cursor to the left-most scale (5 hours) and zoom in on the lines between 1,700 and 2,000.  There is only one user well above the rest, who all have a metric of 200 or lower.  This user hadn't done much in the last 4 weeks until the last 5 hours!  Time to look into what this user is doing, and we use other tools for the further analysis.



Similar to the above scenario, below is another graph where I am interested in all users whose score was between 600 and 1,400 over the last 2 to 7 days.  There is not much that is exciting in this graph, though I have seen more interesting details at other times.


Happy data visualization!

Users Password Analysis

As a data engineer it is always interesting to work on large, unique data sets.  With the recently released Yahoo user details (453K accounts), many insights can be gleaned from the data.  For example, even though password hacking has been well known for a long time, a large number of users still choose simple passwords, sometimes as simple as "password" or "123456".  Here are the top 10 passwords and the number of users who used them:


123456        1667
password       780
welcome        437
ninja          333
abc123         250
123456789      222
12345678       208
sunshine       205
princess       202
qwerty         172

It is also interesting to see how many users had a unique password, one not used by anyone else in this data set.  There were 10.6K users with no password at all, which might be due to a data issue; these were ignored for many of the calculations.  Only ~304K users (69%) had unique passwords.
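For reference, the numbers above and the unique-password count come from a straightforward aggregation; a minimal sketch, assuming the dump were loaded into a hypothetical table leaked_accounts(email, password):

# Top 10 most reused passwords (the list above)
SELECT password, COUNT(*) AS users
FROM leaked_accounts
WHERE password <> ''
GROUP BY password
ORDER BY users DESC
LIMIT 10;

# Users whose password appears only once in the data set (~304K, ~69%)
SELECT COUNT(*) AS users_with_unique_password
FROM (
    SELECT password
    FROM leaked_accounts
    WHERE password <> ''
    GROUP BY password
    HAVING COUNT(*) = 1
) u;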

Another interesting insight: if a password is used by more than one user, there is a good likelihood that it is a dictionary word or phrase ("whatever", "iloveyou"), a proper name ("jordon", "ginger"), a simple number (123321), or something otherwise easily guessed (for example, "q1w2e3r4" or "asdfgh" straight off the QWERTY keyboard).  Even when only two users shared a password, there was some certainty that it was guessable, and with each additional user the certainty increases quickly.  Under these circumstances, even if the passwords are hashed (with MD5, SHA or similar) by the service provider, a brute-force application can recover the passwords of these users.

Looking into how users from different email service providers set up their passwords showed the following.  As expected, Yahoo had the most users (X-axis), while smaller providers ("others" in the chart) had a higher share of users (71.7%) with unique passwords.  At the same time, Gmail and Live users' average password length was more than 8.87 characters.  Password length is represented by the size of the bubble.
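The per-provider breakdown behind the chart (and the table below) can be computed from the same hypothetical table by grouping on the email domain; a rough sketch:

# Users, share of unique passwords, and average password length per provider
SELECT SUBSTRING_INDEX(a.email, '@', -1) AS provider,
       COUNT(*) AS users,
       ROUND(100 * SUM(p.cnt = 1) / COUNT(*), 1) AS pct_unique_passwords,
       ROUND(AVG(LENGTH(a.password)), 2) AS avg_password_length
FROM leaked_accounts a
JOIN (SELECT password, COUNT(*) AS cnt
      FROM leaked_accounts
      WHERE password <> ''
      GROUP BY password) p
  ON a.password = p.password
GROUP BY provider
ORDER BY users DESC;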


A bigger bubble that is higher up the Y-axis is better, as it represents more users with unique, longer passwords.  See the table below for more details.



Even more interesting analysis can be done on people's and places' names appearing in passwords.  One could use the popular-names list from the US Social Security Administration, which goes back to 1880!  A lot more passwords simply used these names, and many more matches can be found with minor modifications such as changing i to 1 or o to 0 (zero), etc.

With so many users choosing simple passwords, service providers and websites should force each user to pick a stronger password by enforcing rules at registration or at login.  Users should also be required to change their passwords every few months.  It might be even better if each computer were equipped with a fingerprint or iris reader for user authentication, avoiding this whole password mess.



Simple time series metric

Here is an example of a chart recently created using MySQL, Perl, PHP and FusionCharts, with fudged data. The X-axis is date & hour for the last 7 days.  The data is pulled from MySQL and an XML file is generated for consumption by PHP and FusionCharts.


The time data is in UTC (a Unix timestamp for the first second of each hour; for example, 1262307600 => 2010-01-01 01:00:00), summarized along with other dimensions and metrics. The simplified fact table looks like

ts_begin_hr, dim_1, dim_2, metric_1, metric_2, etc.

The query used to pull the required report data:

         SELECT 'dim1' dim, concat(D_H.d, ' ', D_H.hr) pst,
             CASE WHEN P.met1 IS NULL THEN 0.0 ELSE P.met1 END METRIC
         FROM
            (
             SELECT d.d, h.hr
             FROM
              (SELECT '00' hr UNION SELECT '01' UNION SELECT '02' UNION SELECT '03' UNION SELECT '04'
               UNION SELECT '05' UNION SELECT '06' UNION SELECT '07' UNION SELECT '08'
               UNION SELECT '09' UNION SELECT 10 UNION SELECT 11 UNION SELECT 12
               UNION SELECT 13 UNION SELECT 14 UNION SELECT 15 UNION SELECT 16
               UNION SELECT 17 UNION SELECT 18 UNION SELECT 19 UNION SELECT 20
               UNION SELECT 21 UNION SELECT 22 UNION SELECT 23
              ) h
              CROSS JOIN
              (SELECT full_date d FROM DW_DB.dim_date
               WHERE full_date BETWEEN date_add(current_date(), interval -6 day) AND current_date()
              ) d  -- 1
            ) D_H
            LEFT OUTER JOIN
            (
              SELECT 'dim1' dim
                  , date(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) d
                  , hour(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) hr
                  , sum(met1) met1
              FROM DW_DB.FACT_TABLE FT
              JOIN DW_DB.DIM_TABLE DM
                ON FT.dim_id = DM.dim_id
              WHERE from_unixtime(ts_begin_hr) >= date_add(now(), interval -7 day)
              GROUP BY dim, d, hr
            ) P  -- 2
         ON D_H.d = P.d AND D_H.hr = P.hr
         WHERE D_H.d < date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
            OR (D_H.d = date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
                AND D_H.hr <= hour(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
              )
         ORDER BY D_H.d, D_H.hr;


Subquery 1 (marked -- 1) builds a cross-joined table of 24 hours (0..23) against the last 7 dates (today and the previous 6 days), resulting in output like YYYY-MM-DD HH; single-digit hours are written as two-character strings (SELECT '0n').  The system and MySQL were running in the UTC timezone while the business users needed the reports in PST.  Subquery 2 (marked -- 2) generates the required metric in PST and is joined to subquery 1.  The WHERE clause filters out any extra hours (later than the current hour) left over from the left outer join.

Subquery 1 is the left side of a left outer join to allow for the possibility that a given hour has no metric (say, a dollar amount), because there may not have been any transactions during that hour for a given dimension.

The result set is dumped to an XML file for FusionCharts consumption, producing the output above through the PHP wrapper. The chart is updated every hour.

Visualizing daily metric with tiles

One effective way to present time series data over a long period is the typical line chart or some variation of it.  It gets a little harder to visualize when you would like to see clustering of data points.  Here is a tiled series that gives a quick glimpse of a metric varying over a few years.  This is a daily metric starting January 01, 2010 and running up to a recent week of May, 2012; nearly two and a half years of metric variation.

The Y-axis represents the day of the week (wday), with 1 for Monday, 2 for Tuesday and so forth, up to 7 for Sunday.  I set the values this way so that I could cluster the weekend metrics (5, 6 & 7) together versus the weekdays (1, 2, 3, 4 - Monday to Thursday).  The X-axis represents the week of the year (1 to 52 or 53), and years 2010, 2011 and 2012 are the series.  Metric values vary from 0.00 to 10.00 and the color of each tile varies slightly based on the metric value.
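A minimal sketch of how the underlying data could be pulled in MySQL, assuming a hypothetical daily table fact_daily(metric_date, metric_value); WEEKDAY() returns 0 for Monday through 6 for Sunday, so adding 1 gives the 1-7 scale used here:

SELECT YEAR(metric_date)           AS yr,     -- series: 2010, 2011, 2012
       WEEK(metric_date, 3)        AS wk,     -- ISO week of year (1 to 52/53)
       WEEKDAY(metric_date) + 1    AS wday,   -- 1 = Monday ... 7 = Sunday
       ROUND(AVG(metric_value), 2) AS metric  -- drives the tile color (0.00 to 10.00)
FROM fact_daily
WHERE metric_date BETWEEN '2010-01-01' AND '2012-05-31'
GROUP BY yr, wk, wday;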

If you were to use discrete metric values, say 0 to 5, the color coding is similarly quite distinct.  See graph 2 below.  That data runs from the 2nd week of March, 2010 to May, 2012.




Graph 1



Graph 2

User traversal digraphs

Users' traversal of a website, the path traversed in an application, or the nodes visited may need to be analyzed to improve the user experience, the efficiency of the path traversed, or the conversion rate by getting the user to the end state. For example, it may be that users visiting one set of nodes (a node being a particular state of the application, like a step in a tutorial state machine) convert better than users going through a different set of states. There are many cases where a graph or directed graph is a necessity.

You can use free tools like 'dot' (part of Graphviz) to draw directed graphs. In the following example, I assume a state/node is represented by a simple integer and the path taken by an edge (directed line).

Here I show a simplified version of a project I worked on, quickly generating a digraph that can be sent or uploaded to a site for internal business users at regular intervals in near real time. A few assumptions are made to keep the main idea clear: each visit to a node is timestamped, and a user can only traverse forward (that is, visit a node numbered higher than the current one). To start with, the data is summarized in a table with a unique identifier for each user or device, the node visited, and the time of visit.

Order the data chronologically for each unique id so that a simple SQL self-join (see the code snippet below) finds the next node each user visited.

SELECT T1.node_id current_node, T2.node_id next_node, count(1) ct
FROM table T1 JOIN table T2 ON T1.row_id+1 = T2.row_id
WHERE T1.unique_id = T2.unique_id # say, user_id
GROUP BY T1.node_id, T2.node_id
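The row_id used above is just a per-user sequence number in time order. One way it could be generated in MySQL 5.x is with user variables (table and column names below are illustrative):

# Number each user's visits chronologically so that row_id+1 is the next node visited
SET @seq := 0, @prev_user := NULL;

CREATE TABLE ordered_visits AS
SELECT user_id AS unique_id,
       node_id,
       visit_ts,
       (@seq := IF(user_id = @prev_user, @seq + 1, 1)) AS row_id,
       (@prev_user := user_id) AS prev_user_ignored
FROM raw_visits
ORDER BY user_id, visit_ts;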

You can also assign a weight/cost to each edge by normalizing the count, which results in a set of rows like
      current_node, next_node, count, count/max_count

This is all the data we need to generate the input file for the 'dot' application. Write a program that takes the above input and dumps it into a file with content like the block below (a SQL sketch for generating these edge lines follows it):

digraph G {
   # splines=false;
   rankdir=LR

   2 -> 3 [penwidth=4.9, label="1190"]
   3 -> 4 [penwidth=4.9, label="1150"]
   4 -> 5 [penwidth=4.8]
   5 -> 6 [penwidth=4.8]
   6 -> 7 [penwidth=4.8]
   7 -> 8 [penwidth=4.8]
   …
   …
}
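The edge lines themselves can be generated directly from the summarized rows; a minimal sketch in SQL, assuming they are stored in a hypothetical table node_edges(current_node, next_node, ct, weight), with weight being the normalized count:

# Each row becomes one edge line of the dot input file; penwidth scaled from 1 to 5
SELECT CONCAT(current_node, ' -> ', next_node,
              ' [penwidth=', ROUND(1 + 4 * weight, 1),
              ', label="', ct, '"]') AS dot_line
FROM node_edges
ORDER BY current_node, next_node;

Wrap the resulting lines in the digraph G { ... } header and footer and feed the file to dot.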

By providing this file as input you can generate output in multiple formats including PDF, PS, etc. See graph 1 below.  You can provide more parameters in the file to dress up the graph or add more information, such as the drop-off (% of users dropped) between states; see graph 2. In essence you are generating a digraph to visualize the data in a more meaningful way.

Digraph 1 - with sample weights between states 2 & 3, 3 & 4



With dot input file content like

  subgraph cluster1 {
    rankdir="LR"
    2 -> 3  [penwidth="23", label="23.48%",  taillabel="4450",  headlabel="3405"]
    3 -> 4  [penwidth="25", label="24.9%",  taillabel="3405",  headlabel="2557"]
    4 -> 5  [penwidth="18", label="18.34%",  taillabel="2557",  headlabel="2088"]
    5 -> 6  [penwidth="19", label="19.3%",  taillabel="2088",  headlabel="1685"]
    6 -> 7  [penwidth="20", label="20.18%",  taillabel="1685",  headlabel="1345"]
    7 -> 8  [penwidth="26", label="26.47%",  taillabel="1345",  headlabel="989"]
    8 -> 9  [penwidth="35", label="35.29%",  taillabel="989",  headlabel="640"]
    9 -> 10  [penwidth="39", label="38.59%",  taillabel="640",  headlabel="393"]
    10 -> 11 [penwidth="36", label="35.88%",  taillabel="393",  headlabel="252"]
  }


Digraph 2 - with users drop-off between states in %



Joins between tables of different charset

Not too long ago I got bitten by a slow-performing MySQL query even though I had the right indexes (in fact a unique index) and the tables were not big. The query, similar to the one below, was taking more than 20 minutes and I ended up killing it.

The issue was that one table stored data in "latin1" while the other used "utf8".

select count(1)
from table_1 T1
left outer join table_2 T2
  on T1.id = T2.id
where T2.id is null

Note: "id"s here are varchars and they are not integers.

T1 had around 100K rows while T2 had around 800K, joined on MySQL 5.1 on a quad-core machine with 16G of RAM.  I expected a much quicker response, and on running the explain plan I could see the query doing a full table scan!  At first I wasn't sure why this was happening, but after digging a little I found that one table was defined with "charset=latin1" and the other with "charset=utf8".

MySQL was converting latin1 to utf8 and then performing the join, which ended up as a full table scan. On converting T2 to utf8 with collation utf8_unicode_ci, the query time dropped to 10 seconds or so!
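For reference, the conversion was along these lines (table name as in the example above):

# Rewrites the table and its indexes so both sides of the join use the same charset
ALTER TABLE table_2 CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;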

With that said, by default we now create all tables with charset=utf8.

Life with Regular Expressions!

Damn this is good!
(Credit:  xkcd.com)


Timezone conversion for hourly report

We recently built a real-time hourly report where data is streamed in, summarized through multiple jobs and pushed into reporting tables.  The business users wanted the reports as quickly as possible and the first phase of the project needed to be implemented in a short time.  With very limited time from concept to design and implementation, I used a 3rd-party web charting service (www.chart.io) and pushed the highly summarized data into a set of tables.

For the final reporting, I needed to convert date and hour from UTC to PST.  The table had date (date type) and hour (tinyint) columns.  Chart.io uses simple x, y charts (line, bar, etc.) and displays whatever coordinates are passed to it.  See below for one of the final charts.


The X-axis is date and hour in the format 'YYYY-MM-DD HH'.  The following SQL converts the date and hour from the corresponding two columns (formats YYYY-MM-DD and HH respectively) in the UTC timezone to PST.

SELECT substr(convert_tz(concat(date, ' ', if(length(hour) = 1, concat('0',hour), hour)), 'GMT', 'America/Los_Angeles'), 1, 13) as x,
...
FROM ...
WHERE ...
GROUP BY x...
ORDER BY x ASC

Maintaining the order is important: for example, the '2011-11-11 02' hour comes right after '2011-11-11 01' and before '2011-11-11 03', so single-digit hours (1 through 9) are prefixed with '0'.  Then convert_tz converts the GMT time to PST (America/Los_Angeles).  There was a suggestion to simply hardcode a -8:00 timezone offset, but that produces the wrong hour during the daylight saving (summer) months.  Since data is stored for each hour over many months and years, the above technique means one doesn't need to worry about this issue; the system automatically and correctly offsets the hours in the report.  Handling timezone conversion is not trivial; see the timezone wiki for the different timezones around the world.  For timezone conversion to work in MySQL, you must first load the appropriate data into the mysql timezone tables (see the "Loading timezones into MySQL or InfiniDB" post below).
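A quick illustration of why the hardcoded offset fails (dates are illustrative; the named zone requires the timezone tables to be loaded):

# Winter (PST, UTC-8): both approaches agree
SELECT convert_tz('2011-01-15 12:00:00', 'GMT', 'America/Los_Angeles');  # 2011-01-15 04:00:00
SELECT convert_tz('2011-01-15 12:00:00', '+00:00', '-08:00');            # 2011-01-15 04:00:00

# Summer (PDT, UTC-7): the hardcoded -8:00 offset is off by an hour
SELECT convert_tz('2011-07-15 12:00:00', 'GMT', 'America/Los_Angeles');  # 2011-07-15 05:00:00
SELECT convert_tz('2011-07-15 12:00:00', '+00:00', '-08:00');            # 2011-07-15 04:00:00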

The reports/charts generated were very useful to the business users and helped them make quick, decisive decisions, build a better user experience, and improve sales and revenue!

Converting 12-hour time to Unix timestamp

Many a time I need to transform a time in AM/PM format to a Unix timestamp for better performance.  The external data coming in has date and time in two separate columns, such as '2011-01-01' for the date and '11:59:59 PM' for the time.  At other times I also need to convert timezones, typically from UTC to PST; for that, check this out.

For example, converting "2011-01-01 11:59:59 PM" into the corresponding Unix timestamp results in the integer 1293926399.

# Create a temp table...
CREATE TABLE tmp.test_dt (
  d date,
  t varchar(20)
) ENGINE=InnoDB;


# Insert a value to test...
INSERT INTO tmp.test_dt (d, t)
VALUES ('2011-01-01', '11:59:59 PM');


# Run the conversion sql...
SELECT unix_timestamp(
         concat(cast(d as char), ' ', cast(time_format(str_to_date(t, '%r'), '%T') as char))
       )
FROM tmp.test_dt;
# Output: 1293926399


# To test
SELECT from_unixtime(1293926399);
# Output: "2011-01-01 23:59:59"


In the conversion SELECT statement, time_format converts the time (the t column) from 11:59:59 PM to 23:59:59, which is then cast as char and concatenated with the date.  After the concat, the string looks like "2011-01-01 23:59:59", which becomes the input to unix_timestamp.
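The same result can be obtained in one step by letting str_to_date parse the date and time together (equivalent, assuming the same session timezone as above):

SELECT unix_timestamp(str_to_date(concat(d, ' ', t), '%Y-%m-%d %r'))
FROM tmp.test_dt;
# Output: 1293926399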



Loading timezones into MySQL or InfiniDB

If you are running queries or reports that use the MySQL timezone conversion function (CONVERT_TZ), make sure you have first run the MySQL utility (mysql_tzinfo_to_sql) that loads timezone information from the system into the MySQL server. Without the timezone information loaded, you will simply get a NULL return value with no message or error. This is expected behavior, because the required MySQL tables already exist but are not populated!

In many threads and forums I have seen developers asking how to convert one timezone to another when all they get is a NULL return value.  I myself have been bitten by it a couple of times when I moved to new MySQL servers where the following steps had not been done.

Here I am using Ubuntu, which has timezone files under the /usr/share/zoneinfo directory. In many flavors of Unix, including Mac OS, you will find them there; if not, search for them with the find command ;)

Also, note that you can load all the timezones of the world in one go. The SQL created by this utility is large, with more than 142K lines of code! After completing the task, the following tables should have been populated:
  1. mysql.time_zone
  2. mysql.time_zone_name
  3. mysql.time_zone_transition
  4. mysql.time_zone_transition_type
There is another related table mysql.time_zone_leap_second which we don't need at this point.
# To load all timezones
shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql

# You may see output like:
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh87' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh88' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh89' as time zone. Skipping it.
# On my system some specific timezone of Riyadh are not defined.  And you can ignore these warnings.
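A quick way to confirm the load worked is to count the named zones; the table is empty before loading and holds several hundred to well over a thousand rows afterwards, depending on the zoneinfo version:

mysql> SELECT COUNT(*) FROM mysql.time_zone_name;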

Restart mysql or infinidb server to bring in the timezones and run a test.
# In case of infinidb idbmysql is just an alias
shell> alias idbmysql
alias idbmysql='/usr/local/Calpont/mysql/bin/mysql --defaults-file=/usr/local/Calpont/mysql/my.cnf -u root'

shell> idbmysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 16
Server version: 5.1.39 MySQL Community / Calpont InfiniDB Community 2.2-2 Final (COSS LA)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 

mysql> SELECT utc_timestamp(), CONVERT_TZ(utc_timestamp(),'UTC','MET');
+---------------------+-----------------------------------------+
| utc_timestamp()     | CONVERT_TZ(utc_timestamp(),'UTC','MET') |
+---------------------+-----------------------------------------+
| 2011-07-29 01:14:40 | 2011-07-29 03:14:40                     |
+---------------------+-----------------------------------------+
1 row in set (0.00 sec)

Linux 20 Years

It has been a while since I blogged, but I couldn't resist embedding this clip as a tribute to Linux, which made free software accessible to so many.
Enjoy,
Shiva



MySQL Connection over SSH

When the MySQL server is set up to accept only localhost connections through bind_address in /etc/mysql/mysql.cnf (its configuration file),
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address          = 127.0.0.1

or when you want to connect to MySQL as "root" and use GUI clients like Workbench or Toad, you can use SSH tunneling.

Operating systems with built-in ssh support make this a little easier.  Mac OS is one such OS, and I will use it as the example below.

Assuming you have already set up an ssh connection, run the following command from a local terminal. If you have not set up ssh, see this brief introductory ssh article, or run > man ssh for more details.
prompt>  ssh -L 3307:localhost:3306  user@remote.com
and then leave this terminal running.  Here I am using the ssh -L [bind_address:]port:host:hostport option and skipping bind_address, which is needed only when you have more than one IP attached (multi-IP) to your local system, which is typically not the case.

    3307 - The local port that clients such as Workbench will use to connect to the remote MySQL.

    localhost:3306 - The remote end of the tunnel, as if you had connected to the host through ssh and then to the MySQL instance running on port 3306.  Had you ssh-ed to the remote host successfully, at its prompt you would have run > mysql -h localhost -u user_name -P 3306

    user@remote.com - Remote user name and remote host

Now start Workbench and set up a new connection by providing the details as if you were connecting to a MySQL instance on your local machine running on port 3307 (see figure 1).  ssh automatically tunnels your connection to the remote host's MySQL (see figure 2).  Ending the tunnel, either by closing the terminal where the "ssh -L" command is running or by killing it, will disconnect Workbench.

Figure 1.
Figure 2.
Note: The same technique can be used for other client applications to connect to any remote application/database over SSH.



Have fun,

Shiva

DW solution - IP mapping to Geo Locations

In my previous few posts on the blog, I covered how to convert IP addresses to the corresponding integers and back. When performing a Data Warehouse (DW) table join between IP addresses for location analysis, joining on an integer IP column is much faster than joining on a string column.  As noted in the other posts, this article covers IPv4, but the technique can be extended to IPv6 addresses by using 2 BIGINTs.

For geo location info, you can use MaxMind's GeoCity data, which is provided as two entities - one mapping an IP block to a location ID and another mapping the location ID to city/state/country.  It is a one-to-many (1:M) relationship: to find an IP address's location you need to find the matching IP block and then map it to a city or country.  There are posts that describe in detail how to do this efficiently for a given IP.  Those techniques work well in an OLTP environment where you pass a known IP address from the UI or application layer to the database and get back its city/country.
  1. Fast MySQL Range Queries on MaxMind GeoIP Tables by Andy Skelton
  2. MaxMind GeoIP Install Setup Tutorial by Bartomedia
  3. Finding a user's country/region based on IP by Shlomi Noach
  4. On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole
The GeoCity data has the following fields, separated by commas, with rows terminated by a newline (\n):
  • Location ID
  • Country Code
  • State/Region
  • City Name
  • Postal Code (US only)
  • Latitude
  • Longitude
  • Metro Code
  • Area Code
IP Block has:
  • Start IP Address
  • End IP Address
  • Location ID
While loading this dimensional data during the ETL process, you should also transform the IP addresses to integers, as explained in "Converting an IP address to an integer".  The GeoCity dimension table will also get a partition range (the reason for this post) and a surrogate key (id).  You should also perform other data validation, such as checking for non-overlapping ranges, valid country codes, etc., which are all standard activities of a DW ETL process.  The time, resources and effort put into cleaning and validating the data before pushing it into the final fact tables pay off within a short period of time; fixing data issues and reconciling reports after the fact in a DW/BI environment always leads to lower ROI.

You can also denormalize the above tables into a single table.  In either case, joining on a range of IP addresses for a given set of IPs results in the same slow performance.

dim_geo_ip_city_location holds around 5.9 million rows with the following attributes:

CREATE TABLE dim_geo_ip_city_location (
    location_id int NOT NULL,
    start_ip varchar(16) NOT NULL,
    end_ip  varchar(16) NOT NULL,
    country char(2),
    region varchar(64), 
    city varchar(128),
    postal_code varchar(32),
    latitude decimal(8,4),
    longitude decimal(8,4),
    metro_code varchar(8),
    area_code varchar(8),
    ...
)ENGINE=...

CREATE TABLE dim_ip_block (
    start_ip_num bigint NOT NULL,
    end_ip_num bigint NOT NULL,
    location_id int NOT NULL,
        ...
)ENGINE=...

The MySQL (5.0) optimizer performs a range operation efficiently with constants, meaning you give the query literal values such as "WHERE ip_addr BETWEEN 1230 AND 1234", but not with range conditions that lack constants (column compared to column).  For more on range optimization see the MySQL documentation, and for indexing see "Mastering the Art of (MySQL) Indexing".

I did not find any article providing insights into doing DW/BI analysis on IP addresses.  In a DW environment the business user is interested in slicing and dicing hundreds of thousands to millions of IPs.  Just finding a few thousand IP address locations (see query 1 below) was taking more than an hour when joining the lookup table (a table similar to dim_geo_ip_city_location but without the partition column (ip_range_partition) and id) with the IP address table, even with appropriate indexes such as a composite index on (start_ip_num, end_ip_num) or separate indexes on start_ip_num and end_ip_num.

Query 1:
-- Simple query to select IP block
SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num
FROM ip_addresses IA use index(...)
JOIN dim_ip_block B use index(...)
WHERE IA.ip_num between B.start_ip_num and B.end_ip_num

This leads to MySQL performing full scans, and in the case of the IP block table (B) it doesn't use any index at all.  To avoid this, I partitioned the dimension (lookup) table into 1,000 parts; since this table has ~5.9 million rows, each partition ends up with ~5,900 rows.  For more on data set partitioning see its wiki.  The partitions should be numbered in ascending order with respect to the ascending range of IP blocks (start_ip_num).  Then create another table (dim_ip_range_partition) with the start_ip_num and end_ip_num of each such range, basically creating super IP blocks.

CREATE TABLE dim_ip_range_partition (
     partition_id int NOT NULL AUTO_INCREMENT,
     start_ip_num bigint NOT NULL,
     end_ip_num bigint NOT NULL,
     ....
) ENGINE=InnoDB ;

Also create a primary key on partition_id and an index on (start_ip_num, end_ip_num) on this table.
The solution is based on two facts: there are no overlapping IP blocks, and the IP blocks are ordered (in ascending order).

To populate the above table,
  1. Add id (row_id) column to dim_ip_block
  2. Add partition_id integer column to dim_ip_block
  3. Update id from 1 to n (highest start_ip_num) in ascending order of start_ip_num
  4. Partition "n" ids into P parts (n/P) where n > 1 and 1 < P < n
  5. Assign each part/partition an ascending integer value (a SQL sketch of steps 1-5 follows the INSERT statement below)
  6. Now populate dim_ip_range_partition (P_TABLE) from dim_ip_block (B_TABLE): for each partition_id between 1 and P, set P_TABLE.start_ip_num to the lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num to its highest B_TABLE.end_ip_num.  In other words...
         INSERT INTO dim_ip_range_partition
            (partition_id, start_ip_num, end_ip_num, ...)
         SELECT partition_id, min(start_ip_num), max(end_ip_num), ...
         FROM dim_ip_block
         GROUP BY partition_id
         ORDER BY partition_id asc
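Steps 1 through 5 above can be sketched in MySQL roughly as follows (the 5,900 comes from ~5.9M rows / 1,000 partitions; the user-variable numbering assumes MySQL 5.x):

# Steps 1 & 2: add the columns
ALTER TABLE dim_ip_block ADD COLUMN id INT, ADD COLUMN partition_id INT;

# Step 3: number the blocks in ascending start_ip_num order
SET @row := 0;
UPDATE dim_ip_block SET id = (@row := @row + 1) ORDER BY start_ip_num;

# Steps 4 & 5: bucket the ids into P = 1000 partitions of ~5,900 blocks each
UPDATE dim_ip_block SET partition_id = CEIL(id / 5900);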

By doing so, we have created super IP blocks, each containing n/P blocks from the original IP blocks.  The following query is one way to extract city/country info for a given set of IP addresses.

SELECT t.ip_num, t.ip, B.start_ip_num, B.end_ip_num
       , L.city, L.country
FROM
(
   -- Get the super IP block (partition) for each IP
   SELECT C.ip_num, C.ip, max(P.partition_id) partition_id
   FROM ip_addresses C use index(...)
   JOIN dim_ip_range_partition P
     ON C.ip_num >= P.start_ip_num
   GROUP BY C.ip_num, C.ip
) t
LEFT OUTER JOIN dim_ip_block B use index(...)
  ON  t.partition_id = B.partition_id     -- And join on this partition
  AND t.ip_num >= B.start_ip_num
  AND t.ip_num <= B.end_ip_num
JOIN dim_geo_ip_city_location L           -- Get city/country
  ON B.location_id = L.location_id

The query that would have taken more than an hour now returns within 30 seconds!  You can further optimize the above design with, say, more denormalized relations or a different number of partitions on both the lookup and looked-up tables (this is a trade-off: at the extreme of P = n the partition table is the same size as the original table, which makes performance worse!).  Or, by extending the partitioning concept to more levels (super super blocks, ...), you can reduce the number of joining rows with a slightly more complex query.

The solution is a typical computer science approach of handling a complex problem by dividing it into smaller ones and solving those.  Here it is applied specifically to MySQL, for large data sets with range operations.

Cheers,
Shiva

Tools to convert IP address

The following tool converts an IP address to other formats. For more on IP address conversion, check out parts 1 and 2.


Converting an IP address to an integer and reverse (Part 1)
IP Address conversions (Part 2)


IP address converter:

[Embedded IP address converter widget, powered by WebToolHub.com - enter an IP such as 192.168.100.200 or its integer form 3232261320.]



You can also check out the 2's complement of a number here. This comes in handy when one needs to work with signed numbers (2's complement).

IP Address conversions

This is part 2 of converting an IP address to an integer and vice versa. For more details and a little introduction, see Part 1.

Converting an IP address to an integer and back is a little more straightforward in MySQL than in SQL Server, since SQL Server (2008) doesn't provide bit-shift operators.

In short, in SQL Server you can't do
SELECT 1<<2 
nor
SELECT 3232261320 -- IP '192.168.100.200'
* (256*256*256*256*256*256) -- Left shift 6 bytes
/ (256*256*256*256*256*256*256) -- Right shift 7 bytes to get 2nd octet

You will encounter "Arithmetic overflow error converting expression to data type int". Even casting each integer to bigint will not solve the problem, as the multiplication overflows a signed bigint, and there is no unsigned bigint in SQL Server.

Another way to solve this in SQL Server is to use bitwise AND: retain only the byte of interest and then shift the bits right by dividing appropriately.
#-- IP '192.168.100.200'
SELECT (cast(3232261320 as bigint) & cast(4278190080 as bigint))/(256*256*256);
SELECT (cast(3232261320 as bigint) & 16711680)/(256*256);
SELECT (cast(3232261320 as bigint) & 65280)/(256);
SELECT (cast(3232261320 as bigint) & 255);

#- And the output will be
192
168
100
200

MySQL, which does support 64-bit unsigned integers and bit-shift operators, makes this simpler:

SELECT (3232261320 << 32) >> 56 AS octet_4;   -- 192
SELECT (3232261320 << 56) >> 56 AS octet_1;   -- 200

Shifting left first drops the higher-order bytes from the unsigned 64-bit result, so shifting back right by 7 bytes leaves only the octet of interest; the two queries return 192 and 200 respectively.

With a given IP you can simply run the following to get its integer value:
mysql>
SELECT ( ((substring_index(ip, '.', 1)) << 24)
       + ((substring_index(substring_index(ip, '.', 2), '.', -1)) << 16)
       + ((substring_index(substring_index(ip, '.', 3), '.', -1)) << 8)
       + ((substring_index(substring_index(ip, '.', 4), '.', -1)) << 0)
) ip_num
FROM
( SELECT '192.168.100.200' as ip
  UNION ALL
  SELECT ip FROM some_ip_table
) t;

Cheers,
Shiva

Converting an IP address to an integer and reverse

When dealing with IP address analysis, one may be required to map IPs to geo locations. There are third-party applications and data sets that map an IP to city, state and country, and businesses can use them, for example, to see where their web traffic is coming from.

This article covers IPv4 addresses, which use 32 bits; a similar technique can be applied to IPv6 addresses, which use 128 bits, by storing them in 2 BIGINTs (64 bits each).

It is easy to handle a reasonable load when querying a few, or a few hundred, IPs, whether from a simple user interface where one or more IP addresses are entered or through APIs. In these cases IP addresses can simply be matched as strings without much of a performance issue. It becomes a bottleneck, though, when millions of rows are loaded through an ETL process into a Data Warehouse environment and reports run against fact tables with hundreds of millions of rows. Under these circumstances, converting the IP to an integer boosts performance substantially.

MySQL provides functions to convert an IP string to an integer and back: inet_aton converts the dotted string to a number (integer) and inet_ntoa converts the number back to the dotted string.

inet_aton and inet_ntoa:

mysql> select inet_aton('192.168.100.200');
+------------------------------+
| inet_aton('192.168.100.200') |
+------------------------------+
|                   3232261320 |
+------------------------------+

mysql> select inet_ntoa(3232261320);
+-----------------------+
| inet_ntoa(3232261320) |
+-----------------------+
| 192.168.100.200       |
+-----------------------+
1 row in set (0.00 sec)

NFS Cache - Invisible file issue

I ran into the following situation; it took some time to diagnose the issue, plus help from a couple of folks on the DBA and operations teams to resolve it.  Here is what happened.

In an application, I exported a large data set from MySQL to a file, say /dir_1/dir_2/exported_file.txt.  After the file was exported, the application went on to consume it by reading its content.  Since MySQL's OUTFILE (data export) doesn't overwrite a file if the file name already exists, the code would first rename any existing file to *.bak.  See the pseudo code below.

If OUTFILE exists
    Move or Rename OUTFILE to OUTFILE.bak    /* Step 1 */
Run MySQL export to OUTFILE    /*  Step 2 */
Check the error code
Read OUTFILE and parse               /* Step 3 */

When I ran the application, it would sometimes create the output file and go on to parse it correctly, but many times it would fail in step 1, throwing an error like "file already exists" when in fact the file was not there, because I had removed it with 'rm -f' before rerunning the program.  Other times it would fail in step 3, indicating that the file did not exist even though SQL had exported it successfully in step 2.  I even added sleep time between the steps, ranging from 5 to 60 seconds, but continued to see the same random behavior.

After spending some time trying to diagnose what might be going on, I ended up debugging NFS caching.  The directory /dir_1 was a mounted file system with NFS attribute caching set to a few hundred seconds.  When the application wrote to the NFS directory, the write cache was updated but not the OS directory structure (inode). Reducing the parameter (actimeo) to a lower number, say 30 seconds, helps alleviate the delay. If the sysadmins are reluctant to change the settings of an existing mount, you can get a new mount point with actimeo set to 30.  Once these changes were made, the application ran smoothly with its sleep times set a little higher than the actimeo value.  Note that setting actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value; it has no default value. See the man pages for more details.




Cheers,
Shiva

Hourly web log analysis through Hadoop

Many a time one may want to parse web logs for quick analysis of A/B tests, security/fraud alerts, or recent advertisement or campaign performance. There are many applications and utilities that perform web log analysis, but more often than not regular expressions provide a powerful and elegant way to analyze these logs, and they are especially handy when dealing with massive, quickly rotating logs.  Check out this wiki for more general info on web analytics.

When each web log is multiple gigabytes and is moved to archive every couple of hours, and there is a farm of web servers (hundreds or thousands of them), many vendor or 3rd-party applications don't scale either. Hadoop streaming with simple utilities can provide insights into what would otherwise have been a costly experiment.

A regular expression was used to extract the HOUR and the string of interest from the Apache web logs. Each entry in the log has a format similar to the ones below.

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test1AB ....

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test-2CD ....


For more details on Apache log configuration, take a look at its specs and custom log formats.  Along with the time of the hit, user IP address, HTTP request method (GET or POST), page requested, and protocol used, one can configure the web server to log many more details including the referrer, user agent (browser), environment variables, etc.

Conversation Prism - An Image

As social media, social networking, advertising and Internet marketing continue to evolve with new technologies, and many companies create their own social groups, it all gets more complex and confusing.  Many a time a picture explains things more elegantly than a thousand words, and in some cases an image is the best-suited tool to explain.  Here is one image, created by Brian Solis & Jesse Thomas, that I like in this conversation!