GitHub weekend issue - Dec. 23, 2012

Uh oh. GitHub seems to have had a major issue starting Saturday around 4:30 pm PST. It has been down for a while now :(. The site status page https://status.github.com/ is showing

[screenshot: GitHub status page reporting a major outage]

And the main site:

[screenshot: GitHub main site down]

Greenplum Postgres Regexp fun

Over the last few months I started working in a Greenplum/Postgres environment and was pleasantly surprised by its regular expression capabilities.  Compared to MySQL (5.1) it has quite advanced regex features. For example, I ended up using the following query to extract the decimal numbers embedded in a string.

Input strings have the format "BucketN_N[-N | +]" where N is a decimal number.  For example, "Bucket1_1", "Bucket7_21-50" and "Bucket5_100+" are all valid, and the output should extract the bucket number (first N), the minimum value of the bucket (second N) and the maximum value of the bucket (third, optional N).

The query:

SELECT
  bucket_string
  -- bucket number is the 7th character (single-digit bucket numbers in this data)
  ,substring(bucket_string, 7, 1)::integer as bucket_num
  ,regexp_replace(bucket_string, '^Bucket\\d{1,}_(\\d{1,})([-+].*)?', E'\\1')::integer as minimum_qty
  ,regexp_replace(regexp_replace(bucket_string, '.*-(\\d{1,})', E'\\1'), '.*_(\\d{1,})([+]?)$', E'\\1')::integer
  as maximum_qty
FROM a_table
ORDER BY bucket_string;
Though I could have used other functions like substring, it was fun doing it with regular expressions in the least amount of code.  regexp_replace is the workhorse, taking a string and replacing it with the matched group (\\1).  "^" and "$" are anchors to the beginning and end of the string, \\d matches decimal digits, and {1,} is the "one or more" quantifier.
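As an aside, the pattern-matching form of substring can pull out the same pieces a bit more directly. This is only a sketch against the same a_table/bucket_string as above, and it behaves slightly differently for the open-ended "+" buckets:

SELECT
  bucket_string
  ,substring(bucket_string from 'Bucket(\\d+)_')::integer  as bucket_num
  ,substring(bucket_string from '_(\\d+)')::integer        as minimum_qty
  -- NULL for open-ended "+" buckets (the query above returns the lower bound instead)
  ,substring(bucket_string from '-(\\d+)')::integer        as maximum_qty
FROM a_table;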


And more fun with regexp_matches and array selection.  In this simplified example I wanted to add an offset number of days to 2012 December 01.  The offset is text in another external Greenplum table, and the data is not always clean; it can have some text before the offset.

SELECT d
 , regexp_matches(d, '(?:.*_)?(\\d+)$')    re_ary
 , '2012-12-01'::date + ((regexp_matches(d, '(?:.*_)?(\\d+)$'))[1] || ' days')::interval   new_d
FROM
(
    SELECT '90' d
    UNION
    SELECT 'new_90' d
    UNION
    SELECT 'old_90_10' d
) tt
returns

d          re_ary    new_d
---------------------------------------
90         {90}      2013-03-01 00:00:00
new_90     {90}      2013-03-01 00:00:00
old_90_10  {10}      2012-12-11 00:00:00

Here "?:" makes the group non-capturing and the "?" after it makes the group optional, so any text before the trailing number is allowed but not captured for later use.  '$' anchors the match to the end of the string.  regexp_matches returns an array with a single element, and the subscript [1] selects it.



Nice chart usage for dynamic pricing

Real-time analytics allows tracking and monitoring users and their activities, and then adjusting what is presented to them.  It could be a relevant advertisement, like showing a Nikon camera ad to a user searching for a new camera or comparing camera prices.

Taking it one step further, the price of an item can be adjusted depending on whether that user is a loyal customer and/or there is a higher probability that he/she may buy other accessories.  Real-time predictive analytics makes this possible.

Below is a nice graph presented by the Wall Street Journal, also available at http://www.ritholtz.com/blog/2012/09/lucky-us-toilet-paper-priced-like-airline-tickets/

The graph shows three companies' (Sears, Best Buy and Amazon) price variation over a day for a microwave.  Amazon increased prices by more than 10% during the peak hours (~8am to 12:30pm and again 3pm to ~9pm EDT); the times shown in the graph itself are in the PDT (Pacific Daylight Time) timezone.

Even more interesting would be to observe whether prices varied based on user location or on where Amazon's servers were located.  It is simple to geo-map the IP address of a user's computer/device and vary the prices accordingly!  Different users from different cities at different times would see different prices, and the price points and user experience could be optimized for improved sales.





Using d3 visualization for fraud detection and trending

d3 (Data-Driven Documents) is a great data visualization tool, and I recently used it to track possible fraud and to see how some metrics behaved over periods ranging from a few hours to a few weeks.  You can filter a group of interest out of, say, a million or more users and then use d3 to work through it manually for a more unique set.

The graph is dynamic (unlike in this blog): you can select a range of lines by moving the cursor over any of the Y-axes and brushing them.  It is much more interesting in action than in the snapshot images below.

Below is users' "A" metric over the last 5 hours, 5 to 12 hours, 12 hours to 1 day, and so on up to the last 14 to 28 days.  The user group was selected from a few different segments and each major color corresponds to a segment; d3 automatically varies the color slightly for each line.  Though it is not very clear in the image below, the main colors were blue, green, orange and purple.



The input to d3 is a simple CSV file and all the presentation is handled by d3, unlike many packages I had used previously, where I ended up generating an output file in HTML or some XML for Flash.  A big advantage of d3 over these is that you attach an HTML element to each data point in the code and the built-in data visualization functions do the rest of the magic.

In the above scenario, for example, you can move the cursor to the leftmost scale (5 hours) and zoom in on lines between 1,700 and 2,000.  There is only one user well above the rest, who all have a metric of 200 or lower.  This user hadn't done much in the last 4 weeks until the last 5 hours!  Time to look into what this user is doing, and we use other tools for further analysis.



Similar to the above scenario, below is another graph where I am interested in all users whose score was between 600 and 1,400 over the last 2 to 7 days.  There is not much exciting in this particular graph, though I have seen more interesting patterns at other times.


Happy data visualization!

Users' Password Analysis

As a data engineer it is always interesting to work on large, unique data sets.  From the recently released Yahoo user details (453K accounts), many insights can be gleaned.  For example, even though password hacking has been well known for a long time, a large number of users still use simple passwords, sometimes as simple as "password" or "123456".  Here are the top 10 passwords and the number of users who used them!


123456       1667
password      780
welcome       437
ninja         333
abc123        250
123456789     222
12345678      208
sunshine      205
princess      202
qwerty        172

It is also interesting to see how many users had a unique password, one not used by anyone else in this data set.  There were 10.6K users with no password at all, most likely due to a data issue; these were ignored for most of the calculations.  Only ~304K (69%) of users had unique passwords.
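Both numbers fall out of simple aggregate queries. A minimal sketch, assuming the dump has been loaded into a hypothetical table leaked_accounts(email, password):

-- Top 10 most common passwords and how many users chose them
SELECT password, count(*) AS user_count
FROM leaked_accounts
WHERE password IS NOT NULL AND password <> ''
GROUP BY password
ORDER BY user_count DESC
LIMIT 10;

-- Number of users whose password appears exactly once in the data set
SELECT count(*) AS users_with_unique_password
FROM (
    SELECT password
    FROM leaked_accounts
    WHERE password IS NOT NULL AND password <> ''
    GROUP BY password
    HAVING count(*) = 1
) u;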

Another interesting insight: if a password is used by more than one user, there is a good likelihood that it is a dictionary word or phrase ("whatever", "iloveyou"), a proper name ("jordon", "ginger"), a simple number (123321), or something easily guessed (for example, "q1w2e3r4" or "asdfgh" from the qwerty keyboard layout).  Even when just two users shared a password there was some certainty that it is guessable, and with each additional user that certainty increases quickly.  Under these circumstances, even if the passwords are hashed (with MD5, SHA or similar) by the service provider, a brute-force or dictionary attack can recover the passwords of these users.

Looking at how users from different email service providers set up their passwords showed the following.  As expected, Yahoo had the most users (x-axis), while smaller providers ("others" in the chart) had the highest share of users with unique passwords (71.7%).  At the same time, Gmail and Live users' average password length was more than 8.87 characters.  Password length is represented by the size of the bubble.
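A sketch of the per-provider breakdown, again against the hypothetical leaked_accounts table (split_part is the Postgres/Greenplum way to peel the domain off the email address):

SELECT split_part(a.email, '@', 2)                               AS provider
     , count(*)                                                  AS users
     , round(avg(length(a.password)), 2)                         AS avg_password_length
     , round(100.0 * sum(CASE WHEN p.cnt = 1 THEN 1 ELSE 0 END)
                   / count(*), 1)                                AS pct_unique_passwords
FROM leaked_accounts a
JOIN (
    SELECT password, count(*) AS cnt
    FROM leaked_accounts
    GROUP BY password
) p ON p.password = a.password
WHERE a.password IS NOT NULL AND a.password <> ''
GROUP BY 1
ORDER BY users DESC;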


Bigger bubbles higher up the Y-axis are better, as they represent more users with unique and longer passwords.  See the table below for more details.



Even more interesting analysis can be done on passwords that include people's or places' names.  One could use the US Social Security Administration's list of popular names, which goes back as far as 1880!  A lot more passwords simply used these names, and many more matches can be found with minor modifications like changing i to 1 or o to 0 (zero), etc.

With so many users choosing simple passwords, service providers and websites should enforce stronger passwords at registration or at each login.  Users should also be required to change them every few months.  It might be even better if each computer were equipped with a fingerprint or iris reader that could be used for authentication, avoiding this whole password mess.



Simple time series metric

Here is an example of a chart recently created using MySQL, Perl, PHP and FusionCharts, with fudged data. The X-axis is date & hour for the last 7 days.  The data is pulled from MySQL and turned into an XML file for consumption by PHP and FusionCharts.


The time data is in UTC (a Unix timestamp for the first second of each hour; for example, 1262307600 => 2010-01-01 01:00:00), summarized with other dimensions and metrics. The simplified fact table looks like

ts_begin_hr, dim_1, dim_2, metric_1, metric_2, etc.
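In MySQL DDL terms that would be roughly the sketch below; the table name and column types here are illustrative placeholders, only ts_begin_hr and the general shape come from the description above:

CREATE TABLE fact_hourly (
    ts_begin_hr  INT UNSIGNED NOT NULL,   -- Unix timestamp of the first second of the hour (UTC)
    dim_1        INT NOT NULL,
    dim_2        INT NOT NULL,
    metric_1     DECIMAL(18,2),
    metric_2     DECIMAL(18,2),
    KEY idx_ts_begin_hr (ts_begin_hr)
);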

The query used to pull the required report data:

         SELECT 'dim1' dim, concat(D_H.d, ' ', D_H.hr) pst,
             CASE WHEN P.met1 IS NULL THEN 0.0 ELSE P.met1 END METRIC
         FROM
            (
             SELECT d.d, h.hr
             FROM
              (SELECT '00' hr UNION SELECT '01' UNION SELECT '02' UNION SELECT '03' UNION SELECT '04'
               UNION SELECT '05' UNION SELECT '06' UNION SELECT '07' UNION SELECT '08'
               UNION SELECT '09' UNION SELECT '10' UNION SELECT '11' UNION SELECT '12'
               UNION SELECT '13' UNION SELECT '14' UNION SELECT '15' UNION SELECT '16'
               UNION SELECT '17' UNION SELECT '18' UNION SELECT '19' UNION SELECT '20'
               UNION SELECT '21' UNION SELECT '22' UNION SELECT '23'
              ) h
              CROSS JOIN
              (SELECT full_date d FROM DW_DB.dim_date
               WHERE full_date BETWEEN date_add(current_date(), interval -6 day) AND current_date()
              ) d  -- subquery 1
            ) D_H
            LEFT OUTER JOIN
            (
              SELECT 'dim1' dim
                  , date(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) d
                  , hour(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) hr
                  , sum(met1) met1
              FROM DW_DB.FACT_TABLE FT
              JOIN DW_DB.DIM_TABLE DM
                ON FT.dim_id = DM.dim_id
              WHERE from_unixtime(ts_begin_hr) >= date_add(now(), interval -7 day)
              GROUP BY dim, d, hr
            ) P  -- subquery 2
         ON D_H.d = P.d AND D_H.hr = P.hr
         WHERE D_H.d < date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
            OR (D_H.d = date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
                AND D_H.hr <= hour(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
              )
         ORDER BY D_H.d, D_H.hr
        ;


Subquery 1 (aliased D_H) cross-joins 24 hours (00..23) with the last 7 dates, producing rows like YYYY-MM-DD HH.  Single-digit hours are written as two-digit strings (SELECT '0n').  The system and MySQL were running in the UTC timezone while business users needed the data in PST, so subquery 2 (aliased P) generates the required metric in PST and is joined to subquery 1.  The WHERE clause filters out any extra hours (later than the current hour) left over from the outer join.

A left outer join is used because for a given hour there might be no metric (say, dollar amount) available, since there might not have been any transactions during that hour for a given dimension; those hours should still appear in the chart with a zero value.

The result set is dumped to an XML file for FusionCharts consumption, producing the output above through a PHP wrapper. The chart is updated every hour.

Visualizing daily metric with tiles

One of the effective ways to present time series data over a long period is the typical line chart or some variation of it.  It gets a little harder when you would also like to see clustering of data points.  Here is a tiled series that gives a quick glimpse of a metric varying over a few years.  This is a daily metric starting January 01, 2010 and running up to a recent week of May, 2012; nearly two and a half years of variation.

The Y-axis represents the day of the week (wday), with 1 for Monday, 2 for Tuesday and so forth up to 7 for Sunday.  I set the values this way so that I could cluster the weekend metrics (5, 6 & 7 - Friday to Sunday) together versus the weekdays (1, 2, 3, 4 - Monday to Thursday).  The X-axis represents the week of the year (1 to 52 or 53), and years 2010, 2011 and 2012 are the series.  Metric values vary from 0.00 to 10.00 and the color of each tile varies slightly with the metric value.
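The tile coordinates are easy to derive in SQL. A minimal Postgres-style sketch, assuming a hypothetical daily_metrics(d date, metric numeric) table (extract(isodow ...) returns 1 for Monday through 7 for Sunday, matching the layout above):

SELECT extract(year   from d)::integer  AS series_year   -- one series per year
     , extract(week   from d)::integer  AS week_of_year  -- X-axis: 1..52/53
     , extract(isodow from d)::integer  AS wday          -- Y-axis: 1=Monday .. 7=Sunday
     , metric                                            -- drives the tile color
FROM daily_metrics
WHERE d BETWEEN '2010-01-01' AND '2012-05-31'
ORDER BY series_year, week_of_year, wday;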

If you were to use discrete metric values, say 0 to 5, the color coding is similarly quite distinct.  See graph 2 below, where the data runs from the 2nd week of March, 2010 to May, 2012.




Graph 1



Graph 2

User traversal digraphs

Users' traversal of a website, the path taken through an application, or the sequence of nodes visited may need to be analyzed to improve the user experience, make the traversed path more efficient, or improve the conversion rate by getting the user to the end state. For example, it may be that users visiting one set of nodes (a node here meaning a particular state of the application, like a step in a tutorial state machine) convert better than users going through a different set of states. There are many cases where a graph or directed graph is a necessity.

You can use free tools like Graphviz 'dot' to draw directed graphs. In the following example, I assume each state/node is represented by a simple integer and the path taken is represented by an edge (directed line).

Here I show a simplified version of a project I worked on, quickly generating a digraph that can be sent or uploaded to a site for internal business users at regular intervals, in near real time. A few assumptions are made to help explain the main idea: each visit to a node is timestamped, and a user can only traverse forward (i.e., to a node number higher than the current one). To start with, the data is summarized in a table with a unique identifier for each user or device, the node visited, and the time of the visit.

Order the data chronologically for each unique id so that a SQL self-join (see the snippet below) can simply pick out the next node the user visited.

SELECT T1.node_id current_node, T2.node_id next_node, count(1) ct
FROM table T1 JOIN table T2 ON T1.row_id+1 = T2.row_id
WHERE T1.unique_id = T2.unique_id # say, user_id
GROUP BY T1.node_id, T2.node_id
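The row_id above is assumed to be a sequential number assigned in chronological order per unique id. One way to produce it, a sketch assuming a hypothetical raw_visits(unique_id, node_id, visit_ts) table and a database with window functions (e.g., Greenplum/Postgres):

SELECT row_number() OVER (ORDER BY unique_id, visit_ts)  AS row_id
     , unique_id
     , node_id
     , visit_ts
FROM raw_visits;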

You can also give a weight/cost to each edge by normalizing the count, which results in a set of rows like
      current_node, next_node, count, count/max_count
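Normalizing against the busiest edge could look like this sketch, where edge_counts is the output of the previous query (stored in a table or wrapped in a subquery):

SELECT e.current_node
     , e.next_node
     , e.ct
     , 1.0 * e.ct / m.max_ct  AS weight   -- 0..1, used later to scale penwidth
FROM edge_counts e
CROSS JOIN (SELECT max(ct) AS max_ct FROM edge_counts) m
ORDER BY e.current_node, e.next_node;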

This is all the data we need to generate the input file for the 'dot' application. Write a program that takes the above output and dumps it into a file with content like:

digraph G {
   # splines=false;
   rankdir=LR

   2 ->3 [penwidth=4.9, label="1190"]
   3 -> 4 [penwidth=4.9, label="1150"]
   4 -> 5 [penwidth=4.8]
   5 -> 6 [penwidth=4.8]
   6 -> 7 [penwidth=4.8]
   7 -> 8 [penwidth=4.8]
   …
   …
}

By feeding this file to dot (for example, something like "dot -Tpdf traversal.gv -o traversal.pdf") you can generate output in multiple formats including PDF, PS, etc.; see digraph 1 below.  You can add more parameters to the file to dress up the graph or to carry more information, like the drop-off (% of users lost) between states; see digraph 2. In essence you are generating a digraph to visualize the data in a more meaningful way.

Digraph 1 - with sample weights between states 2 & 3, 3 & 4



With dot input file content like

  subgraph cluster1 {
    rankdir="LR"
    2 -> 3  [penwidth="23", label="23.48%",  taillabel="4450",  headlabel="3405"]
    3 -> 4  [penwidth="25", label="24.9%",  taillabel="3405",  headlabel="2557"]
    4 -> 5  [penwidth="18", label="18.34%",  taillabel="2557",  headlabel="2088"]
    5 -> 6  [penwidth="19", label="19.3%",  taillabel="2088",  headlabel="1685"]
    6 -> 7  [penwidth="20", label="20.18%",  taillabel="1685",  headlabel="1345"]
    7 -> 8  [penwidth="26", label="26.47%",  taillabel="1345",  headlabel="989"]
    8 -> 9  [penwidth="35", label="35.29%",  taillabel="989",  headlabel="640"]
    9 -> 10  [penwidth="39", label="38.59%",  taillabel="640",  headlabel="393"]
    10 -> 11 [penwidth="36", label="35.88%",  taillabel="393",  headlabel="252"]
  }


Digraph 2 - with users drop-off between states in %



Joins between tables of different charset

Not too long ago I got bitten by a slow-performing MySQL query even though I had the right indices (in fact a unique index) and the tables were not big. A query similar to the one below was taking more than 20 minutes, and I ended up killing it.

The issue: one table was storing data in "latin1" while the other was in "utf8".

select count(1)
from table_1 T1
left outer join table_2 T2
  on T1.id = T2.id
where T2.id is null

Note: the "id" columns here are varchars, not integers.

T1 had around 100K rows while T2 had around 800K, joining on MySQL 5.1 on a quad-core machine with 16G of RAM.  I expected a much quicker response, and on running the explain plan I could see the query doing a full table scan!  At first I wasn't sure why this was happening, but after digging a little I found that one table was defined with "charset=latin1" and the other with "charset=utf8".

MySQL was converting the latin1 column to utf8 and then performing the join, which meant the index could not be used and the join ended up as a full table scan. After converting T2 to utf8 with collation utf8_unicode_ci, the query time dropped to 10 seconds or so!
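The fix was a one-time conversion, roughly along these lines (table name from the example above; it rewrites the whole table, so back it up first):

ALTER TABLE table_2
  CONVERT TO CHARACTER SET utf8
  COLLATE utf8_unicode_ci;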

With that said, we now create all tables with charset=utf8 by default.