Data and Technology: March 2012

Visualizing the users' traversal on a website or path traversed in an application or nodes visited may need to be analyzed for better user experience or improving the efficiency the path traversed or to improve the conversion rate by getting the user to the end state. For example, it may be that users visiting one set of nodes (node meaning particular state of application like tutorial state machine) convert better than users going through different set of states. There are many cases where a graph or directed graph is a necessity.

You can use freeware tools like ‘dot’ to draw directed graphs. In the following example, I assume the state/node is represented by simple integer and path taken by the edge (directed line).

Here, I have shown a simplified version of a project I worked on in quickly generating a digraph that can be sent or uploaded to site for internal business users on regular intervals in real time. Few assumptions are made to help explain the main idea and they are - each visit to the node is timestamped and user can only traverse forward (meaning visit to the node number higher than the current one). To start with data is summarized in a table with each user or device or some unique key identifier, node visited, and time of visit.

Order the data in chronological order for each unique id so that by doing the self-join (see code snippet below) in SQL we can simply find out the next node the user visited.

SELECT T1.node_id current_node, T2.node_id next_node, count(1) ct
FROM table T1 JOIN table T2 ON T1.row_id+1 = T2.row_id
WHERE T1.unique_id = T2.unique_id # say, user_id
GROUP BY T1.node_id, T2.node_id

You can also provide weightage/cost to the edge by normalizing the count which will result in set of rows similar to
Current_node, next_node, count, count/max_count

This is all the data we need to generate input file for ‘dot’ application. Write a program that takes the above input and dump it into a file with content –

Digraph G {
# splines=false;
rankdir=LR

2 ->3 [penwidth=4.9, label="1190"]
3 -> 4 [penwidth=4.9, label="1150"]
4 -> 5 [penwidth=4.8]
5 -> 6 [penwidth=4.8]
6 -> 7 [penwidth=4.8]
7 -> 8 [penwidth=4.8]
…
…
}

By providing this as input you can generate the output in multiple formats including pdf, ps, etc. See graph 1 below. You can provide more input parameters in the file to fancy the graph or with more information like drop-off (% of users dropped) between states, see graph 2. In essence you are generating a digraph to visualize the data in a more meaningful way.

Digraph 1 - with sample weights between states 2 & 3, 3 & 4

With dot input file content like

subgraph cluster1 {
rankdir="LR"
2 -> 3 [penwidth="23", label="23.48%", taillabel="4450", headlabel="3405"]
3 -> 4 [penwidth="25", label="24.9%", taillabel="3405", headlabel="2557"]
4 -> 5 [penwidth="18", label="18.34%", taillabel="2557", headlabel="2088"]
5 -> 6 [penwidth="19", label="19.3%", taillabel="2088", headlabel="1685"]
6 -> 7 [penwidth="20", label="20.18%", taillabel="1685", headlabel="1345"]
7 -> 8 [penwidth="26", label="26.47%", taillabel="1345", headlabel="989"]
8 -> 9 [penwidth="35", label="35.29%", taillabel="989", headlabel="640"]
9 -> 10 [penwidth="39", label="38.59%", taillabel="640", headlabel="393"]
10 -> 11 [penwidth="36", label="35.88%", taillabel="393", headlabel="252"]
}

Digraph 2 - with users drop-off between states in %

Not too long ago I got bitten by a slow performing MySQL query even though I had right indices (in fact unique index) and the tables were not big tables. The query which is similar to below was taking more than 20 mins which I ended up killing it.

The issue was one table was storing data in "latin1" while other in "utf8".

select count(1)

from table_1 T1

left outer join table_2 T2

on T1.id = T2.id

where T2.id is null

Note: "id"s here are varchars and they are not integers.

T1 had around 100K while T2 has around 800K joining on MySQL5.1, quad machine with 16G RAM. I expected much quicker response and on running the explain plan I could see the query doing the full table scan! First I wasn't sure why this was happening but after digging little bit and found out that one table was defined with "charset=latin1" and another with "charset=utf8".

MySQL was converting latin1 to utf8 and then performing the join which ended up with full table scan. On converting T2 to utf8 with collation utf8_unicode_ci, the query timing dropped to 10secs or so!

With that said by default now we create all tables with charset=utf8.

User traversal digraphs

Joins between tables of different charset