Your genome data through API

Recently 23andme.com reduced the price of their DNA test kit to $99, so you can now get your hereditary and other related information for less than $100!  That is a big drop from where it started ($999, and later $299).  I know little about genomics, but it is a very interesting, dynamic, fast-growing field with the potential to change the way we view health (you are now empowered to know whether you are at risk of certain congenital diseases) and ancestry!

My interest was in the data that you can play with.  With the API in place you can pull your own or demo data.  To do that I first needed to set up the data pull through the API; following is a quick summary of the setup on my Mac.

Some important links:
23andme API
OAuth Introduction
OAuth flow

After creating a developer login account, you can set up the application/client with the given credentials - client_id and client_secret.  See below.



For quick testing and pulling data I used the Dancer web framework locally on my Mac; 23andMe uses OAuth2 with three-legged authentication.  As a first step, get the 'code' by creating a simple page with a link to their login page.

For example, the link below takes the user to authorize your client; once the user successfully logs in, the authentication dance happens between the client and server.

"a api.23andme.com="" authorize="" href="http://www.blogger.com/" https:="" redirect_uri="http://localhost:5000/receive_code/&response_type=code&client_id=YOUR_CLIENT_ID&scope=basic" rs123="">"Connect with 23andMe. [pid: $$]";

Note: the pid above is a process id so I know when I killed and restarted Dancer.
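A minimal Dancer route serving this page might look like the sketch below (assuming Dancer 1 running on localhost; YOUR_CLIENT_ID is a placeholder and the scope string is only an example):

#!/usr/bin/env perl
use strict;
use warnings;
use Dancer;

# Serve a page with the "Connect with 23andMe" link; $$ interpolates this process's pid
get '/' => sub {
    my $auth_url = 'https://api.23andme.com/authorize/'
                 . '?redirect_uri=http://localhost:5000/receive_code/'
                 . '&response_type=code'
                 . '&client_id=YOUR_CLIENT_ID'
                 . '&scope=basic%20rs123';    # scopes are space separated (URL-encoded)
    return qq{<a href="$auth_url">Connect with 23andMe. [pid: $$]</a>};
};

dance;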

User clicks on the link


and then logs in to authorize your client to access her resources.  This is the point where the 'code' is received and exchanged for an access_token.


After the successful OAuth dance you can call any of the endpoints ( https://api.23andme.com/docs/reference/ ).  Here is the demo user call ( https://api.23andme.com/1/demo/user/ ).
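A rough sketch of the /receive_code handler is below: it exchanges the code for an access_token and then calls the demo user endpoint.  The token endpoint path and parameter names follow the standard OAuth2 flow described in their docs, but treat them as assumptions here; YOUR_CLIENT_ID and YOUR_CLIENT_SECRET are placeholders.

#!/usr/bin/env perl
use strict;
use warnings;
use Dancer;
use LWP::UserAgent;
use JSON qw(decode_json);

get '/receive_code' => sub {
    my $code = param('code');                  # authorization code sent back by 23andMe
    my $ua   = LWP::UserAgent->new;

    # Exchange the code for an access_token
    my $resp = $ua->post('https://api.23andme.com/token/', {
        client_id     => 'YOUR_CLIENT_ID',
        client_secret => 'YOUR_CLIENT_SECRET',
        grant_type    => 'authorization_code',
        code          => $code,
        redirect_uri  => 'http://localhost:5000/receive_code/',
        scope         => 'basic rs123',
    });
    die 'token exchange failed: ' . $resp->status_line unless $resp->is_success;
    my $token = decode_json($resp->decoded_content)->{access_token};

    # Call an endpoint, e.g. the demo user resource
    my $user = $ua->get('https://api.23andme.com/1/demo/user/',
                        Authorization => "Bearer $token");
    return $user->decoded_content;
};

dance;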


Data bandwidth diagram - Washington Post article

In the last few days there has been a lot of news about the NSA leaks. One of the presentation slides (#2) in the article shows the bandwidth capacity as below. The diagram uses D3, with the data in a csv file.




Shown below is the same data in a chord diagram, with JavaScript animation highlighting each region's bandwidth on mouse-over.



Asia/Pacific Region highlighted below:



Git useful links

Here are some git links that are very useful and have helped me over time in working with git.  If you have other suggested links, feel free to send them to me or add them in the comments.


http://git-scm.com/book - A must read (free pdf book!)

http://gitimmersion.com/index.html - A downloadable tutorial to try out commands on a sample repo

http://sitaramc.github.com/master-toc.html - Excellent coverage of git

http://pcottle.github.io/learnGitBranching/?demo  - Learning git with excellent demo!

http://marklodato.github.com/visual-git-guide/index-en.html - A nice quick visual guide of git

http://www.vogella.com/articles/Git/article.html - Another excellent intro to git

http://ndpsoftware.com/git-cheatsheet.html#loc=workspace - A dynamic visual cheat sheet

http://www.gitguys.com/topics/ - A detailed intro with presentations

http://nvie.com/posts/a-successful-git-branching-model/  - Great article on branching strategy

http://steveko.wordpress.com/2012/02/24/10-things-i-hate-about-git/ - A view on why git is bad for version control

http://stackoverflow.com/questions/871/why-is-git-better-than-subversion - A view on why git better than svn


Other very specific git related topics:

http://weblog.masukomi.org/2008/07/12/handling-and-avoiding-conflicts-in-git  -- Understanding git merge conflicts

http://stackoverflow.com/questions/67699/how-do-i-clone-all-remote-branches-with-git

http://longair.net/blog/2009/04/16/git-fetch-and-merge/ - fetch/merge vs. pull commands

http://toroid.org/ams/git-central-repo-howto - Sharing git repo with others

http://stackoverflow.com/questions/457927/git-workflow-and-rebase-vs-merge-questions - Detailed discussion of rebase vs. merge

http://stackoverflow.com/questions/1241720/git-cherry-pick-vs-merge-workflow - When to use cherry pick vs merge/rebase

http://stackoverflow.com/questions/1398329/take-all-my-changes-on-the-current-branch-and-move-them-to-a-new-branch-in-git - Applying current-branch changes to another branch

http://stackoverflow.com/questions/2187000/untracked-files-between-branches-in-git - Switching between branches with untracked files

http://stackoverflow.com/questions/2498458/why-did-git-set-us-on-no-branch - Solution to "no branch" issue

https://ariejan.net/2010/06/10/cherry-picking-specific-commits-from-another-branch - Cherry pick a specific commit

http://stackoverflow.com/questions/2862590/how-to-replace-master-branch-in-git-entirely-from-another-branch - Replacing master branch itself!

https://github.com/git/git/blob/master/contrib/completion/git-completion.bash - Git bash completion script

http://nathaniel.themccallums.org/2010/10/18/using-git-fast-forward-merging-to-keep-branches-in-sync/ - git fast-forward example

http://stackoverflow.com/questions/3895453/how-do-i-make-a-git-commit-in-the-past - Perform a commit in the past!

http://tedfelix.com/software/git-conflict-resolution.html - Different git merges

Cheers,
Shiva



Avoid git prompt for username and password

While working with git you can either enter your username and password each time you perform operations like git pull and push, or set things up so you are never prompted.  In the following example I will be using github repositories for the remotes, with local repos on my Mac.  I will also be using my personal account and a company user account to access two different github accounts and their respective repos.

Simple setup:


   One account only and access to its respective repositories.

Here, just modify your ~/local_repo/.git/config file to use either an https or an ssh connection.  A typical config looks like:

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    # url = https://github.com/USER_OR_ORG_NAME/REPO.git
    url = ssh://git@github.com/USER_OR_ORG_NAME/REPO.git
[branch "master"]
    remote = origin
    merge = refs/heads/master
...
...
Use command-line git pull/push with an ssh pub-key set up at github.com.  See how to provide the pub-key to github.  If you already have an ssh key and would like to use the same one, just add an entry with the corresponding values to the ~/.ssh/config file.  See below for more details on how to do this.

By default, when cloning is done with the git clone command, the generated config file will have the entry "url = https://github.com/..." (assuming you are cloning from a github repo).  See how to switch from https to ssh for a given repo, or comment out the line as above and enter the "ssh" url shown in the example.  Git config (git_config) has a great many options that control different aspects of git behavior.

Multiple account setup:


Among many scenarios, another one is using two or more different github accounts from the same computer/system.  This is where the ~/.ssh/config file is useful; an ssh setup is required here, and https doesn't help much.

Other variations would be multiple accounts with different service providers, not just github.  That is a little easier than having multiple accounts with the same service provider.

Example: two Github accounts - doe_work and doe_personal - with a separate ssh key created for each account with ssh-keygen.

> ssh-keygen -t rsa -f ~/.ssh/id_rsa_doe_work_git  -C "Job new_key doe@company.com"
> ssh-keygen -t rsa -f ~/.ssh/id_rsa_doe_personal   -C "Personal key"

Now configure both accounts to use ssh and its config file - ssh_config.  The important section in ssh_config is "Host", a string pattern that is matched in the config file to pick up the respective options/variables.  You obviously should add the newly created public keys to github under Settings -> SSH Keys -> Add SSH Key.  A sample pub key for the above:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdnbxKkCrYUv3YbutC2Dw6jIhQWLNIzNA3Ec6inlmrngwB33fCaEP4ZiOzPq8A0BRBCyV HYhC3txA9Jn1tRXVZ4tUGEslvN2qF2HNXJhSx8V5Vk1r3LmWe1uehOjAekSK0apELpkafSwigzgkm9oAmbNQ5p0N1e8ar/TXbOOzWVMRu9K G/fILuHf90UZ4H5hOrZov9eZSwabnSMvORirizFXYZPp/FQ30fV3wZJKJoNnmOY+/txjnNc+mikYiezjeA66vWlDGfJQ+Xlb+i1bnXoxBfv hrE/nSuSUVNmGy0bYPOFwbxPrnz0jFGCgdUh7KfKD2yE/gc0abhW0nyxkP Job new_key doe@company.com

...
ServerAliveInterval 60
ServerAliveCountMax 60
...
Host github.com
    HostName github.com
    # User doe_work
    IdentityFile ~/.ssh/id_rsa_doe_work_git
...
Host github-mine
    HostName github.com
    # User doe_personal
    IdentityFile ~/.ssh/id_rsa_doe_personal
...
Now, in each repo, change its .git/config to use the above ssh hosts.
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = ssh://git@github.com/USER_OR_ORG_NAME/REPO.git
and
[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
    url = ssh://git@github-mine/USER_OR_ORG_NAME/REPO.git

These three configuration files are used by git and ssh to perform the handshake with the remote server and authenticate each git pull, push, fetch, etc.


Large data mysqldump to Greenplum

Recently I needed to load a single table with a few hundred million rows from a transaction system into Greenplum/PostgreSQL from MySQL.  The MySQL schema didn't have many tables, but this one table was large - around 50GB including data and indexes.  I ended up testing the two different techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straightforward: with mysqldump I should be able to use the postgres load utility
> psql -h HOST -U USER -f FILENAME
but it turned out to be an interesting challenge, with many changes needed to load successfully.
Another minor quirk was that the transaction system was using Clustrix, a vendor-specific version of MySQL.  Its dump creates a file that is not fully compatible with a direct load into postgresql; dumping even with the --compatible=postgresql option didn't help much.

One major issue while loading a huge file with the psql utility is the "Out of memory" error, which shows up even with a reasonably small file, say 1GB.
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.
As a first step I removed all MySQL comments and anything other than the data in the INSERT INTO statements.

Example lines removed are below.
-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
And retained only the lines between each
   INSERT INTO line and the ENABLE KEYS line.
I used a script to perform the filtering; a rough sketch is below.
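Something along these lines would do it (a minimal sketch, not the exact script I used; the usage line and file names are only illustrative):

#!/usr/bin/env perl
# Keep only the data portion of the dump: every line from an
# "INSERT INTO ..." line up to (but not including) the "ENABLE KEYS" line.
use strict;
use warnings;

my $in_data = 0;
while (my $line = <>) {
    $in_data = 1 if $line =~ /^INSERT INTO/;
    $in_data = 0 if $line =~ /ENABLE KEYS/;
    print $line if $in_data;
}

# Usage:  ./filter_inserts.pl dump_file.sql > data_only.sql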

This gave me all the data I needed in only a few hundred lines - but with each line as long as 10MB or more!  These are long lines with thousands and thousands of records.  At certain intervals (every 100,000 records or so) Clustrix had inserted a new "INSERT INTO ..." row.  I removed these extra INSERT commands and split the records with a simple Perl one-liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g' FILENAME
thus inserting a newline at the end of each record; the new file now had around 200 million lines.

With the continued "Out of memory" error you are somewhat misled into believing that Greenplum is slurping all the data into memory before loading, which shouldn't be the case in the first place: with INSERT INTO .... VALUES ( .... ) statements there is no need to do so.  The next option was to find the real error by splitting the file into smaller files, adding an INSERT INTO statement at the beginning of each split file, and removing the trailing "," at the end of the last line.

After trying 10 million, 1 million, and 0.5 million lines per file, Greenplum started throwing meaningful errors, like a non-existent table (because the search path was not set in postgresql), a missing ",", etc.

Split command used
> split --lines=500000 FILENAME
Adding "INSERT INTO ...." to each of these files and instead of seeking to end of file and removing extra ",", I added a new dummy line which I can delete later from uploaded table.

> for fn in `ls x*`;
    do echo "Working on $fn";
      echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
      cat $fn >> "${fn}_r_l";
      echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l" ;
   done
This created, for each split file, a corresponding file with the "_r_l" (ready_to_load) suffix.

Then loaded the table
> for fn in `ls xd*_r_l`;
    do
      echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "$fn";
    done


Systems and utilities used:

  Greenplum DB - Greenplum Database 4.0.6.0 build 4
  Postgresql - PostgreSQL 8.2.14
  MySQL - 5.0.45-clustrix-v4.1
  Perl - 5.8.8 multithreaded
  Bash
  All running on linux x86_64 with 24G memory

There were more than 400 files (0.5 million rows each), and the load took less than three hours.  Still substantial, but it was a one-time load and that was acceptable.


Technique 2:   Using mysqldump and Greenplum gpload

Greenplum's bulk loading utility (gpload) is excellent for loading large data sets.  After dumping the data, cleaning it, and formatting it into a few files of about 10GB each, you can use gpload as below,

gpload  -f  $gpload_ctl_file

with a control file created dynamically from a template.  For example, in the control file below, replace all the placeholders with their respective values.  With a dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well; a small sketch of filling such a template follows the control file.

VERSION: 1.0.0.1
DATABASE:
USER:  
HOST:  
PORT:  
GPLOAD:
   INPUT:
     - SOURCE:
         LOCAL_HOSTNAME:
           -
         PORT:
         FILE:
           -
     - FORMAT: text
     - DELIMITER: '|'
     - NULL_AS: 'NULL'
     - ERROR_LIMIT: 25
     - ERROR_TABLE: sandbox_log.gpload_errors
     - COLUMNS:
         - timestamp: text
         - priority: text
     ...
     ...
   PRELOAD:
     - TRUNCATE: false
   OUTPUT:
     - TABLE:
     - MODE: INSERT
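Below is a minimal sketch of filling such a template from Perl; the __PLACEHOLDER__ tokens, file names, and sample values are all hypothetical and only illustrate the substitution step:

#!/usr/bin/env perl
# Replace __PLACEHOLDER__ tokens in a gpload control-file template
# with this run's values and write out the control file for gpload -f.
use strict;
use warnings;

my %values = (
    __DATABASE__ => 'analytics',                     # example values only
    __USER__     => 'gpadmin',
    __HOST__     => 'gp-master.example.com',
    __PORT__     => 5432,
    __FILE__     => '/data/loads/transactions.dat',
    __TABLE__    => 'staging.transactions',
);

open my $tmpl, '<', 'gpload_template.yml' or die "template: $!";
open my $ctl,  '>', 'gpload_run.yml'      or die "control file: $!";
while (my $line = <$tmpl>) {
    $line =~ s/(__[A-Z_]+__)/exists $values{$1} ? $values{$1} : $1/ge;
    print {$ctl} $line;
}
close $tmpl;
close $ctl;

# Then run:  gpload -f gpload_run.yml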

This is a much faster and more efficient load than technique 1.

HTH,
Shiva

Recursion defined

Recursion, see Recursion.  :)

Something defined in terms of itself.  Or, as computer scientists and programmers sometimes make the point through recursive acronyms:
GNU - "GNU's Not Unix"
YAML - "YAML Ain't Markup Language"
Or the beautiful Sierpinski triangles.

When a function calls itself, some interesting things happen behind the scenes, such as holding onto the local variables, which are used later when execution unwinds the stack.  For a typical example of recursion, solving a factorial, one may write:
#!/usr/bin/env perl

use strict;
sub factorial {
    my $v = shift;
    return 1 if $v == 1;
    return $v * factorial($v - 1);
}
factorial(5);
When the first call is made to factorial(5), execution jumps into the factorial function (subroutine) and gets to the last line, where, while evaluating it, it encounters another call, factorial($v - 1), which in turn calls the subroutine again.  Each call pushes a new call frame onto the stack (with its arguments).  When a function returns, its frame is popped off the stack and is gone.

A few things work together here: the call stack, the heap, the garbage collector (which frees any variable or function object whose reference count drops to zero), and the execution system.

Now to see more on recursion you can try the following


  1 #!/usr/bin/env  perl
  2 $! = 1;
  3 use strict;
  4 use IO::Handle;
  5 use Carp qw(cluck);
  6
  7 STDOUT->autoflush(1);      # Flush output immediately
  8 STDERR->autoflush(1);
  9
 10 sub factorial {
 11     my $v = shift;
 12  
 13     dummy_func();             # Sub that returns immediately printing call stack
 14     return 1 if $v == 1;
 15     print "Variable v value: $v and its address:", \$v,
                     "\nCurrent sub factorial addr:", \&factorial, "\n","-"x40;
 16     return $v * factorial($v - 1);    # Builds on call for each func call
 17 }
 18  
 19 sub dummy_func {
 20     cluck;
 21 }
 22
 23 factorial(5);
Resulting output:
  1     main::dummy_func() called at ./t_recursion.pl line 13
  2     main::factorial(5) called at ./t_recursion.pl line 23
  3 Variable v value: 5 and its address:SCALAR(0x7ff6240546a0)
  4 Current sub factorial addr:CODE(0x7ff62402f2c0)
  5 ----------------------------------------
  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9 Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)
 11 ----------------------------------------
 12     main::dummy_func() called at ./t_recursion.pl line 13
 13     main::factorial(3) called at ./t_recursion.pl line 16
 14     main::factorial(4) called at ./t_recursion.pl line 16
 15     main::factorial(5) called at ./t_recursion.pl line 23
 16 Variable v value: 3 and its address:SCALAR(0x7ff6240612f8)
 17 Current sub factorial addr:CODE(0x7ff62402f2c0)
 18 ----------------------------------------
 19     main::dummy_func() called at ./t_recursion.pl line 13
 20     main::factorial(2) called at ./t_recursion.pl line 16
 21     main::factorial(3) called at ./t_recursion.pl line 16
 22     main::factorial(4) called at ./t_recursion.pl line 16
 23     main::factorial(5) called at ./t_recursion.pl line 23
 24 Variable v value: 2 and its address:SCALAR(0x7ff624061538)
 25 Current sub factorial addr:CODE(0x7ff62402f2c0)
 26 ----------------------------------------
 27     main::dummy_func() called at ./t_recursion.pl line 13
 28     main::factorial(1) called at ./t_recursion.pl line 16
 29     main::factorial(2) called at ./t_recursion.pl line 16
 30     main::factorial(3) called at ./t_recursion.pl line 16
 31     main::factorial(4) called at ./t_recursion.pl line 16
 32     main::factorial(5) called at ./t_recursion.pl line 23

When the recursion script is kicked off, it pushes the factorial(5) frame onto the call stack (output line 2 above); factorial then calls dummy_func, which also goes onto the stack (line 1).  Hence, when cluck is called in dummy_func, there are two calls on the stack along with any arguments passed.

Then dummy_func returns and is popped from the stack.  The program moves to line 14 (script above), whose condition evaluates to false, and line 15 then prints output lines 3 & 4 ($v and its location, and the factorial sub's location).

Script line 16 calls factorial, which pushes a new function call onto the stack; at that point the value of $v is 5.  The call and this variable are in the same scope and on the stack, so later, when the inner call returns, its result is multiplied by $v (value 5).

When factorial is called the 2nd time (its first call from line 16, pushed onto the call stack), $v is reduced by 1 ($v - 1), the value is copied, and execution starts at the top of the subroutine again.  Remember, the definition of the function always stays at the same location (CODE(0x7ff62402f2c0)) in program memory.

This execution then calls dummy_func, which spits out the call stack, and as you would expect, dummy_func is now on top, the 2nd factorial call is in the middle, and the 1st factorial call is at the bottom.  The stack is LIFO (Last In, First Out; equivalently FILO, First In, Last Out).  Then execution moves to lines 14 & 15.  The output looks like:

  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9  Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)

At script line 16 the recursion continues, and you get output lines 12 to 32.  In the last call, the base (terminal) condition of the recursion is met ( return 1 if $v == 1; ) and it returns 1.

factorial of 1 => 1! = 1;

Now the stack unwinding begins: the return value 1 (from the factorial(1) call) is multiplied by the variable $v (value 2), giving 2, which is returned by the  return $v * factorial($v - 1);  statement.
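Written out, the unwinding of the whole call chain is just:

factorial(5) = 5 * factorial(4)
             = 5 * (4 * factorial(3))
             = 5 * (4 * (3 * factorial(2)))
             = 5 * (4 * (3 * (2 * factorial(1))))
             = 5 * (4 * (3 * (2 * 1)))
             = 120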

Finally,  5! = 120.

All this happens behind the scenes, and it is often enough just to know and recognize the common pattern when it happens :).  I wouldn't worry about how the implementation is done when I run a query like

SELECT column_N FROM table_X;
It is so darn simple, yet so much goes on behind that SQL statement - from mapping the table to a file, to extracting the correct values from the exact locations in that file.  It is all hidden from the application program.

For more details take a look at "Call Stack" or "Activation Record".

But if you would like to dig deeper through debugging, try:


> perl -d t_recursion.pl
Loading DB routines from perl5db.pl version 1.33
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(t_recursion.pl:2):	$! = 1;
  DB<1> n
main::(t_recursion.pl:7):	STDOUT->autoflush(1);
  DB<1> n
main::(t_recursion.pl:8):	STDERR->autoflush(1);
  DB<1> n
main::(t_recursion.pl:23):	factorial(5);
  DB<1> s
main::factorial(t_recursion.pl:11):
11:	    my $v = shift;
  DB<1> s
main::factorial(t_recursion.pl:13):
13:	    dummy_func();
  DB<1> s
main::dummy_func(t_recursion.pl:20):
20:	    cluck;
  DB<1> T
. = main::dummy_func() called from file `t_recursion.pl' line 13
. = main::factorial(5) called from file `t_recursion.pl' line 23
  DB<1> 


Using Webfonts or font-face

With all new browsers supporting CSS3, it is easy to use .woff (Web Open Font Format) fonts, and page developers are no longer limited to the fonts available locally on a user's computer.  The W3 discussion of CSS3 puts it this way:

With CSS3, web designers are no longer forced to use only web-safe fonts

Along with many others, Google provides access to a repertoire of fonts that can be used.  As of early 2013 there were more than 615 font families available.


Using @font-face, the browser links to the requested font and uses it.  Multiple @font-face rules can be applied and used selectively within the same page or site.  In Google dynamic views, for example, with the embedded code

@font-face {
  font-family: 'Great Vibes';
  font-style: normal;
  font-weight: 400;
  src: local('Great Vibes'), local('GreatVibes-Regular'), url(http://themes.googleusercontent.com/static/fonts/greatvibes/v1/6q1c0ofG6NKsEhAc2eh-3brIa-7acMAeDBVuclsi6Gc.woff) format('woff');
}

one can set the required selector to this font.  To get the @font-face rules above, click the "Quick-use" link on Google Web Fonts and paste the href into a browser.  The W3 draft on CSS3 explains these rules and all the CSS3 details.

To access the Blogger dynamic views template, log in to your Blogger home and select Template > Customize (button) > Advanced > Add CSS.  Paste @font-face rules like the ones below.
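For example, an entry like the one below applies the font to a selector; the .title selector here is only an illustration - use whichever selector your template exposes for post titles:

@font-face {
  font-family: 'Great Vibes';
  font-style: normal;
  font-weight: 400;
  src: local('Great Vibes'), local('GreatVibes-Regular'), url(http://themes.googleusercontent.com/static/fonts/greatvibes/v1/6q1c0ofG6NKsEhAc2eh-3brIa-7acMAeDBVuclsi6Gc.woff) format('woff');
}

.title {
  font-family: 'Great Vibes', cursive;
}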



In the above example the 'Great Vibes' font is applied to the titles of the pages/content.