Data and Technology: 2009

SEM - Google Adwords API Data Retriever

If you are running many large campaigns or thousands of campaigns and want to track keyword performance of Google Adwords, the following application helps retrieve the data on regular intervals (daily, weekly, monthly, etc.). Using your data warehouse and reporting infrastructure you can slice and dice the data based on your business requirements.

Application is a multi-threaded implementation of pulling data through API. Each thread gets a campaign id to pull keyword performance report. The code included here is a skeletal code and you need to add error handing, code to suit your database schema and some ETL. In my environment I have used MySQL 5.0, linux (2.6) and Windows, Perl 5.10.1 and Google Adwords API v13. You can also use Google's client libraries available in many languages including Java, Python, DotNet, Perl, PHP and others.

Other posts that may be of interest:
1. Google Analytics Data Retriever
2. Bulk data loading using Talend
3. Google Keyword Performance data pull

Campaign Class:

package Campaign;

#-------------------------------------
# Campaign Package
#-------------------------------------

use strict;
use DBI;

sub new {
    my $self = {};
    $self->{campIds} = [];  #-- Master campaign ID array

    bless($self);

    $self->_populateCampaignIds();
    return $self;
}

sub _populateCampaignIds {
    my $self = shift;

    #--------------------------
    #- Connect to database to get list of campaigns (IDs).
    #--------------------------
    #- my $dbh = DBI->connect($data_source, $user_name, $pw);
    #- my $sth = $dbh->prepare(
    #-                  "SELECT campaign_id from campaign
    #-                   WHERE  search_engine = 'google'
    #-                     AND  status = 'active';
    #-                  ";
    #- $sth->execute() or die "Unable to execute a query.\n";
    #- while ( @id = $sth->fetchrow_array ) {
    #-     push (${$self->{campIds}}, $id[0]);
    #- }
    #-
    #- $dbh->disconnect;

    #---- For demo, I have hard-coded some IDs.
    push( @{$self->{campIds} }, '1111');
    push( @{$self->{campIds} }, '2222');
    push( @{$self->{campIds} }, '3333');

    return 1;
}

#----------------------------
#- Returns master list of campaign IDs
#----------------------------
sub getCampaignIds {
    my $self = shift;

    return @{ $self->{campIds} };
}

#----------------------------
#- Returns chunk of campaign IDs from master array.
#----------------------------
sub getCampaignIdChunk {
    my $self = shift;
    my $chunks = shift;
    my $index = shift;

    if (!defined $index and !defined $chunks) { 
        return undef; 
    }

    my $chunkSize = int (scalar(@ {$self->{campIds}})/$chunks);
    if ($chunkSize < 1) {
        print "Error: no campaigns to process for thread $index\n";
        return undef;
    }
    my $lIx = $chunkSize * ($index - 1);  #-- low index
    my $hIx = $chunkSize * ($index);      #-- high index

    my @campIds = ();
    if ($index < $chunks) {
        @campIds = splice (@ {$self->{campIds}}, $lIx, $chunkSize);
    } else {
        #- print "Getting LAST chunk for $index from lower $lIx element and a chunk size of $chunkSize\n";
        @campIds = splice (@ {$self->{campIds}}, $lIx);
    }

    return @campIds;
}

1;

Prototype application:

#-------------------------------------------------
# Adwords Data Retriever
#   This is skeleton application to pull Adwords Keywords Performance Report Data
#   from Google through Adwords API.
#   Uses multi-threaded programming technique for simultaneous pull of many campaigns. 
#   Campaign ids in the application are hard-coded with dummy values and hence replace them by connecting
#     to your database and getting actual campaign ids (code shown).
#   Also, for authentication use your user name, password, developer & application code.
# Valid for Google Adwords v13.  
# Author: Shiva M.
#-------------------------------------------------

#- use lib "Add directory where you installed Campaign.pm";
use strict;
use Benchmark;
use Getopt::Std;
use Data::Dumper;

use XML::Simple;
use LWP::Simple;
use Crypt::SSLeay;
use SOAP::Lite;
use IO::Uncompress::Gunzip qw();

use threads;
use threads::shared;

use Campaign;

my (%opts, @worker) = undef;
my $NUM_THREADS = 3;     #-- Modify this to suit your environment. Or better use Config module with ini file to setup application configuration.
my $GOOGLE_API = 'https://adwords.google.com/api/adwords/v13';

#-------------------------------------------------
# Main program.
#-------------------------------------------------
getOpts();

if ($opts{'s'} == 1) {
    for (my $i = 1; $i < ($NUM_THREADS+1); $i++) {
        print "MASTER: Starting thread " . $i, "\n";
        $worker[$i] = threads->new(\&pullCampaignData, $i);
        sleep 5;
    }

  while(anyRunningThread()) {
     for (my $ii = 1; $ii < scalar(@worker); $ii++) {
         my $thr = $worker[$ii];
         if ($thr->is_joinable) {
            my $ret = eval { $thr->join() };
            print "Thread $ii returned: $ret\n";
         }
     }
     sleep(10);
   }
}

exit (0);

#-------------------------------------------------
# Subroutines.
#-------------------------------------------------
sub getOpts {
    if (!getopts('s:d:', \%opts)) {
        print "Invalid option. Exiting\n";
        exit (-1);
    }

    return 1;
}

#-----------------------------
#- Any thread still running?. Returns Yes/No (1/0)
#-----------------------------
sub anyRunningThread {
    my $runThrs = threads->list(threads::running);
    print "Still $runThrs threads are running.\n";
    if ($runThrs) { return 1; }

    return 0;
}

#---------------------------
#- Each thread to pull keyword performance reports for a list of campaigns
#---------------------------
sub pullCampaignData {
    my ($tId) = @_;
    my $thr = $worker[$tId];

    my $to_dollar_conv_factor = 1 /(1000 * 1000);  #-- Currency in micros. $1 <=> 1000 * 1000;

    my $log = "thread_$tId.log";
    open (TFH, ">$log") or die "Unable to open file $log\n";   #-- Log file for the thread.

    #---------------------------------
    #- Get this thread share of load - list of campaign IDs
    #---------------------------------
    my $campaign = Campaign->new();
    my @tCampIds = $campaign->getCampaignIdChunk($NUM_THREADS,$tId);

    print TFH "\npullCampaignData \$tId: $tId: campaignIDs:", map {"$_ "} @tCampIds;

    my @headers = (SOAP::Header->name('email'            => $email),
                   SOAP::Header->name('password'         => $pw),
                   SOAP::Header->name('useragent'        => $ua_name),
                   SOAP::Header->name('developerToken'   => $developer_token),
                   SOAP::Header->name('applicationToken' => $application_token)
                  );

    my $url = sprintf('https://%s.google.com/api/adwords/v13/ReportService', 'adwords');
    my $wsdl = $url . '?wsdl';
    my $service = SOAP::Lite->service($wsdl)->autotype(0)->readable(1)->proxy($url);
    print TFH "\ndebug:service=$service" if $opts{d};

    $service->on_fault(
                sub {
                     my $response = $::ARG[1];
                     die('The following SOAP fault occurred:', "\n", 'faultcode: ',
                         $response->faultcode(), "\n",
                         'faultstring: ', $response->faultstring(),
                         "\n");
                }
    );
    SOAP::Lite->import(+trace => 'debug') if $opts{d};


    #---------------------------
    # Make SOAP call.
    # Get one campaign data at a time. Generate a file to bulk load into database.
    #---------------------------

    foreach my $cId (@tCampIds) {

        my $AWJobParams = {
                  'selectedReportType' => 'Keyword',
                  'name'               => 'AW Keyword Perf Daily Report',
                  'aggregationTypes'   => ['Daily'],
                  'startDay'           => $yesterday_date,
                  'endDay'             => $yesterday_date,
                  'selectedColumns'    => [ qw (
                     KeywordId Keyword CampaignId CampaignStatus
                     AdGroupId AdGroupStatus AdGroupMaxCpa AdWordsType
                     Impressions Clicks CTR CPC Conversions ConversionRate Cost
                     AveragePosition QualityScore AverageConversionValue BottomPosition
                     CPM CostPerConverstion CostPerTransaction DailyBudget DefaultCount
                     DefaultValue FirstPageCpc KeywordDestUrlDisplay KeywordStatus
                     KeywordTypeDisplay LeadCount LeadValue MaxContentCPC MaximumCPC
                     MaximumCPM PageViewCount PageViewValue SalesCount SalesValue
                     TopPosition TotalConversionValue Transactions ValuePerClick
                     ValuePerCost CustomerName ExternalCustomerId )  
                                          ],
                  'crossClient'        => 1,
                  'campaigns'          => [$cId]
        };


        #------------------------------
        # Step 1. Define a job
        #------------------------------
        my $report_job = SOAP::Data->name('job' => $AWJobParams);
        $report_job->attr({'xsi:type' => 'DefinedReportJob', 'xmlns' => "$GOOGLE_API"});

        #------------------------------
        # Output XML file after unzipped.
        #------------------------------
        my $output = "${cId}_kw_perf.xml";


        #-------------------------------
        # Catch any exception
        #-------------------------------
        eval {

            #------------------------------
            # Step 2. Validate report.
            #------------------------------
            $service->call('validateReportJob' => $report_job, @headers);
            print TFH "debug: about to schedule report \n";

            #------------------------------
            # Step 3. Schedule report.
            #------------------------------
            my $job_id = $service->call('scheduleReportJob' => $report_job, @headers)->result();
            $job_id = SOAP::Data->name('reportJobId' => $job_id);

            #------------------------------
            # Step 4. Get report status. Wait until finish or error.
            #------------------------------
            my $status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
            while ($status ne 'Completed' and $status ne 'Failed') {
                print TFH "Report job status is $status\n";
                sleep(30);
                $status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
            }
            if ($status eq 'Failed') {
                die("Report job generation failed.\n");
            }

            #------------------------------
            # Step 5. Get report URL to download the file.
            #------------------------------
            my $report_url = $service->call('getGzipReportDownloadUrl' => $job_id, @headers)->result();
            print TFH "Report is available at $report_url\n" if $opts{d};

            #------------------------------
            # Step 6. Get the data file (report).
            #------------------------------
            my $gzip_report_data = LWP::Simple::get($report_url);


            #------------------------------
            # Step 6a. Save it to local file.
            #------------------------------
            my $localFile = "${cId}_kw_perf.xml.gz";
            open(LFH, ">$localFile") or die "Could not open $localFile: $!\n";
            print LFH $gzip_report_data . "\n";
            close LFH;


            #------------------------------
            # Step 7. Unzip the file and build xml hash.
            #------------------------------
            IO::Uncompress::Gunzip::gunzip(\$gzip_report_data => $output);

            my $xml_hash = XMLin($output);
            print TFH Dumper($xml_hash) if $opts{d};

            #------------------------------
            # Step 8. Create database bulk load file.
            #------------------------------
            open (BLFH, ">${cId}_bulk_load_input.txt") or die "Unable to create bulk load file: $!\n";


            #-- Uncomment the fields that are of interest to you.
            foreach my $row (@{$xml_hash->{table}->{rows}->{row}}) {
                print BLFH
                    $row->{keywordid},"\t", $row->{keyword},"\t",
                    $row->{campaignid},"\t",
                    # $row->{campaign},"\t",
                    $row->{campStatus},"\t", $row->{adgroupid},"\t",
                    # $row->{adgroup},"\t",
                    # $row->{agstatus},"\t",
                    # $row->{agMaxCpa} * $to_dollar_conv_factor,"\t",
                    # $row->{adwordsType}, $row->{keywordid}, $row->{keyword}, $row->{kwStatus},
                    # $row->{customerid}, $row->{acctname}, $row->{kwDestUrl},
                    $row->{imps},"\t", $row->{clicks},"\t", $row->{ctr},"\t",
                    # $row->{cpc} * $to_dollar_conv_factor, $row->{conv},
                    # $row->{convRate},
                    $row->{cost} * $to_dollar_conv_factor,"\t",
                    $row->{pos},"\t",
                    $row->{qualityScore},"\t",
                    # $row->{avgConvValue} * $to_dollar_conv_factor,
                    # $row->{bottomPosition}, $row->{cpm} * $to_dollar_conv_factor,
                    # $row->{costPerConv} * $to_dollar_conv_factor,
                    # $row->{cpt} * $to_dollar_conv_factor,
                    # $row->{budget} * $to_dollar_conv_factor, $row->{defaultConv},
                    # $row->{defaultConvValue} * $to_dollar_conv_factor,
                    # $row->{firstPageCpc} * $to_dollar_conv_factor,
                    $row->{kwDestUrl},"\t", $row->{kwType},
                    # $row->{leads}, $row->{leadValue} * $to_dollar_conv_factor,
                    # $row->{maxContentCpc} * $to_dollar_conv_factor,
                    # $row->{maxCpc} * $to_dollar_conv_factor,
                    # $row->{maxCpm} * $to_dollar_conv_factor, $row->{pageviews},
                    # $row->{pageviewValue} * $to_dollar_conv_factor,
                    # $row->{purchases},
                    # $row->{purchaseValue} * $to_dollar_conv_factor,
                    # $row->{topPosition},
                    # $row->{convValue} * $to_dollar_conv_factor,
                    # $row->{transactions}, $row->{convVpc},
                    # $row->{valueCostRatio},
                    "\n";
            }
            close(BLFH);
        };

        if ($@) {
            print("Report job failed validation due to the following error: ", $@, "\n");
            exit (-1);
        }
    }

    close(TFH);
    return 1;
}

Bulk loading into data warehouse staging and perform summarization.

#- Psuedo code
#-  1. Cleanup or perform any needed transformation of data in the application.
#-  2. Concatenate all the campaign keyword performance data to a file(s).
#-  3. Bulk load file into a staging table.
#-  4. Transform and summarize the data through SQL.

Enjoy,
Shiva

Adwords Keyword Performance data pull

If you are handling multiple accounts or a large account at Adwords center managing few tens of thousand of keywords to millions of keywords, you would be interested in tracking the keyword performance over a period of time. You can do so by pulling daily summary report from Google through Adwords API and storing the data in your data mart or data warehouse for further analysis.

Pulling the data through API involves 6 steps:
   1. definedReportJob object - Specifies the type of report you want.
   2. validateReportJob - Validate the report that you would like to get. Returns nothing.
   3. scheduleReportJob - Now schedule the job for above object. This request returns an ID.
   4. getReportJobStatus - Get status of the report. If "Completed" go to next.
   5. getReportDownloadUrl - Retrieve the URL where your report is stored. The returned URL is valid for five minutes. If the URL expires, just call getReportDownloadUrl again. Keep this URL private, since anyone can use it to access your report.
   6. getGzipReportDownloadUrl - Download your report by sending an HTTP GET request.

Following pseudo code and code explains each step execution in more detail. I use Perl SOAP calls to generate Adwords requests. Also, note the following code works well with API version v13 and Google is transitioning many of its Adwords API to v2009. As of Dec.2009 ReportService was still available only in v13.

Google NS Individual Search Box

Today, while searching for "cygwin" keyword I was presented with results that caught my eye. The first result had its own search box. See picture below. Had anyone seen this before? I liked the feature and pretty handy if you are looking for a specific information from a specific site. I entered "latest version" and hit "Search cygwin.com" that showed search entry "latest version site:cygwin.com" in the next search. Nice.

Other posts that may be of interest:
1. Best Google home page logos
2. Better quicker SEM ROI
3. 99 bottles of beer in different laguages

Cheers,
Shiva

Forecasting SEM ROI with Eureqa

There is a is powerful tool to detect equations and hidden relationships among the data sets. What makes it irresistible tool is its cost - FREE!. It is Eureqa. In this blog I will explain how to find relationship between cost and long term revenue for a keyword campaign in Google Adwords. This is a simplied version of identifying a relationship at macro level and then drilling down further to identify more details.

The sample I have taken is over two months summarized data at daily level for a campaign. Since cost and revenue are affected by other factors including the keywords or ad words that comprised a campaign, competitors' bidding, landing page or page optimizations, etc., it is important to note that the following is to get a quick insight into how cost is related to revenue for a particular campaign. This is one of important arsenal out of few to get better positive ROI.

The data set is shown below in the application's tab 1(Enter Data). This very much looks like Excel spreadsheet!. First column is cost and second is long term revenue.

Related Posts: Generating quick positive SEM ROI

Quick ROI for SEM (Adwords) Defined

SEM or Internet paid advertisement or PPC or Adwords or what ever one may call, has been on high growth rate for many years now and has shown that it will continue to grow in the near future with the potential to grab market share from other advertisement channels. This has implications on all companies that would like to have their products or services advertised. By quickly responding to the market changes and being nimble, they can effectively spend their marketing dollars with positive ROI.

Though I use Google Adwords in the following explaination, you can replace the same logic for Yahoo and MSN. I assume that you are somewhat familiar with natural search, paid search, clicks, impressions and related; and what they mean.

Related posts: Quick ROI forecasting technique

Updating MySQL root password in Windows

Recently on one of my personal desktop running Windows Vista OS I installed latest MySQL (64 bit, 5.1.41-Community version). Installation was straight forward and every thing went well including default configuration for DSS (Decision Support System). While all was well, I did not revisit the system for few days and forgot the password. Here are the steps I took to fix with the new password.

NOTE: Perform the following as a system an "Administrator".

1. Stop the service ("MySQL") from the task manager.

2. Using cygwin window, get to "/cygdrive/c/Windows/system32" directory and run

/cygdrive/c/Windows/system32> sc queryex MySQL
#-- If it is running you will see something like this:
# SERVICE_NAME: MySQL
#        TYPE               : 10  WIN32_OWN_PROCESS
#        STATE              : 4  RUNNING
#                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
#        WIN32_EXIT_CODE    : 0  (0x0)
#        SERVICE_EXIT_CODE  : 0  (0x0)
#        CHECKPOINT         : 0x0
#        WAIT_HINT          : 0x0
#        PID                : 92940
#        FLAGS              :
#
#-- If mysql was stopped, then you will see
# SERVICE_NAME: MySQL
#        TYPE               : 10  WIN32_OWN_PROCESS
#        STATE              : 1  STOPPED
#        WIN32_EXIT_CODE    : 0  (0x0)
#        SERVICE_EXIT_CODE  : 0  (0x0)
#        CHECKPOINT         : 0x0
#        WAIT_HINT          : 0x0
#        PID                : 0

#-- Note: sc.exe: This tool provides a way to communicate with Service Controller
#-- (Services.exe) from the command prompt, to retrieve information about services.

3. Run

/cygdrive/c/Program Files/MySQL/MySQL Server 5.1/bin> ./mysqld.exe --skip-grant-tables &
#-- Output will be something like:  [1] 5678 

/cygdrive/c/Program Files/MySQL/MySQL Server 5.1/bin> ./mysql --user=root mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.1.41-community MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>      #-- Now you are connected to mysql server with access control disabled.
mysql> update user set Password=PASSWORD('YOUR-NEW_PASSWORD-HERE') WHERE User='root';
Query OK, 0 rows affected (0.00 sec)
Rows matched: 1  Changed: 0  Warnings: 0

mysql> flush privileges;   #-- This reloads the privileges from the grant tables in the mysql database.
Query OK, 0 rows affected (0.01 sec)

mysql> select version();   #-- A quick test...
mysql> exit;               #-- Exit from mysql

/cygdrive/c/Program Files/MySQL/MySQL Server 5.1/bin> fg    #-- Now back at the cygwin prompt bring the background process and kill it. 
#-- Or kill it (mysqld) from task manager under "Processes tab".

Note: The msyql.com documentation (http://dev.mysql.com/doc/refman/5.0/en/resetting-permissions.html#resetting-permissions-windows) has similar technique, it seem to be not updated with the latest. For example, there is no "mysqld-nt" executable under bin directory! You don't need to create a text file as well to do the steps shown here.

4. Now start the MySQL service from task manager "Services tab".

5. Login to MySQL using the new password from CLI or Toad or any client you want to connect to database.

Other commands that might be of helpful are:

/cygdrive/c/Program Files/MySQL/MySQL Server 5.1/bin> ./mysql_upgrade.exe -u root -p
Enter password: ********
#- It checks all tables in all databases for incompatibilities with the current version of MySQL Server after any upgrades done to MySQL.

/cygdrive/c/Program Files/MySQL/MySQL Server 5.1/bin> ./mysql_client_test.exe --help

Cheers,
Shiva

Useful SEM SEO Links

Some of very useful links related to SEM and SEO fields.

Google Adwords

Official Blog
Adwords Tools

Google Analytics (GA)

Google

Google Labs

Google Analytics Data Retriever (API) - SEM, SEO

Google Analytics should be an integral part of any company's marketing data analysis strategy that plays in SEM, SEO field. Not too long ago Google released API for their Analytics and one can use this data or reports to compliment their in-house data analysis.

Here I have provided a small utility program that can be used to retrieve most popular six (6) reports and have used similar utilities that I have built to dump the data into MySQL data warehouse for much bigger data sets.

The application has two Perl modules (GAProfile.pm and GAReport.pm) and a main program (ga_report_retriever.pl). GAProfile.pm module is used to authenticate your account and get the auth_code that will be used in the subsequent calls to GA to get reports. GAReport.pm module is used to pull the report and dump important dimension and metrics to standard output. Put both these modules in the same directory (or copy them to directory in the path of PERL5LIB) as the main program and run it. To get different report, make a change in "ga_report_retriever.pl" passing the name of the report.

Before you start using the GA API, you may want to go through the documentation.

At command prompt> ./ga_report_retriever.pl

MySQL Toad result set filter and compare features

When you work daily in a data warehouse environment, you will inevitably compare or filter or summarize data for further analysis or debugging. The result set shown in MySQL Toad 4.5 has couple of very nice features that will help you compare or filter the results quickly - right in the "Results" pane. In the picture below I was calculating Herfindahl Index for a group of AdWords accounts. The query resulted in date, some daily metric and its index; and if I am further interested in filtering for specific range of index, for example, I can point the mouse at Herfindahl_Index column and click on "filter" icon. See below.

Also, by comparing the data between two result sets, the data comparison feature avoids bringing data to Excel or running sub-queries. For example, I modified the original query and ran it again and wanted to quickly see for any difference in numbers between two results sets at date level, which I could by just doing the data comparison between result 7 & 8 below. I know I have couple of ways of comparing the data as noted above, but being able to compare in Result set pane was the quickest - didn't need to modify the query nor move the data.

Finally, you can use Pivot & Chart by dragging and dropping the columns to the grid area. For small data sets this is handy.

Hope that helps,
Shiva

Sharing files between Linux and Windows

If you are running a guest and host operating systems, many a times you would like to share some drives or directories between them. It may be to share some documents or data or text files, etc. And to share a directory (example, C:/Share/Edu) between Windows XP host and Ubuntu guest OS you can do the following.

1. First allow for sharing on Windows by right click on the directory of interest (say, C:/Share/Edu) and select "Sharing And Security..." and it will bring up "Edu Properties" window. Select Sharing tab and set "Allow this number of users" to 2 or the number you like. Since the sharing is between guest and host OSes, setting it to 2 will suffice.

In Windows Explorer > Go to C:/Share/Edu > Right click and select "Sharing And Security" > (In new window) > Select "Sharing" tab > Set "Allow this number of users" radio button value to "2"

2. Assuming you have already installed Ubuntu (version 9 or above), go to System > Administration > Synaptic Package Manager as shown below which will bring up new window and look for smbfs (Samba File System) and choose it for download. Ubuntu will automatically select other packages that are needed for this package.
Note: You need to have sudo authority (or root privileges) to download and install a package.

3. Once done bring up a terminal on Ubuntu and run

hadoop@ubuntu> sudo -s  (if you are not root already)
root@ubuntu> mkdir /mnt/win_edu
root@ubuntu> mount -t smbfs -o username=YOUR_WINDOWS_USER_NAME  //WINDOWS_HOSTNAME/Edu  /mnt/win_edu
Password: ENTER_PASSWORD
# Replace YOUR_WINDOWS_USER_NAME, WINDOWS_HOSTNAME and ENTER_PASSWORD appropriately.
# This will mount windows shared "Edu" directory on to /mnt/win_edu mount point.
#  NOTE: Instead of "smbfs" option above you can also use "cifs - Common Internet File System" supported by Samba, Windows 2000, XP, etc.

4. Change directory to /mnt/win_edu from Ubuntu terminal and you should see all windows files in Edu directory. You can also check all mounted directories by running

root@ubuntu> mount

5. After you are done to unmount the file system simply run

root@ubuntu> umount -v -d /mnt/win_edu

Or if you just want to copy files between two systems use WinSCP freeware. Get the guest IP address by running ifconfig and look for "inet addr", then connect to it from WinSCP.

hadoop@ubuntu> ifconfig
eth1      Link encap:Ethernet  HWaddr 00:0c:29:f7:a0:41  
          inet addr:10.15.14.107  Bcast:10.15.15.255  Mask:255.255.254.0

Cheers,
Shiva

Beauty of 99 Bottles of Beer in Perl

Perl is well known for allowing a software developer to express the solution in more than one way. And it can be harnessed to be as creative or cryptic as one likes to be. Here is one example I saw sometime back on http://99-bottles-of-beer.net/language-perl-737.html

Run the program as:

#!/usr/local/bin/perl


    ''=~(        '(?{'        .('`'        |'%')        .('['        ^'-')
    .('`'        |'!')        .('`'        |',')        .'"'.        '\\$'
    .'=='        .('['        ^'+')        .('`'        |'/')        .('['
    ^'+')        .'||'        .(';'        &'=')        .(';'        &'=')
    .';-'        .'-'.        '\\$'        .'=;'        .('['        ^'(')
    .('['        ^'.')        .('`'        |'"')        .('!'        ^'+')
   .'_\\{'      .'(\\$'      .';=('.      '\\$=|'      ."\|".(      '`'^'.'
  ).(('`')|    '/').').'    .'\\"'.+(    '{'^'[').    ('`'|'"')    .('`'|'/'
 ).('['^'/')  .('['^'/').  ('`'|',').(  '`'|('%')).  '\\".\\"'.(  '['^('(')).
 '\\"'.('['^  '#').'!!--'  .'\\$=.\\"'  .('{'^'[').  ('`'|'/').(  '`'|"\&").(
 '{'^"\[").(  '`'|"\"").(  '`'|"\%").(  '`'|"\%").(  '['^(')')).  '\\").\\"'.
 ('{'^'[').(  '`'|"\/").(  '`'|"\.").(  '{'^"\[").(  '['^"\/").(  '`'|"\(").(
 '`'|"\%").(  '{'^"\[").(  '['^"\,").(  '`'|"\!").(  '`'|"\,").(  '`'|(',')).
 '\\"\\}'.+(  '['^"\+").(  '['^"\)").(  '`'|"\)").(  '`'|"\.").(  '['^('/')).
 '+_,\\",'.(  '{'^('[')).  ('\\$;!').(  '!'^"\+").(  '{'^"\/").(  '`'|"\!").(
 '`'|"\+").(  '`'|"\%").(  '{'^"\[").(  '`'|"\/").(  '`'|"\.").(  '`'|"\%").(
 '{'^"\[").(  '`'|"\$").(  '`'|"\/").(  '['^"\,").(  '`'|('.')).  ','.(('{')^
 '[').("\["^  '+').("\`"|  '!').("\["^  '(').("\["^  '(').("\{"^  '[').("\`"|
 ')').("\["^  '/').("\{"^  '[').("\`"|  '!').("\["^  ')').("\`"|  '/').("\["^
 '.').("\`"|  '.').("\`"|  '$')."\,".(  '!'^('+')).  '\\",_,\\"'  .'!'.("\!"^
 '+').("\!"^  '+').'\\"'.  ('['^',').(  '`'|"\(").(  '`'|"\)").(  '`'|"\,").(
 '`'|('%')).  '++\\$="})'  );$:=('.')^  '~';$~='@'|  '(';$^=')'^  '[';$/='`';

Character set, Collation

Here is an interesting challenge that I ran into in the Data Warehouse environment. While debugging a particular business user query, I joined between a string built in a sub-query to a column in a particular fact table with some 20million rows. BTW, I was using MySQL 5.0 and Toad 4.5 to run this query and instead of getting any result I got the following error!

MySQL Database Error: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation '='    22    0

The query was simple and similar to

SELECT s.str, s.d1, s.d2
FROM (
    SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
    FROM table_1
    WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str = t2.str
AND ...

Regular Expressions

MySQL regular expressions (RE) are a powerful tools that can be very useful in SQL string searches. They enable software engineer to build statements that are very concise and to handle complex string operations that otherwise wouldn't be possible.

If you are new to regular expressions or would like to know more about them, http://www.regular-expressions.info/ is a good site to visit. You can also get a RE tutorial at net|tuts+. Following is quick list of meta characters that can get you started in using them.

. => A dot matches single character

* => An asterisk matches zero or more of previous matched tokens

? => A question mark matches zero or one time the previous matched token

$ => A dollar at the end anchors the search to the end of string

^ => A caret symbol anchors the search to the beginning of the string

| => A pipe matches either of the two. Example: abc|xyz => either 'abc' or 'xyz'
{m,n} => A quantifier matching between 'm' and 'n' times. m & n are integers.

Different computer languages have some variations when it comes to more advanced searches and how they handle given character sets. MySQL uses REGEXP string function to implement and matches the string in case "in-sensitive" mode and to match otherwise see this blog.

MySQL User Defined Variable

This is a brief introduction to user defined variables in MySQL as many times I have gotten queries regarding how to use them in dynamic queries.

In MySQL you can easily create user defined variable and use it through out that particular session or client connection. A variable is a alphanumeric characters following a "@". In the versions 5 and above the name is case-insensitive and hence @VAR, @vaR all mean the same. For example:

set @var = 1;

The above statement creates a variable called "var" and sets it to 1. Also note you don't need to declare it before using it.

The statement,

set @var1 = 1, @var2 = 2, @var3 = 3;

sets all three variables in single statement. You can then select the variables from

select @var1, @var2, @var3;

Now, let us say that you would like to select particular row(s) from a given table. To do so you need to first build the sql string, prepare it and then execute it. This allows you to pass the variables one or more times to the same statement.

For this let us assume we have a table (table_t) that has two columns - id and name and has 3 rows (1, 'one'), (2, 'two') and (3, 'three'). To select row with id = 1

set @var = 1;

set @st = concat('select id, name from table_t where id = ', @var);

prepare stmt from @st;

execute stmt ;

And to select with string variable like names you need to escape the single quote as below.

set @var = 'one'

set @st = concat('select id, name from table_t where name = ', '''', @var, '''');

prepare stmt from @st;

execute stmt ;

This is a trivial example but you get the idea of how to use user defined variables. The same technique can be used to build more complex stored procedures and statements that are executed often with different variables.

Cheers,

Shiva

Case Sensitive SQL Search

MySQL on default installation uses the case insensitive configuration meaning the data you stored and searched will not be case sensitive.

Let us first check our current server and database's appropariate variables.

SHOW VARIABLES
WHERE variable_name = 'character_set_server'

> latin1

SHOW VARIABLES
WHERE variable_name = 'collation_database'

> latin1_swedish_ci

Since we are using latin based language settings, the cases are different unlike some languages that don't have cases. And with the database collation set to latin_swedish_ci the string search will be case insensitive (ci).

For this illustration, I have created a table "country" with single column "country_name" and inserted 3 rows.

CREATE TABLE country (
     country_name varchar(32) NOT NULL
);
INSERT INTO country
VALUES ('US'), ('UK'), ('ukraine')
; 
COMMIT;

Google Docs Spreadsheet - Better access control needed

I use Google docs to maintain many documentations for a non-profit organization and this has been a very useful, money & time saving application. In my view, following three features will immediately enhance what is already a excellent collaborative product. These are low-hanging fruits.

User access control at worksheet level to view
Copying more than 1000 (one thousand cells) at a time
Multi-column sorting

In our non-profit organization we have multiple chapters controlling their own documents/ spreadsheets and some times there are situations where I would like to have a single spreadsheet but each work sheet is visible only to a particular chapter or member. For example, a common spreadsheet with 50 worksheets (one for each US state) should appropriately be visible to these chapters. A California chapter can view California worksheet only and not others.

Currently, I can control the editing access through "Protect Sheet" but not the viewing. See below.

Best of Google homepage logos over the years

Google modifies their home page by including birthdays of some major personalities around the world or major events. Here are some of the best, unique and presented in more than one country. The list below is not any particular order.

2009.05.20 - Charles Darwin Birthday

2009.11.13 - Water on Moon found

2009.10.02 - Mahatma Gandhi

Introduction and using MySQL Explain

Like in any database query optimization is critical for MySQL data warehouse environment and having a better understanding of the "Explain plan" helps the database application developer avoiding issues with query performance. Also, DBAs will like your queries and they would be more than happy to help you optimize them.

MySQL's "Explain" statement provides details on query parsing and execution steps and outputs 10 fields -
id, select_type, table, type, possible_keys, key, key_len, ref, rows and Extra columns.

You run the statement by issuing

explain <sql query> #-- Insert your query between < and >.

Shown below is the Toad's output of explain plan of a self-joined table's query with where clause. The table has around 20million rows. The query took less than 0.6seconds to index through ~70K rows and with "const" and a func (date_add function) to compare between one days data to its previous day data.

SELECT a.ad_date, a.unit_id, a.max_cpc, 
               SUM(coalesce(a.max_cpc, a.max_cpc) - coalesce(b.max_cpc, a.max_cpc)) diff_cpc
       FROM  sem_kw_summary a
       LEFT JOIN  sem_kw_summary b
           ON  date_add(a.ad_date, INTERVAL -1 day)  = b.ad_date
           AND a.unit_id = b.unit_id
       WHERE date_add(CURRENT_DATE, INTERVAL -1 day) = a.ad_date
         AND a.engine = 'google'
       GROUP BY a.ad_date,  a.unit_id