Yahoo SEM isBooksClosed

As a part of Internet marketing and SEM, you will need to pull daily summaries of keyword performance reports from search engines such as Google, Yahoo and MSN.

In Yahoo's case, you should check "isBooksClosed" to be sure you get final data for your report dates: if isBooksClosed is true, Yahoo analytics has collected all the data for the requested date and the numbers you pull will not change later.

This is especially important if you are pulling the previous day's data early in the morning, as Yahoo may not have had time to collect and finalize the analytics. We have often run into the books-not-closed condition for a few accounts out of many tens of accounts; in those cases we re-pull the data later in the day and then run the daily summaries.
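
The re-pull logic is essentially a retry loop; below is a minimal sketch (fetch_report is a hypothetical wrapper around the Yahoo BasicReportService call, not the actual API):

import time

def fetch_report(report_date):
    """Hypothetical wrapper around the Yahoo report pull; returns the
    report rows and the isBooksClosed flag from the report metadata."""
    # ... call Yahoo BasicReportService here ...
    return [], False

def pull_with_books_closed(report_date, max_retries=6, wait_secs=3600):
    # Re-pull until Yahoo reports isBooksClosed = true for the date,
    # so the daily summaries are built only on final numbers.
    for attempt in range(max_retries):
        rows, books_closed = fetch_report(report_date)
        if books_closed:
            return rows
        time.sleep(wait_secs)   # try again later in the day
    raise RuntimeError("Books still not closed for %s" % report_date)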

I have reproduced the relevant part of Yahoo's explanation from http://developer.searchmarketing.yahoo.com/docs/V6/gsg/BasicReportService.php#books_closed

Google's extended searches

Internet search users are well aware of the Google search engine: they typically enter their query or keywords in the search box and click the result links relevant to them (usually first-page results, and especially the top 3 or 4). A large percentage of users, however, seem unaware of Google's extended searches, which can provide the exact result they are looking for without having to hop to one more site.

For example: the weather for a particular city, the current time at a location, dictionary definitions, area codes, sports scores, and many more. Following are a few examples.

For weather: to find the current weather in Boston, USA, just enter "weather Boston, USA" and you will see today's and the next few days' weather in the first result!

For time: to find the current time in Bangalore, India (Bengaluru), enter "time Bengaluru, India" and the first result is the time.

For dictionary: prefix your query with the "define" keyword, as in "define avatar".

Yahoo SEM daily report - Currency and Dups handling

If you are pulling daily keyword performance data from major search engines into your data warehouse, you need to perform many transformations before daily, weekly or monthly summaries are available to business (end) users. In the case of Yahoo, the two transformations I needed were: one, handle currency conversion; and two, handle Yahoo returning more than one row per keyword when only one is expected. The latter seems to happen when the query or landing page URL is modified for the same bid unit, for high-impression queries, and for disabled queries/keywords.

The following SQL handles both: data is read from the staging table and loaded into the final table after some transformation and cleaning. It performs currency conversion for all countries except the US and Mexico (whose units are already in US dollars and need no conversion).

The sub-query below uses MySQL GROUP_CONCAT to find any duplicate keywords for a bid unit, and dim_curreny holds the daily currency exchange rates pulled from an external third-party service early in the morning. The se_query_group table groups keywords based on company requirements and search engine. The query also filters out any keyword with zero impressions.

The query can be further modified for better performance by joining with the data warehouse query tables and using query_id in the joins, and by using the search engine bid unit as the surrogate key.
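
Since the query itself is not reproduced here, the sketch below shows its general shape, driven from a small Python/MySQLdb script (staging, final and currency table and column names are assumed and should be replaced with your own):

import MySQLdb

LOAD_SQL = """
INSERT INTO f_yahoo_kw_daily
  (report_date, bid_unit, keyword, impressions, clicks, cost_usd)
SELECT s.report_date,
       s.bid_unit,
       SUBSTRING_INDEX(GROUP_CONCAT(s.keyword), ',', 1) AS keyword,   -- keep one keyword per bid unit
       SUM(s.impressions),
       SUM(s.clicks),
       SUM(CASE WHEN s.country IN ('US', 'MX') THEN s.cost            -- already in US dollars
                ELSE s.cost * COALESCE(c.usd_rate, 1) END) AS cost_usd
FROM   stg_yahoo_kw s
LEFT JOIN dim_curreny c
  ON   c.currency_code = s.currency_code
 AND   c.rate_date     = s.report_date
WHERE  s.impressions > 0                                               -- drop zero-impression rows
GROUP BY s.report_date, s.bid_unit
"""

conn = MySQLdb.connect(host="localhost", user="etl", passwd="...", db="dw")
cur = conn.cursor()
cur.execute(LOAD_SQL)
conn.commit()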

Ruby libmysql.dll or mysql_api.so error

Installing the Ruby MySQL module is pretty straightforward on either *nix or Windows by running
> gem install mysql
But on Windows, when you try to load the module with "require 'mysql'", you may get an error like
C:/Ruby/lib/ruby/gems/1.8/gems/mysql-2.8.1-x86-mswin32/lib/1.8/mysql_api.so: 126: The specified module could not be found.   - C:/Ruby/lib/ruby/gems/1.8/gems/mysql-2.8.1-x86-mswin32/lib/1.8/mysql_api.so (LoadError)
#  from C:/Ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
...
...

In that scenario, you are missing "libmysql.dll" in the system32 directory (or elsewhere on the PATH). In my case I had a MySQL server installation with the DLL in C:/Program Files/MySQL/MySQL Server 5.4/lib/opt, so copying it to system32 resolved the error.

Mapping URL to top level domain

When parsing web logs as part of web analytics, one might be interested in top-level domain analysis before drilling down further. The following query maps a URL (referrer) to its top-level domain. Top-level domains include not only generic ones like '.com', '.edu', '.org', etc., but also country-code ones like '.uk', '.de', etc. You can download the list from IANA (Internet Assigned Numbers Authority).

After downloading it, import the list into a table dim_int_mktg_tld, which you can join with the staging data.
CREATE TABLE `dim_int_mktg_tld` (
  `tld` varchar(16) NOT NULL,
  `type` varchar(32) default 'country-code',  #-- Can be country-code or generic
  `info` varchar(500) default NULL,
  PRIMARY KEY  (`tld`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

#--  Join with yesterday's staging data and get TLD
SELECT WL.url, TLD.tld
FROM stg_weblog_date WL
LEFT OUTER JOIN dim_int_mktg_tld TLD
ON substring_index(
     substring_index(
       substring_index(url, '://', -1),
       '/',1), 
    '.',-1) = TLD.tld
WHERE WL.date = DATE_ADD(CURRENT_DATE, interval -1 day)

#-- Example: http://www.abc.co.au/in/test
#-- Innermost substring_index returns everything to the right of '://' 
#--   i.e, www.abc.co.au/in/test,
#-- then the middle one gets everything to the left of first '/'
#--   i.e, www.abc.co.au
#-- then the outer most substring_index returns everything to the right of last '.'
#--   i.e, au
#-- which is then joined with TLD table

HTH,
Shiva

Download Google keywords through SUDS

There are situations where you may want to download all the keywords (criteria) uploaded to Google Adwords, and if you have a substantial keyword basket (say, more than a million), downloading them manually or checking through the tools provided by Google becomes prohibitive. Also, the Google daily keyword performance report provides metrics only for keywords that have had at least one impression.

The following application is a simple tool that downloads all keywords and dumps the data to different files. I have removed error-handling code and hardcoded many values to simplify it; you can modify it for your requirements, and don't forget to add your account and API details. I have been able to download a few million keywords using multiple processes.

Environment:
Ubuntu Karmic, Python 2.6.4, Adwords API version 13 (v13), and the Python SUDS SOAP module.
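
The full tool is not reproduced here; the sketch below shows the general shape using SUDS (the CriterionService WSDL path, the getAllCriteria call, the header handling and the Criterion field names are assumptions to verify against the v13 WSDL):

from suds.client import Client
from suds.sax.element import Element

V13_NS = 'https://adwords.google.com/api/adwords/v13'

def make_headers():
    # Same header fields as used in the Perl report pull; values are placeholders.
    # Depending on the service, these elements may need to be qualified with the v13 namespace.
    names = ('email', 'password', 'useragent', 'developerToken', 'applicationToken')
    values = ('you@example.com', '...', 'kw-downloader', '...', '...')
    return tuple(Element(n).setText(v) for n, v in zip(names, values))

def dump_adgroup_keywords(adgroup_id, out_path):
    # getAllCriteria is the assumed v13 CriterionService call; field names below are assumed.
    client = Client(V13_NS + '/CriterionService?wsdl')
    client.set_options(soapheaders=make_headers())
    criteria = client.service.getAllCriteria(adgroup_id)
    out = open(out_path, 'w')
    for c in criteria:
        out.write('%s\t%s\t%s\n' % (adgroup_id, c.id, c.text))
    out.close()

Running one such function per ad group (or per process, as mentioned above) lets you spread a multi-million keyword download across several workers.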

fizzbuzz in Ruby, Python, Perl, more

Recently, I ran into a couple of articles on codinghorror.com that covered interviewing programmers/software engineers. Many of the commenters answered the programming (fizzbuzz) question in their favorite language, like VBA, Ruby, Perl, etc. Since the post was no longer accepting new comments, I thought of posting my versions here on my blog.

fizzbuzz question: for all numbers between 1 and 100, print 'fizz' for any number divisible by 3, 'buzz' for any number divisible by 5, and 'fizzbuzz' for any number divisible by both 3 and 5.

Article 1:  Why Can't Programmers.. Program?
Article 2: The Non-Programming Programmer
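
For reference, here is a straightforward Python version (one of many ways to write it):

for n in range(1, 101):
    if n % 15 == 0:        # divisible by both 3 and 5
        print('fizzbuzz')
    elif n % 3 == 0:
        print('fizz')
    elif n % 5 == 0:
        print('buzz')
    else:
        print(n)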

Adding disk space to Ubuntu (guest OS)

Recently, on one of my VMware systems (Windows XP host OS and Ubuntu guest OS), I ran into a disk space shortage while installing a package (Thrift, a cross-language services framework). Though there are many articles on the Internet about extending guest OS partitions, they were for older versions, no longer applicable, or had many complex steps. After spending some time digging into the issue, I found pretty straightforward steps to achieve what I wanted: doubling the disk space for Ubuntu.

Initial setup:
  Host OS: Windows XP
  Guest OS: Ubuntu Karmic (8G pre-allocated)
  Disk: 148G total, 60G free, out of which another 8G is to be allocated to Ubuntu

Needed:
   Double Ubuntu disk space to 16G by adding a new partition

Step 1:
  Run "disk analyzer" and get current disk space usage data that provides total, used, available and usred.  See below.  The system has less than 10% free disk space and needs more.



MySQL SP parameter for IN clause

Recently, I came across a question on stackoverflow.com where the poster indicated that he was not able to use a parameter string as part of a stored procedure's IN clause to get the right results; the result set always returned only one row.

Test table used for procedure:
CREATE TABLE cities (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(100) NOT NULL,
  PRIMARY KEY  (`id`)
);
insert into cities (name) values 
 ('London'), ('Manchester'), ('Bristol'), 
 ('Birmingham'), ('Brighton');
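
The usual cause is that the whole comma-separated parameter is compared as a single string by IN (...). One common fix is FIND_IN_SET, shown here outside a procedure for brevity (the same WHERE clause works with a VARCHAR parameter inside the stored procedure):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="test", passwd="...", db="test")
cur = conn.cursor()

param = 'London,Manchester,Bristol'   # the comma-separated "IN list"

# name IN (param) would compare each name against one long string;
# FIND_IN_SET matches the name against each element of the list instead.
cur.execute("SELECT id, name FROM cities WHERE FIND_IN_SET(name, %s)", (param,))
for row in cur.fetchall():
    print(row)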

Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull user access date/time along with other dimensions like user, IP, geo data, etc. Here I will be using Hadoop/Hive/Python to pull date and IP data from the access log into Hadoop and run some queries. The example illustrates using Hadoop (version 0.20.1) streaming, SerDe, and Hive's (version 0.4.0) custom mapper plugin (get_access_log_ip).

The steps below load a few thousand rows into the source table “access_log_2010_01_25”, extract the date from a format like DD/Mon/YYYY:HH:MM:SS -0800 into ‘DD/Mon/YYYY’ along with the remote IP address through a Python streaming script, and load the result into the target table (dw_log_ip_test – data warehouse access log).

Step 1: First create a table for the access log (access_log_2010_01_25) and then load data into it.

hive> 
CREATE TABLE access_log_2010_01_25 (
  request_date STRING,
  remote_ip STRING,
  method STRING,
  request STRING,
  protocol STRING,
  user STRING,
  status STRING,
  size STRING,
  time STRING,
  remote_host STRING,
  ts STRING,
  perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s  time=%9$s  host=%10$s timestamp=%11$s  perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'   
    >   OVERWRITE INTO TABLE access_log_2010_01_25;
#- After the load, one of the records would look like:
#- 25/Jan/2010:13:14:05 -0800      123.123.123.123   GET     /xmls/public/thumbnail.xml   HTTP/1.1        -       302     250     0  abcd.com   1264454045    -
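
The Python streaming script (the custom mapper, get_access_log_ip) is not reproduced in full here; a minimal mapper along these lines reads the tab-separated rows coming out of the table above and emits 'DD/Mon/YYYY' plus the remote IP (field positions are assumed to match the column order of the table):

#!/usr/bin/env python
# Minimal Hive streaming mapper: keep the date part of
# "25/Jan/2010:13:14:05 -0800" and the remote IP, tab-separated.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 2:
        continue
    request_date, remote_ip = fields[0], fields[1]
    day = request_date.split(':', 1)[0]        # "25/Jan/2010"
    print('%s\t%s' % (day, remote_ip))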

Hive Metastore Derby DB

If you are using Hive in its default mode, you may see the following behavior: you start the Hive client from different directories and see different results when you run a query like "show tables". For example, say you have Hive installed in /usr/local/hive, you are currently in your home directory, and you run

~> /usr/local/hive/bin/hive    #-- get to hive
hive> create table new_table (c1 string);
hive> show tables;

Now you will see "new_table" in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #-- get to hive
hive> show tables;

Now you don't see "new_table" in your list of tables. Those who come from a typical SQL background may find this a little weird at first, since the results differ depending on where you started the Hive client. The reason is that Hive uses an embedded Derby database to store metadata, and by default it is configured to store metastore_db in the current directory.

After starting Hive from two different directories as above, you will see two "metastore_db" directories, one in the home (~) directory and one in /tmp. You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in "/usr/local/hive/conf/hive-default.xml" as below.

Default setting:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

Update it to:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

"/home/hadoop" is an example and one can appropriately change it to suitable host and directory.  Say,
<value>jdbc:derby:;databaseName=//localhost:1234/metastore_db;create=true</value>

Cheers,
Shiva

Installing MySQL Python Package on Windows

I work in a *nix environment, but sometimes you may want to test some features on your own Windows desktop, either due to access constraints on test or development systems (for example, the company not allowing installation of the latest packages or non-standard tools), or because you have set up your own little world on a desktop to run your tests. Anyway, I needed to install the Python MySQL package, and installing it on Windows is not as straightforward as on Linux systems.

> python setup.py install

The above typically takes care of it in a *nix environment. But on Windows you may see the error
"Unable to find vcvarsall.bat"; if so, you should download and install Microsoft Visual C++ Express Edition. Installing the whole Visual Studio SDK seems like overkill for getting one .BAT file.

Now, to install the MySQL Python package, download it from SourceForge, gunzip it, untar it, and make the following change before running the install.
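
After making that change and running the install, a quick way to confirm the module loads and can reach a server (connection values below are placeholders):

import MySQLdb

print(MySQLdb.__version__)   # the module imported fine

# Optional round trip to a local server.
conn = MySQLdb.connect(host="localhost", user="root", passwd="...", db="test")
cur = conn.cursor()
cur.execute("SELECT VERSION()")
print(cur.fetchone())
conn.close()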

Hadoop Cheat Sheet

Cheat sheet for Hadoop version 0.20.1 (Jan.2010).





Data loading through Talend ETL Studio

In a couple of my earlier posts I talked about loading data pulled from search engines into the staging tables of a data warehouse. You can load the data in a couple of ways, including database bulk-load tools - for example, the MySQL "LOAD DATA INFILE" statement - or using Talend Open Studio or similar ETL tools (Pentaho, Informatica, etc.).

In this article, I will load a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes. The setup used was:
  MySQL 5.1 (Community version)
  Talend Open Studio (3.2.2)
  Perl 5.10.1


Other posts that may be of interest:
1. URL parsing in Python, Perl, MySQL - an example
2. Google Adwords Data Retriever
3. Google Analytics Data Retriever


The data flows through 6 components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput): the input data file is read, transformed into another output file, and then loaded into the database. Component #3 (the sorting component) is not strictly needed but is shown here to stress the ease with which a developer can quickly put together an ETL process. The Palette tab on the right-hand side houses these components and many more. If you use Java instead of Perl, you will likely have more components available; for a list of all available components, take a look at Talend Forge.

The input record format is tab-separated fields with one record per line, and so is the output file in component 4.
# Typical fields in the file are date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url etc.
# 2009-12-12   9876543210   34987650   1147382910   1234567890  0.07  
# 0.226632  247.98   15444  http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html

In component 2, I provide the following regular expression to parse each URL and extract the domain name.
     /(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?/ig

This parses correctly formatted URLs like http://www.amazon.com, https://amazon.com, http://www.amazon.co.uk, amazon.com, and more.
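
A quick way to sanity-check the pattern outside Talend (here in Python; the tPerlRow component applies the same expression with the i flag):

import re

URL_RE = re.compile(r'(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?',
                    re.IGNORECASE)

for url in ('http://www.amazon.com', 'https://amazon.com',
            'http://www.amazon.co.uk', 'amazon.com'):
    m = URL_RE.search(url)
    print('%s -> %s' % (url, m.group(3)))   # prints "amazon" for all four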

The sorted stream coming out of the regexp component is written to an output file (component 4), which is then read by a filter that drops any record missing the primary key before loading into the staging database. You can load into the database without the primary-key constraint to speed things up further, and clean up the duplicates before pushing into dimension or fact tables.

The whole run took a little over a minute (~63 seconds) on a test system running all three (MySQL, Talend, Perl). For daily loads, you can use a scheduler to kick off the job and automate the loading process.






 Hope that helps,
Shiva

URL parsing in Python, Perl, MySQL

There are many situations when one needs to parse URLs: in web log parsing, or in Google keyword or other performance data downloaded through APIs. Application languages like Python, Perl, PHP, etc. provide modules for parsing, or you can use raw regular expressions. Below I show a few ways of doing it with a simple example - either in the application before loading into the staging area of the data warehouse, or in SQL (MySQL) after the load.

In the example URL http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get the sub-domain name and file name parts, i.e., www.hiregion.com and sem-google-adwords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions - An intro
2. 99 bottles of beer in many languages


import time
import re




SEM - Google Adwords API Data Retriever

If you are running many large campaigns, or thousands of campaigns, and want to track keyword performance in Google Adwords, the following application helps retrieve the data at regular intervals (daily, weekly, monthly, etc.). Using your data warehouse and reporting infrastructure, you can then slice and dice the data based on your business requirements.

The application is a multi-threaded implementation of pulling data through the API. Each thread gets a set of campaign IDs for which to pull the keyword performance report. The code included here is skeletal; you need to add error handling, code to suit your database schema, and some ETL. In my environment I used MySQL 5.0, Linux (2.6) and Windows, Perl 5.10.1, and Google Adwords API v13. You can also use Google's client libraries, available in many languages including Java, Python, .NET, Perl, PHP and others.


Other posts that may be of interest:
1. Google Analytics Data Retriever
2. Bulk data loading using Talend
3. Google Keyword Performance data pull


Campaign Class:

package Campaign;

#-------------------------------------
# Campaign Package
#-------------------------------------

use strict;
use DBI;

sub new {
    my $self = {};
    $self->{campIds} = [];  #-- Master campaign ID array

    bless($self);

    $self->_populateCampaignIds();
    return $self;
}

sub _populateCampaignIds {
    my $self = shift;

    #--------------------------
    #- Connect to database to get list of campaigns (IDs).
    #--------------------------
    #- my $dbh = DBI->connect($data_source, $user_name, $pw);
    #- my $sth = $dbh->prepare(
    #-                  "SELECT campaign_id FROM campaign
    #-                   WHERE  search_engine = 'google'
    #-                     AND  status = 'active'
    #-                  ");
    #- $sth->execute() or die "Unable to execute a query.\n";
    #- while ( my @id = $sth->fetchrow_array ) {
    #-     push (@{$self->{campIds}}, $id[0]);
    #- }
    #-
    #- $dbh->disconnect;

    #---- For demo, I have hard-coded some IDs.
    push( @{$self->{campIds} }, '1111');
    push( @{$self->{campIds} }, '2222');
    push( @{$self->{campIds} }, '3333');

    return 1;
}

#----------------------------
#- Returns master list of campaign IDs
#----------------------------
sub getCampaignIds {
    my $self = shift;

    return @{ $self->{campIds} };
}

#----------------------------
#- Returns chunk of campaign IDs from master array.
#----------------------------
sub getCampaignIdChunk {
    my $self = shift;
    my $chunks = shift;
    my $index = shift;

    if (!defined $index or !defined $chunks) { 
        return undef; 
    }

    my $chunkSize = int (scalar(@ {$self->{campIds}})/$chunks);
    if ($chunkSize < 1) {
        print "Error: no campaigns to process for thread $index\n";
        return undef;
    }
    my $lIx = $chunkSize * ($index - 1);  #-- low index
    my $hIx = $chunkSize * ($index);      #-- high index

    my @campIds = ();
    if ($index < $chunks) {
        @campIds = splice (@ {$self->{campIds}}, $lIx, $chunkSize);
    } else {
        #- print "Getting LAST chunk for $index from lower $lIx element and a chunk size of $chunkSize\n";
        @campIds = splice (@ {$self->{campIds}}, $lIx);
    }

    return @campIds;
}

1;


Prototype application:

#-------------------------------------------------
# Adwords Data Retriever
#   This is skeleton application to pull Adwords Keywords Performance Report Data
#   from Google through Adwords API.
#   Uses multi-threaded programming technique for simultaneous pull of many campaigns. 
#   Campaign ids in the application are hard-coded with dummy values and hence replace them by connecting
#     to your database and getting actual campaign ids (code shown).
#   Also, for authentication use your user name, password, developer & application code.
# Valid for Google Adwords v13.  
# Author: Shiva M.
#-------------------------------------------------

#- use lib "Add directory where you installed Campaign.pm";
use strict;
use Benchmark;
use Getopt::Std;
use Data::Dumper;

use XML::Simple;
use LWP::Simple;
use Crypt::SSLeay;
use SOAP::Lite;
use IO::Uncompress::Gunzip qw();

use threads;
use threads::shared;

use Campaign;

my (%opts, @worker);
my $NUM_THREADS = 3;     #-- Modify this to suit your environment. Or better use Config module with ini file to setup application configuration.
my $GOOGLE_API = 'https://adwords.google.com/api/adwords/v13';

#-------------------------------------------------
# Main program.
#-------------------------------------------------
getOpts();

if ($opts{'s'} == 1) {
    for (my $i = 1; $i < ($NUM_THREADS+1); $i++) {
        print "MASTER: Starting thread " . $i, "\n";
        $worker[$i] = threads->new(\&pullCampaignData, $i);
        sleep 5;
    }

  while(anyRunningThread()) {
     for (my $ii = 1; $ii < scalar(@worker); $ii++) {
         my $thr = $worker[$ii];
         if ($thr->is_joinable) {
            my $ret = eval { $thr->join() };
            print "Thread $ii returned: $ret\n";
         }
     }
     sleep(10);
   }
}

exit (0);

#-------------------------------------------------
# Subroutines.
#-------------------------------------------------
sub getOpts {
    if (!getopts('s:d:', \%opts)) {
        print "Invalid option. Exiting\n";
        exit (-1);
    }

    return 1;
}

#-----------------------------
#- Any thread still running?. Returns Yes/No (1/0)
#-----------------------------
sub anyRunningThread {
    my $runThrs = threads->list(threads::running);
    print "Still $runThrs threads are running.\n";
    if ($runThrs) { return 1; }

    return 0;
}

#---------------------------
#- Each thread to pull keyword performance reports for a list of campaigns
#---------------------------
sub pullCampaignData {
    my ($tId) = @_;
    my $thr = $worker[$tId];

    my $to_dollar_conv_factor = 1 /(1000 * 1000);  #-- Currency in micros. $1 <=> 1000 * 1000;

    my $log = "thread_$tId.log";
    open (TFH, ">$log") or die "Unable to open file $log\n";   #-- Log file for the thread.

    #---------------------------------
    #- Get this thread share of load - list of campaign IDs
    #---------------------------------
    my $campaign = Campaign->new();
    my @tCampIds = $campaign->getCampaignIdChunk($NUM_THREADS,$tId);

    print TFH "\npullCampaignData \$tId: $tId: campaignIDs:", map {"$_ "} @tCampIds;
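
    #- Define $email, $pw, $ua_name, $developer_token and $application_token
    #- (your account and API credentials) and $yesterday_date (the report date)
    #- before building the SOAP headers below, e.g. from a config file.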

    my @headers = (SOAP::Header->name('email'            => $email),
                   SOAP::Header->name('password'         => $pw),
                   SOAP::Header->name('useragent'        => $ua_name),
                   SOAP::Header->name('developerToken'   => $developer_token),
                   SOAP::Header->name('applicationToken' => $application_token)
                  );

    my $url = sprintf('https://%s.google.com/api/adwords/v13/ReportService', 'adwords');
    my $wsdl = $url . '?wsdl';
    my $service = SOAP::Lite->service($wsdl)->autotype(0)->readable(1)->proxy($url);
    print TFH "\ndebug:service=$service" if $opts{d};

    $service->on_fault(
                sub {
                     my $response = $_[1];
                     die('The following SOAP fault occurred:', "\n", 'faultcode: ',
                         $response->faultcode(), "\n",
                         'faultstring: ', $response->faultstring(),
                         "\n");
                }
    );
    SOAP::Lite->import(+trace => 'debug') if $opts{d};


    #---------------------------
    # Make SOAP call.
    # Get one campaign data at a time. Generate a file to bulk load into database.
    #---------------------------

    foreach my $cId (@tCampIds) {

        my $AWJobParams = {
                  'selectedReportType' => 'Keyword',
                  'name'               => 'AW Keyword Perf Daily Report',
                  'aggregationTypes'   => ['Daily'],
                  'startDay'           => $yesterday_date,
                  'endDay'             => $yesterday_date,
                  'selectedColumns'    => [ qw (
                     KeywordId Keyword CampaignId CampaignStatus
                     AdGroupId AdGroupStatus AdGroupMaxCpa AdWordsType
                     Impressions Clicks CTR CPC Conversions ConversionRate Cost
                     AveragePosition QualityScore AverageConversionValue BottomPosition
                     CPM CostPerConverstion CostPerTransaction DailyBudget DefaultCount
                     DefaultValue FirstPageCpc KeywordDestUrlDisplay KeywordStatus
                     KeywordTypeDisplay LeadCount LeadValue MaxContentCPC MaximumCPC
                     MaximumCPM PageViewCount PageViewValue SalesCount SalesValue
                     TopPosition TotalConversionValue Transactions ValuePerClick
                     ValuePerCost CustomerName ExternalCustomerId )  
                                          ],
                  'crossClient'        => 1,
                  'campaigns'          => [$cId]
        };


        #------------------------------
        # Step 1. Define a job
        #------------------------------
        my $report_job = SOAP::Data->name('job' => $AWJobParams);
        $report_job->attr({'xsi:type' => 'DefinedReportJob', 'xmlns' => "$GOOGLE_API"});

        #------------------------------
        # Output XML file after unzipped.
        #------------------------------
        my $output = "${cId}_kw_perf.xml";


        #-------------------------------
        # Catch any exception
        #-------------------------------
        eval {

            #------------------------------
            # Step 2. Validate report.
            #------------------------------
            $service->call('validateReportJob' => $report_job, @headers);
            print TFH "debug: about to schedule report \n";

            #------------------------------
            # Step 3. Schedule report.
            #------------------------------
            my $job_id = $service->call('scheduleReportJob' => $report_job, @headers)->result();
            $job_id = SOAP::Data->name('reportJobId' => $job_id);

            #------------------------------
            # Step 4. Get report status. Wait until finish or error.
            #------------------------------
            my $status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
            while ($status ne 'Completed' and $status ne 'Failed') {
                print TFH "Report job status is $status\n";
                sleep(30);
                $status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
            }
            if ($status eq 'Failed') {
                die("Report job generation failed.\n");
            }

            #------------------------------
            # Step 5. Get report URL to download the file.
            #------------------------------
            my $report_url = $service->call('getGzipReportDownloadUrl' => $job_id, @headers)->result();
            print TFH "Report is available at $report_url\n" if $opts{d};

            #------------------------------
            # Step 6. Get the data file (report).
            #------------------------------
            my $gzip_report_data = LWP::Simple::get($report_url);


            #------------------------------
            # Step 6a. Save it to local file.
            #------------------------------
            my $localFile = "${cId}_kw_perf.xml.gz";
            open(LFH, ">$localFile") or die "Could not open $localFile: $!\n";
            print LFH $gzip_report_data . "\n";
            close LFH;


            #------------------------------
            # Step 7. Unzip the file and build xml hash.
            #------------------------------
            IO::Uncompress::Gunzip::gunzip(\$gzip_report_data => $output);

            my $xml_hash = XMLin($output);
            print TFH Dumper($xml_hash) if $opts{d};

            #------------------------------
            # Step 8. Create database bulk load file.
            #------------------------------
            open (BLFH, ">${cId}_bulk_load_input.txt") or die "Unable to create bulk load file: $!\n";


            #-- Uncomment the fields that are of interest to you.
            foreach my $row (@{$xml_hash->{table}->{rows}->{row}}) {
                print BLFH
                    $row->{keywordid},"\t", $row->{keyword},"\t",
                    $row->{campaignid},"\t",
                    # $row->{campaign},"\t",
                    $row->{campStatus},"\t", $row->{adgroupid},"\t",
                    # $row->{adgroup},"\t",
                    # $row->{agstatus},"\t",
                    # $row->{agMaxCpa} * $to_dollar_conv_factor,"\t",
                    # $row->{adwordsType}, $row->{keywordid}, $row->{keyword}, $row->{kwStatus},
                    # $row->{customerid}, $row->{acctname}, $row->{kwDestUrl},
                    $row->{imps},"\t", $row->{clicks},"\t", $row->{ctr},"\t",
                    # $row->{cpc} * $to_dollar_conv_factor, $row->{conv},
                    # $row->{convRate},
                    $row->{cost} * $to_dollar_conv_factor,"\t",
                    $row->{pos},"\t",
                    $row->{qualityScore},"\t",
                    # $row->{avgConvValue} * $to_dollar_conv_factor,
                    # $row->{bottomPosition}, $row->{cpm} * $to_dollar_conv_factor,
                    # $row->{costPerConv} * $to_dollar_conv_factor,
                    # $row->{cpt} * $to_dollar_conv_factor,
                    # $row->{budget} * $to_dollar_conv_factor, $row->{defaultConv},
                    # $row->{defaultConvValue} * $to_dollar_conv_factor,
                    # $row->{firstPageCpc} * $to_dollar_conv_factor,
                    $row->{kwDestUrl},"\t", $row->{kwType},
                    # $row->{leads}, $row->{leadValue} * $to_dollar_conv_factor,
                    # $row->{maxContentCpc} * $to_dollar_conv_factor,
                    # $row->{maxCpc} * $to_dollar_conv_factor,
                    # $row->{maxCpm} * $to_dollar_conv_factor, $row->{pageviews},
                    # $row->{pageviewValue} * $to_dollar_conv_factor,
                    # $row->{purchases},
                    # $row->{purchaseValue} * $to_dollar_conv_factor,
                    # $row->{topPosition},
                    # $row->{convValue} * $to_dollar_conv_factor,
                    # $row->{transactions}, $row->{convVpc},
                    # $row->{valueCostRatio},
                    "\n";
            }
            close(BLFH);
        };

        if ($@) {
            print("Report job failed validation due to the following error: ", $@, "\n");
            exit (-1);
        }
    }

    close(TFH);
    return 1;
}


Bulk load into the data warehouse staging area and perform summarization.

#- Pseudo code
#-  1. Cleanup or perform any needed transformation of data in the application.
#-  2. Concatenate all the campaign keyword performance data to a file(s).
#-  3. Bulk load file into a staging table.
#-  4. Transform and summarize the data through SQL.

Enjoy,
Shiva

Adwords Keyword Performance data pull

If you are handling multiple accounts or a large account in the Adwords center, managing a few tens of thousands to millions of keywords, you will be interested in tracking keyword performance over time. You can do so by pulling a daily summary report from Google through the Adwords API and storing the data in your data mart or data warehouse for further analysis.

Pulling the data through the API involves 6 steps:
   1. DefinedReportJob object - specifies the type of report you want.
   2. validateReportJob - validates the report job you would like to run. Returns nothing.
   3. scheduleReportJob - schedules the job for the above object. This request returns a job ID.
   4. getReportJobStatus - gets the status of the report job. When it is "Completed", go to the next step.
   5. getReportDownloadUrl or getGzipReportDownloadUrl - retrieves the URL where your (plain or gzipped) report is stored. The returned URL is valid for five minutes; if it expires, just call it again. Keep this URL private, since anyone can use it to access your report.
   6. Download your report from that URL by sending an HTTP GET request.

The following pseudo code and code explain each step in more detail. I use Perl SOAP calls to generate the Adwords requests. Also note that the code works with API version v13; Google is transitioning much of its Adwords API to v2009, but as of Dec. 2009 ReportService was still available only in v13.

Google NS Individual Search Box

Today, while searching for the keyword "cygwin", I was presented with results that caught my eye. The first result had its own search box. See the picture below. Has anyone seen this before? I liked the feature; it is pretty handy if you are looking for specific information from a specific site. I entered "latest version" and hit "Search cygwin.com", which showed the search entry "latest version site:cygwin.com" in the next search. Nice.


Other posts that may be of interest:
1. Best Google home page logos
2. Better quicker SEM ROI
3. 99 bottles of beer in different languages






Cheers,
Shiva

Forecasting SEM ROI with Eureqa

There is a powerful tool for detecting equations and hidden relationships among data sets. What makes it irresistible is its cost - free! It is Eureqa. In this post I will explain how to find the relationship between cost and long-term revenue for a keyword campaign in Google Adwords. This is a simplified version of identifying a relationship at the macro level before drilling down further into the details.

The sample I have taken is just over two months of data summarized at the daily level for one campaign. Since cost and revenue are affected by other factors, including the keywords or ads that comprise a campaign, competitors' bidding, landing page or page optimizations, etc., it is important to note that the following is meant to give quick insight into how cost relates to revenue for a particular campaign. It is one of a few important tools for achieving better, positive ROI.

The data set is shown below in the application's tab 1 (Enter Data). This looks very much like an Excel spreadsheet! The first column is cost and the second is long-term revenue.

Related Posts: Generating quick positive SEM ROI

Quick ROI for SEM (Adwords) Defined

SEM, Internet paid advertising, PPC, Adwords, or whatever one may call it, has been growing at a high rate for many years now and will likely continue to grow in the near future, with the potential to grab market share from other advertising channels. This has implications for all companies that would like to advertise their products or services. By being nimble and responding quickly to market changes, they can spend their marketing dollars effectively with positive ROI.


Though I use Google Adwords in the following explanation, the same logic applies to Yahoo and MSN. I assume that you are somewhat familiar with natural search, paid search, clicks, impressions and related terms, and what they mean.


Related posts: Quick ROI forecasting technique