Hourly web log analysis through Hadoop

Often one wants to parse web logs for quick analysis of A/B tests, security or fraud alerts, or recent advertising and campaign performance. Many applications and utilities perform web log analysis, but more often than not regular expressions provide a powerful and elegant way to analyse these logs, and they are especially handy when dealing with massive, quickly rotating web logs. Check out this wiki for more general info on web analytics.

When each web log is multiple gigabytes, is moved to archive every couple of hours, and comes from a farm of hundreds or thousands of web servers, many vendor or third-party applications don't scale. Hadoop streaming with simple utilities can provide insights that would otherwise require a costly experiment.

Here is a regular expression used to extract the HOUR and the string of interest from the Apache web logs. Each entry in the log has a format similar to the ones below.

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test1AB ....

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test-2CD ....
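As a minimal sketch of the idea, assuming the string of interest is the value of the ab= field seen in the sample entries, a Hadoop streaming mapper in Python could look roughly like this. The pattern and the names LOG_PATTERN, extract_hour_and_ab and map_lines are illustrative assumptions, not the exact expression used on the original logs.

```python
import re

# Assumed pattern: capture the hour from a timestamp such as
# 01/Jun/2010:07:09:26 and the value of the ab= field later on the line.
LOG_PATTERN = re.compile(
    r"\d{2}/[A-Za-z]{3}/\d{4}:(\d{2}):\d{2}:\d{2}.*?\bab=(\S+)"
)


def extract_hour_and_ab(line):
    """Return (hour, ab_value) for a log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def map_lines(lines):
    """Yield tab-separated 'hour<TAB>ab_value<TAB>1' records, mapper style."""
    for line in lines:
        parsed = extract_hour_and_ab(line)
        if parsed is not None:
            hour, ab_value = parsed
            yield f"{hour}\t{ab_value}\t1"


# Small demonstration on entries shaped like the samples above.
sample_lines = [
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    "status=200 size=2326 ab=test1AB",
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    "status=200 size=2326 ab=test-2CD",
]
for record in map_lines(sample_lines):
    print(record)
```

When run as a streaming mapper, the same map_lines function would be fed sys.stdin instead of the sample list, and a reducer (or a simple sort and count) would sum the trailing 1s per hour and test name.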


For more details on Apache log configuration, take a look at its specs and custom logs. Along with the time of the site hit, user IP address, HTTP request method (GET or POST), page requested, and protocol used, one can configure the web server to log many more details, including the referer, user agent (browser), environment variables, etc.

Conversation Prism - An Image

As social media, social networking, advertising, and Internet marketing continue to evolve with new technologies, and as many companies create their own social groups, it all gets more complex and confusing. Often a picture explains something more elegantly than 1000 words or more, and in some cases an image is the best-suited tool for the job. Here is one image, created by Brian Solis & Jesse Thomas, that I like on this subject!