
Hourly web log analysis through Hadoop

It is often useful to parse web logs for quick analysis of A/B tests, security or fraud alerts, or recent advertisements and campaigns. Many applications and utilities perform web log analysis, but more often than not regular expressions provide a powerful and elegant way to analyse these logs, and they are especially handy when dealing with massive, quickly rotating web logs. Check out this wiki for more general info on web analytics.

When each web log runs to multiple gigabytes, is moved to archive every couple of hours, and comes from a farm of hundreds or thousands of web servers, many vendor or third-party applications do not scale either. Hadoop streaming with simple utilities can provide insights from what would otherwise have been a costly experiment.
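Hadoop streaming runs any executable as a mapper or reducer: records arrive as lines on standard input, and tab-separated key/value lines are written to standard output, with reducer input arriving sorted by key. As a minimal sketch of the reduce side of an hourly count, assuming a mapper that emits lines of the form `key<TAB>1` (the key layout and the function name `sum_counts` are illustrative, not taken from the original post):

```python
from itertools import groupby


def sum_counts(sorted_lines):
    """Yield 'key<TAB>total' for each run of identical keys.

    Assumes the input is already sorted by key, as Hadoop streaming
    delivers it to a reducer, and that each line is 'key<TAB>count'.
    Lines without a tab are skipped.
    """
    parsed = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if "\t" in line
    )
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}\n"
```

In a reducer script this could be driven with `sys.stdout.writelines(sum_counts(sys.stdin))`.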

Here is a regular expression used to extract the HOUR and the string of interest from the Apache web logs. Each entry in the log has a format similar to the ones below.

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test1AB ....

01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test-2CD ....
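As a rough illustration, a Python pattern along the following lines could pull the hour and the `ab=` value out of entries shaped like the two samples above. The pattern itself, the names `LOG_PATTERN`, `extract_hour_and_ab` and `map_lines`, and the choice of the `ab=` field as the string of interest are all assumptions made for this sketch; a real log may need a stricter expression.

```python
import re

# Assumed layout: DD/Mon/YYYY:HH:MM:SS at the start of the line (with an
# optional leading '['), followed somewhere later by an ab=<value> token.
LOG_PATTERN = re.compile(
    r"^\[?\d{2}/\w{3}/\d{4}:(?P<hour>\d{2}):\d{2}:\d{2}\s.*?\bab=(?P<ab>\S+)"
)


def extract_hour_and_ab(line):
    """Return (hour, ab_value) for a matching log line, or None otherwise."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match.group("hour"), match.group("ab")


def map_lines(lines):
    """Yield 'hour:ab<TAB>1' records, the shape a streaming mapper would emit."""
    for line in lines:
        result = extract_hour_and_ab(line)
        if result is not None:
            hour, ab = result
            yield f"{hour}:{ab}\t1\n"
```

Used as a Hadoop streaming mapper, the script would end with `sys.stdout.writelines(map_lines(sys.stdin))`. Lines that do not match are silently dropped here; counting them under a separate key would be a reasonable alternative.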


For more details on Apache log configuration, take a look at its specs and custom logs. Along with the time of the site hit, the user's IP address, the HTTP request method (GET or POST), the page requested, and the protocol used, one can configure the web server to log many more details, including the referer, user agent (browser), environment variables, etc.