When each web log runs to multiple gigabytes, is moved to archive every couple of hours, and comes from a farm of hundreds or thousands of web servers, many vendor and third-party applications don't scale. Hadoop streaming with simple utilities can provide insights that would otherwise require a costly experiment.
Here is a regular expression used to extract the HOUR and the string of interest from Apache web logs. Each entry in the log has a format similar to the ones below.
01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test1AB ....
01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" status=200 size=2326 ab=test-2CD ....
For more details on Apache log configuration, take a look at its specs and custom logs. Along with the time of the site hit, user IP address, HTTP request method (GET or POST), page requested, and protocol used, one can configure the web server to log many more details, including referer, user agent (browser), environment variables, etc.
m/(?<=\d{4}:)(\d{2})(?=:\d{2}:\d{2}\s+[-\d{4}]).*?ab=\D+(\d{1}[a-zA-Z]{2,})\b/;
Though the above regular expression may look cryptic, it is straightforward: it extracts the hour and the string of interest to us.
(?<=\d{4}:) => Asserts that the current position is PRECEDED BY a four-digit string (in this case the year) and a ":". The opening "(?<=" marks a positive look-behind assertion: the engine checks the text to the left of the current position without consuming it. If two digits appear anywhere in the log line without being preceded by four digits (year) and a colon (:), it is a non-match.
(\d{2}) => Matches a two-digit string (HOUR). The parentheses around the two digits store them in a special variable (a capture group) for future use. As I would like to summarize at the hourly level, I concatenate these digits with the other matched string (1AB) to create a simple key=value pair. The value is simply an identity (1), which is later passed to the Hadoop reducer to count.
(?=:\d{2}:\d{2}\s+[-\d{4}]) => Similar to the positive look-behind, this tells the regular expression engine to look ahead from the two-digit hour string for a positive match. "(?=" means: from the current position, look ahead for a colon (:) followed by 2 digits (minutes), another colon (:) and 2 digits (seconds), followed by one or more whitespace characters (\s+), without consuming any of them.
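[-\d{4}] => A character class: it matches exactly one character that is "-", a digit, "{", "4" or "}". Inside square brackets "{4}" is not a quantifier, and the brackets do not make the token optional. Here it matches only the "-" that starts the time zone (-0500 above), which is enough for the look-ahead to succeed. To require the sign and all four digits, one could write [-+]\d{4} instead.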
Once the look-ahead succeeds and the hour is captured by the engine, it then continues with
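.*? => matches any character except a newline (denoted by "."), zero or more times (denoted by "*"), in non-greedy mode (denoted by "?"), until it gets to the "ab=" string. Because the look-ahead consumed nothing, this starts right after the hour. In non-greedy mode the engine first tries to match "ab=" at the current position and, each time that fails, lets ".*?" take one more character and tries again. It is the greedy ".*" that runs all the way to the end of the string and then backtracks to find "ab=".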
ab= => the literal name of the environment variable that is of interest to us, followed by
\D+ => one or more non-digit characters (such as "test" or "test-" above), until it tries to match
(\d{1}[a-zA-Z]{2,}) => a single digit followed by two or more letters (a to z, lower or upper case), stored as the second match for future use, until
\b => a word boundary: a boundary between a word character (a to z, A to Z, 0 to 9 and _) and a non-word character, or the start or end of the string. You can visualize this as a thin line between word characters and any non-word characters like "." or "-" or "$", etc.
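To check this breakdown against the two sample entries, here is a minimal sketch in Python rather than Perl; the pattern's syntax is accepted unchanged by Python's re module. The names PATTERN, SAMPLE_LINES and extract are illustrative assumptions, not part of the original parser.

```python
import re

# Same pattern as in the post, ported from Perl's m/.../ to Python's re module.
PATTERN = re.compile(
    r"(?<=\d{4}:)(\d{2})(?=:\d{2}:\d{2}\s+[-\d{4}]).*?ab=\D+(\d{1}[a-zA-Z]{2,})\b"
)

# The two sample log entries quoted in the post.
SAMPLE_LINES = [
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    'status=200 size=2326 ab=test1AB ....',
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    'status=200 size=2326 ab=test-2CD ....',
]


def extract(line):
    """Return (hour, test_string) for a log line, or None when nothing matches."""
    match = PATTERN.search(line)
    return (match.group(1), match.group(2)) if match else None


results = [extract(line) for line in SAMPLE_LINES]
# results == [('07', '1AB'), ('07', '2CD')]

# [-\d{4}] is a character class, so it consumes exactly one character.
class_matches_dash = re.fullmatch(r"[-\d{4}]", "-") is not None      # True
class_matches_zone = re.fullmatch(r"[-\d{4}]", "-0500") is not None  # False
```

The second sample shows why \D+ is used after "ab=": it absorbs both "test" and "test-" before the digit that starts the captured string.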
In the parser application, you can concatenate the first match (hour - 07) and the second matched string (test string - 1AB) into a key like "1AB_07" and set the value to "1".
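A mapper along these lines could look like the sketch below. It is written in Python as an assumption (the original parser is not shown), and map_line and run_mapper are hypothetical names. The key and value are separated by a tab, which is the separator Hadoop streaming uses by default.

```python
import io
import re

# Pattern from the post, ported to Python's re module.
PATTERN = re.compile(
    r"(?<=\d{4}:)(\d{2})(?=:\d{2}:\d{2}\s+[-\d{4}]).*?ab=\D+(\d{1}[a-zA-Z]{2,})\b"
)


def map_line(line):
    """Return 'TEST_HOUR<TAB>1' for a matching log line, or None otherwise."""
    match = PATTERN.search(line)
    if match is None:
        return None
    hour, test = match.group(1), match.group(2)
    return f"{test}_{hour}\t1"


def run_mapper(source, sink):
    """Read log lines from source and write one key/value record per match."""
    for line in source:
        record = map_line(line)
        if record is not None:
            sink.write(record + "\n")


# Small in-memory demonstration using the two sample entries from the post.
demo_in = io.StringIO(
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    'status=200 size=2326 ab=test1AB ....\n'
    '01/Jun/2010:07:09:26 -0500] - 127.0.0.1 "GET /apache_pb.gif HTTP/1.1" '
    'status=200 size=2326 ab=test-2CD ....\n'
)
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
# demo_out.getvalue() == "1AB_07\t1\n2CD_07\t1\n"
```

Used as an actual streaming mapper, the script would instead end with run_mapper(sys.stdin, sys.stdout) after importing sys. Lines that do not match are skipped silently, which is a design choice of this sketch.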
When the Hadoop mapper passes this to the reducer, the reducer simply has to count the 1s to get the sum of hourly hits of "1AB". You can then load this data into a table at the (date and) hourly level and maintain its history for further analysis. Similar analysis on web logs can yield IP-level, page-level and many other useful details.
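The counting step can be sketched as follows, again in Python as an assumption, with parse and run_reducer as hypothetical names. It relies on the fact that Hadoop streaming hands the reducer its input sorted by key, so all records for one key arrive next to each other.

```python
import io
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>value' record into (key, int(value))."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)


def run_reducer(source, sink):
    """Sum the values of consecutive records that share the same key."""
    pairs = (parse(line) for line in source if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(value for _, value in group)
        sink.write(f"{key}\t{total}\n")


# In-memory demonstration with already sorted mapper output.
demo_in = io.StringIO("1AB_07\t1\n1AB_07\t1\n1AB_08\t1\n2CD_07\t1\n")
demo_out = io.StringIO()
run_reducer(demo_in, demo_out)
# demo_out.getvalue() == "1AB_07\t2\n1AB_08\t1\n2CD_07\t1\n"
```

Used as an actual streaming reducer, the script would end with run_reducer(sys.stdin, sys.stdout). Each output line is an hourly hit count per test string, ready to be loaded into a table.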
Note: When using regular expressions, it helps to understand how backtracking works and its implications for performance. Backtracking can happen whenever one uses quantifiers such as "*, *?, +, +?, {n,m}, and {n,m}?". Take a look at interesting articles here and here.
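One visible effect of how the engine explores a quantifier is which occurrence of a token it ends up matching. The sketch below, in Python as an assumption, uses a made-up log fragment with two "ab=" tokens (the sample entries in the post contain only one) and a shortened version of the pattern; LAZY, GREEDY and line are illustrative names.

```python
import re

# Hypothetical fragment with two "ab=" tokens, invented for illustration.
line = "07:09:26 ab=test1AB size=2326 ab=test-2CD"

# Lazy: ".*?" grows one character at a time, so the first "ab=" wins.
LAZY = re.compile(r".*?ab=\D+(\d[a-zA-Z]{2,})\b")

# Greedy: ".*" runs to the end of the line and backtracks, so the last "ab=" wins.
GREEDY = re.compile(r".*ab=\D+(\d[a-zA-Z]{2,})\b")

lazy_hit = LAZY.search(line).group(1)      # '1AB'
greedy_hit = GREEDY.search(line).group(1)  # '2CD'
```

On long log lines the greedy form also has to give back characters one by one from the end of the line, which is one reason the choice of quantifier matters for performance.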