The steps below load few thousand rows into a target table (dw_log_ip_test – data warehouse access log) “access_log_2010_01_25” then extract date from format like DD/Mon/YYYY:HH:MM:SS -800 to ‘DD/Mon/YYYY’ along with remote ip address through a Python streaming script.
Step 1: First create a table to access log (access_log_2010_01_25) and then load data into it.
hive> CREATE TABLE access_log_2010_01_25 ( request_date STRING, remote_ip STRING, method STRING, request STRING, protocol STRING, user STRING, status STRING, size STRING, time STRING, remote_host STRING, ts STRING, perf STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*) time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)", "output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s time=%9$s host=%10$s timestamp=%11$s perf=%12$s" ) STORED AS TEXTFILE; hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1' > OVERWRITE INTO TABLE access_log_2010_01_25; #- After load the data in one of the record would look like: #- 25/Jan/2010:13:14:05 -0800 123.123.123.123 GET /xmls/public/thumbnail.xml HTTP/1.1 - 302 250 0 abcd.com 1264454045 -