The steps below load a few thousand rows from a source table "access_log_2010_01_25" (a web-server access log) into a target table (dw_log_ip_test, a data-warehouse access-log table), extracting the date from a string like DD/Mon/YYYY:HH:MM:SS -0800 down to 'DD/Mon/YYYY', along with the remote IP address, through a Python streaming script.
Step 1: First create a table for the access log (access_log_2010_01_25) and then load the data into it.
hive>
CREATE TABLE access_log_2010_01_25 (
request_date STRING,
remote_ip STRING,
method STRING,
request STRING,
protocol STRING,
user STRING,
status STRING,
size STRING,
time STRING,
remote_host STRING,
ts STRING,
perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*) time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s time=%9$s host=%10$s timestamp=%11$s perf=%12$s"
)
STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'
> OVERWRITE INTO TABLE access_log_2010_01_25;
#-- After loading the data, one of the records would look like:
#-- 25/Jan/2010:13:14:05 -0800 123.123.123.123 GET /xmls/public/thumbnail.xml HTTP/1.1 - 302 250 0 abcd.com 1264454045 -
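Before loading at scale, it can help to sanity-check the SerDe pattern outside Hive. The snippet below is a minimal standalone sketch: it applies the same "input.regex" with Python's re module (whose syntax agrees with Java's for this pattern) to a hypothetical raw log line reconstructed from the parsed record above, and prints the twelve captured fields.

#!/usr/bin/python
import re

#-- Same pattern as "input.regex" in the CREATE TABLE above.
PATTERN = (r'\[([^]]+)\] ([^ ]*) "([^ ]*) ([^ ]*) ([^ "]*)"'
           r' user=([^ ]*) status=([^ ]*) size=([^ ]*) time=([^ ]*)'
           r' host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)')

#-- Hypothetical raw line matching the parsed record shown above.
raw = ('[25/Jan/2010:13:14:05 -0800] 123.123.123.123 '
       '"GET /xmls/public/thumbnail.xml HTTP/1.1" user=- status=302 '
       'size=250 time=0 host=abcd.com timestamp=1264454045 perf=-')

m = re.match(PATTERN, raw)
if m:
    for i, field in enumerate(m.groups(), 1):
        print('%2d: %s' % (i, field))
else:
    print('no match -- check the pattern against your raw log format')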
Step 2: Create a target test table
hive> CREATE TABLE dw_log_ip_test (dt string, remote_ip string);
Step 3: In an editor of your choice, build a simple Python script (get_access_log_ip.py) that extracts the "date" string from the "date/time" string, along with the "remote_ip" address, as below.
#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    fields = line.split('\t')
    dt = fields[0].split(':')[0]  #-- Get date portion, e.g. 25/Jan/2010
    ip = fields[1]                #-- Get remote IP
    #-- Emit exactly two tab-separated columns; the print dt,"\t",ip
    #-- form would add stray spaces around the tab, polluting both columns.
    print '%s\t%s' % (dt, ip)
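Before wiring the script into Hive, it can be smoke-tested at the shell by piping in a tab-separated line shaped like the two columns it will receive (printf keeps the tab literal):

$ printf '25/Jan/2010:13:14:05 -0800\t123.123.123.123\n' | python get_access_log_ip.py
25/Jan/2010	123.123.123.123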
Step 4: Now extract the data into the dw_log_ip_test table, loading only a limited slice of data (a 10-second window).
hive> FROM access_log_2010_01_25 L
> INSERT OVERWRITE TABLE dw_log_ip_test MAP L.request_date, L.remote_ip
> USING '/home/hadoop/hive/etl/scripts/get_access_log_ip.py' AS dt, remote_ip
> WHERE L.request_date > '25/Jan/2010:13:11:40'
> and L.request_date < '25/Jan/2010:13:11:50';
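Note that the USING clause above references an absolute local path, which must exist on every task node. A common alternative (standard Hive; the same script path is assumed here) is to ship the script to the cluster with ADD FILE, after which USING can reference just the bare file name 'get_access_log_ip.py':

hive> add file /home/hadoop/hive/etl/scripts/get_access_log_ip.py;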
#-- For the INSERT above, Hive outputs progress information like:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201001251623_0094, Tracking URL = http://hadoop_main:50030/jobdetails.jsp?jobid=job_201001251623_0094
Kill Command = /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=hdfs://hadoop_main:9001 -kill job_201001251623_0094
2010-02-03 18:42:40,793 Stage-1 map = 0%, reduce = 0%
2010-02-03 18:42:53,874 Stage-1 map = 50%, reduce = 0%
2010-02-03 18:43:05,981 Stage-1 map = 100%, reduce = 0%
2010-02-03 18:43:09,012 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201001251623_0094
Ended Job = -416940662, job is filtered out (removed at runtime).
Launching Job 2 out of 2
…
…
Loading data to table dw_log_ip_test
11110 Rows loaded to dw_log_ip_test
OK
#-- Now check the results...
hive> select dt, remote_ip from dw_log_ip_test;
hive> select dt, remote_ip, count(1)
> from dw_log_ip_test
> group by dt, remote_ip;
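To eyeball the heaviest hitters, the same aggregate can be sorted by count (a minimal variation on the query above; ORDER BY and LIMIT support may depend on your Hive version):

hive> select dt, remote_ip, count(1) as cnt
> from dw_log_ip_test
> group by dt, remote_ip
> order by cnt desc
> limit 10;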
Enjoy,
Shiva
Comments:

Q: Hi, I am new to Hive and want to use its Java-based API, but I am not able to find proper documentation on the net. Please provide info on how to go about using Hive from a program.

A: Hope you have found relevant articles/documents on Hive by now; please google for them.

Q: Hi, I am new to Hive. Please assist me with uploading data directly from a running script.
email: jkanniappan@perisoftware.com
Name: Jagan.k