
Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull user access date and time along with other dimensions such as user, ip, and geo data. Here I will be using Hadoop/Hive/Python to pull date and ip data from an access log into Hadoop and run some queries. The example illustrates Hadoop (version 0.20.1) streaming, SerDe, and Hive's (version 0.4.0) custom mapper script (get_access_log_ip).

The steps below load a few thousand rows from an access log into a staging table (access_log_2010_01_25), then use a Python streaming script to extract the date, reducing a format like DD/Mon/YYYY:HH:MM:SS -0800 to 'DD/Mon/YYYY', along with the remote ip address, and load the result into a target table (dw_log_ip_test, the data warehouse access log).

Step 1: First create a table to hold the access log (access_log_2010_01_25) and then load data into it.

hive> 
CREATE TABLE access_log_2010_01_25 (
  request_date STRING,
  remote_ip STRING,
  method STRING,
  request STRING,
  protocol STRING,
  user STRING,
  status STRING,
  size STRING,
  time STRING,
  remote_host STRING,
  ts STRING,
  perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*)  time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s  time=%9$s  host=%10$s timestamp=%11$s  perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'   
    >   OVERWRITE INTO TABLE access_log_2010_01_25;
#- After the load, one of the records would look like:
#- 25/Jan/2010:13:14:05 -0800      123.123.123.123   GET     /xmls/public/thumbnail.xml   HTTP/1.1        -       302     250     0  abcd.com   1264454045    -
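
Step 2: Extract the date and remote ip through a Python streaming script. The original get_access_log_ip script is not reproduced here, so below is a minimal sketch of what such a mapper could look like; the file path and the target table's column names are illustrative. Hive's TRANSFORM streams the selected columns to the script as tab-separated lines on stdin, and the script writes the trimmed date and ip back to stdout.

#!/usr/bin/env python
# get_access_log_ip.py -- minimal sketch of a streaming mapper.
# Reads tab-separated (request_date, remote_ip) rows from stdin,
# trims '25/Jan/2010:13:14:05 -0800' down to '25/Jan/2010', and
# emits the date and ip as a tab-separated pair on stdout.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) < 2:
        continue
    request_date, remote_ip = fields[0], fields[1]
    access_date = request_date.split(':', 1)[0]   # keep only DD/Mon/YYYY
    print('%s\t%s' % (access_date, remote_ip))

Register the script and populate the target table:

hive> ADD FILE /mnt/scripts/get_access_log_ip.py;
hive> CREATE TABLE dw_log_ip_test (access_date STRING, remote_ip STRING);
hive> INSERT OVERWRITE TABLE dw_log_ip_test
    > SELECT TRANSFORM (request_date, remote_ip)
    >     USING 'python get_access_log_ip.py'
    >     AS access_date, remote_ip
    > FROM access_log_2010_01_25;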

Hive Metastore Derby DB

If you are using Hive in its default mode, you may see the following behavior: you start the hive client from different directories and get different results when you run a query like "show tables". For example, suppose you have Hive installed in /usr/local/hive, you are currently in your home directory, and you run

~> /usr/local/hive/bin/hive    #-- get to hive
hive> create table new_table (c1 string);
hive> show tables;

Now you will see "new_table" in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #-- get to hive
hive> show tables;

Now you don't see "new_table" in your list of tables. Those who come from a typical SQL background may find this a little weird at first, since the results seem to differ depending on where you started the hive client. The reason is that Hive uses an embedded Derby database to store its metadata, and one of the default configuration properties tells it to store the metastore_db in the current working directory.

On starting hive from two different directories as above, you would see two "metastore_db" directories created, one in your home (~) directory and one in /tmp. You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in the "/usr/local/hive/conf/hive-default.xml" file as below.
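
For example, after the two sessions above you can see both copies:

~> ls -d ~/metastore_db /tmp/metastore_db   #-- one metastore_db per start-up directory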

Default setting:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

Update it to:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>

"/home/hadoop" is an example and one can appropriately change it to suitable host and directory.  Say,
<value>jdbc:derby:;databaseName=//localhost:1234/metastore_db;create=true</value>
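
Note that a host:port URL like this assumes a Derby network server is running on that host, and that Hive is configured with the Derby client driver ("javax.jdo.option.ConnectionDriverName" set to org.apache.derby.jdbc.ClientDriver, with derbyclient.jar on the classpath). A sketch of starting the server, assuming a standard Derby install under /usr/local/derby:

~> /usr/local/derby/bin/startNetworkServer -p 1234 &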

Cheers,
Shiva