Using Hive for Small Datasets on My Mac with Docker


I wanted to process a small subset of data without spinning up a cluster, so I used the nagasuga/docker-hive Docker image to run Hive on my Mac.

Running Hive

$ docker run -it -v /home/me/resource-data.csv:/resource-data.csv nagasuga/docker-hive /bin/bash -c 'cd /usr/local/hive && ./bin/hive'

Once I was in the Hive shell, I created a table for my CSV data:

Creating the Table

hive> create table resources (ResourceType STRING, Owner STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

hive> show tables;

hive> describe resources;
resourcetype        	string
owner               	string

Loading the Data

My CSV data is mounted at /resource-data.csv in the container, and I loaded it into my table from there:

hive> load data local inpath '/resource-data.csv' into table resources;
Loading data to table default.resources
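
Each docker run starts a fresh container, so the table and its data have to be recreated every session. One way to avoid retyping is to keep the statements in a script file and replay it with hive -f, a standard Hive CLI option. The following is only a sketch: the file name setup.hql, the helper run_setup, and the second volume mount are my own assumptions, not something the image documents.

```shell
#!/bin/sh
# Write the statements from this post into a script file.
# The file name setup.hql is a hypothetical choice.
workdir=$(mktemp -d)
hql="$workdir/setup.hql"
cat > "$hql" <<'EOF'
CREATE TABLE resources (ResourceType STRING, Owner STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/resource-data.csv' INTO TABLE resources;
SELECT * FROM resources LIMIT 3;
EOF

# Hypothetical helper: mounts the CSV and the script into the container,
# then runs the script with hive -f instead of opening an interactive shell.
# It is only defined here, not called.
run_setup() {
    docker run -v "$1":/resource-data.csv -v "$hql":/setup.hql \
        nagasuga/docker-hive \
        /bin/bash -c 'cd /usr/local/hive && ./bin/hive -f /setup.hql'
}

echo "Wrote $hql"
```

To try it, call run_setup with the absolute path of the CSV on the host, for example run_setup /home/me/resource-data.csv.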

Querying the Data

Just two simple queries for demonstration:

hive> select * from resources limit 3;
EC2	 Engineering
EC2	 Finance
EC2	 Product
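
The output above shows a space between the tab and the owner name, which suggests the CSV has a space after each comma and that the space ends up inside the Owner column. Assuming that is the case, here is a minimal sketch for stripping it before loading. The sample rows and file names are made up for illustration.

```shell
#!/bin/sh
# Made-up sample with a space after each comma, as the query output suggests.
workdir=$(mktemp -d)
raw="$workdir/resource-data.csv"
clean="$workdir/resource-data-clean.csv"
printf 'EC2, Engineering\nEC2, Finance\nEC2, Product\n' > "$raw"

# Remove any spaces that directly follow a comma.
sed 's/, */,/g' "$raw" > "$clean"

cat "$clean"
```

Mounting the cleaned file instead of the raw one should give owner values without the leading space. Alternatively, Hive's built-in trim() function can strip it at query time, as in select trim(owner) from resources.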

hive> select count(resourcetype) as num, owner from resources group by owner order by num desc limit 3;
50	 Engineering
20	 Product
10	 Finance
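
For a file this small, the grouped count can also be cross-checked outside Hive. Here is a sketch with awk and sort on made-up sample rows (not the data from this post). It counts rows per owner, which matches count(resourcetype) as long as no resource type is NULL.

```shell
#!/bin/sh
# Made-up sample: 3 Engineering rows, 2 Product rows, 1 Finance row.
workdir=$(mktemp -d)
csv="$workdir/sample.csv"
printf 'EC2,Engineering\nS3,Engineering\nEC2,Product\nEC2,Engineering\nRDS,Finance\nS3,Product\n' > "$csv"

# Count rows per owner (column 2), print "count<TAB>owner",
# then sort by count in descending order and keep the top 3.
counts=$(awk -F',' '{ n[$2]++ } END { for (o in n) print n[o] "\t" o }' "$csv" \
    | sort -k1,1nr | head -n 3)

echo "$counts"
```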


