Using Hive for Small Datasets on my Mac with Docker
- Authors
- Name
- Ruan Bekker
- @ruanbekker
I wanted to process a small subset of data without spinning up a cluster, so I used the nagasuga/docker-hive
Docker image to run Hive on my Mac.
Running Hive
$ docker run -it -v /home/me/resource-data.csv:/resource-data.csv nagasuga/docker-hive /bin/bash -c 'cd /usr/local/hive && ./bin/hive'
hive>
Once I was in the Hive shell, I created a table for my CSV data:
Creating the Table
hive> create table resources (ResourceType STRING, Owner STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
hive> show tables;
OK
resources
hive> describe resources;
OK
resourcetype string
owner string
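Because the table declares two STRING columns split on a comma, a quick check of the CSV on the host can catch rows that would not fit before they are loaded. This is a minimal sketch, assuming a plain two-column file with no header row and no quoted commas (the delimited row format splits on every comma). The name count_bad_rows is a hypothetical helper, not part of Hive or the Docker image:

```shell
#!/bin/sh
# Hypothetical helper (not part of Hive): print how many lines in a CSV
# do not have exactly two comma-separated fields.
# Assumption: no header row and no quoted commas in the data.
count_bad_rows() {
  # NF is the number of fields on the current line when splitting on ','.
  # "bad + 0" forces a numeric 0 when no bad line was seen.
  awk -F',' 'NF != 2 { bad++ } END { print bad + 0 }' "$1"
}

# Usage (path taken from the docker command in this post):
#   count_bad_rows /home/me/resource-data.csv
```

A clean file prints 0. Anything else is worth inspecting before running the load.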
Loading the Data
My CSV data is located at /resource-data.csv
in the container, so I load it into my table:
hive> load data local inpath '/resource-data.csv' into table resources;
Loading data to table default.resources
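The create and load steps can also be run without the interactive shell, since the Hive CLI accepts a file of statements via -f. This is a rough sketch rather than a tested recipe, assuming the image keeps Hive in /usr/local/hive as in the docker command earlier in this post. The names write_hql, run_hql and /setup.hql are hypothetical, and the count(*) query is only there as a quick check that rows arrived:

```shell
#!/bin/sh
# Hypothetical helper: write the statements from this post into a .hql file.
write_hql() {
  cat > "$1" <<'EOF'
create table if not exists resources (ResourceType STRING, Owner STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/resource-data.csv' into table resources;
select count(*) from resources;
EOF
}

# Hypothetical helper: run that file in a throwaway container.
# Assumptions: both arguments are absolute host paths (docker -v needs them),
# and Hive lives in /usr/local/hive inside the image, as in this post.
run_hql() {
  csv="$1"
  hql="$2"
  docker run --rm \
    -v "$csv":/resource-data.csv \
    -v "$hql":/setup.hql \
    nagasuga/docker-hive /bin/bash -c 'cd /usr/local/hive && ./bin/hive -f /setup.hql'
}

# Usage:
#   write_hql /tmp/setup.hql
#   run_hql /home/me/resource-data.csv /tmp/setup.hql
```

Since each run starts a fresh container, any further queries would have to go into the same file. The count it prints should match the number of lines in the CSV.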
Query the Data
Just two simple queries for demonstration:
hive> select * from resources limit 3;
OK
EC2 Engineering
EC2 Finance
EC2 Product
hive> select count(resourcetype) as num, owner from resources group by owner order by num desc limit 3;
OK
50 Engineering
20 Product
10 Finance
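As a sanity check, the grouped count can be reproduced on the host with standard shell tools. This is a minimal sketch, assuming the same two-column CSV and that no row has a NULL resource type (count(resourcetype) skips NULLs, while this pipeline counts every line). The name top_owners is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: print the top N owners by row count from a
# two-column CSV (ResourceType,Owner), as "count owner" lines.
top_owners() {
  file="$1"
  n="${2:-3}"   # default to the top 3, like "limit 3" in the query
  cut -d',' -f2 "$file" |  # keep only the Owner column
    sort |                 # group identical owners together
    uniq -c |              # count each group
    sort -rn |             # highest count first
    head -n "$n" |
    sed 's/^ *//'          # strip the padding that uniq -c adds
}

# Usage (path taken from the docker command in this post):
#   top_owners /home/me/resource-data.csv 3
```

The output should line up with what Hive returns, although owners with equal counts may come back in a different order.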
Resource:
Thanks to https://github.com/nagasuga/docker-hive