Lab 3: Hive
For this lab,
1. You will investigate how Hive works: create, load, query, and store data in Apache Hive on both our HU cloud platform and the MapR sandbox.
2. You will compare Hive performance between our HU cloud platform and the MapR sandbox.
* Most of the content comes from https://learn.mapr.com/ with the permission of MapR Technologies.
Prerequisite:
For our HU cloud platform:
For the Hadoop cluster overview, see
http://hdfs-namenode-hadoop.apps.myhu.cloud/dfshealth.html#tab-overview
For accessing the Hadoop cluster nodes, see
https://master1.myhu.cloud:8443/console/project/hadoop/browse/pods
User name: hadoop, password: hadoop
To access the terminal of the name node, click “hdfs-namenode-0” and then click
“Terminal”.
1. Create a folder named with your student ID and work inside that folder.
2. To upload files, use “wget” or any other command you prefer.
3. If you have any problem or issue with the HU cloud, report it in your submission and
use Google Cloud or Amazon cloud instead.
4. If you have any problem or issue with the HU cloud, Google Cloud, or Amazon cloud,
report it in your submission and work only with the MapR sandbox.
For using the MapR sandbox,
Please download one of the MapR sandboxes listed below.
• VMware Course Sandbox: http://package.mapr.com/releases/v5.1.0/sandbox/MapR-Sandbox-For-Hadoop-5.1.0-vmware.ova
• VirtualBox Course Sandbox: http://package.mapr.com/releases/v5.1.0/sandbox/MapR-Sandbox-For-Hadoop-5.1.0.ova
For the installation, please refer to https://mapr.com/docs/52/SandboxHadoop/c_sandbox_overview.html
Logging in to the Command Line
● Before you get started, have the IP address of your Sandbox VM handy (it is displayed
on the VM console after startup).
● Next, use an SSH client such as PuTTY (Windows) or Terminal (Mac) to log in.
● Use user ID user01 and password mapr.
● For VMware use: $ ssh user01@<ipaddress>
● For VirtualBox use: $ ssh user01@127.0.0.1 -p 2222
For the MapR sandbox,
Connect to the Hive CLI
The lab files contain data and source code you will use to complete the lab exercises.
1. Log in to your cluster as user01 (password is mapr).
2. Position yourself in the /user/user01 directory in the cluster file system:
$ cd /mapr/MyCluster/user/user01
3. Then download and unzip the lab files:
$ wget http://course-files.mapr.com/DA4400-R1/DA440-LabFiles.zip
$ unzip DA440-LabFiles.zip
Connect to the Hive Shell
1. Run the Hive shell program by typing hive in a terminal connected to the MapR Sandbox.
[user01@maprdemo ~]$ hive
hive>
Note: You are now in the Hive CLI. For the remainder of this lab, the > prompt indicates a
Hive command in the Hive CLI, while the $ prompt indicates a bash command in your terminal.
You may find it useful to compose commands in a text editor rather than typing them
directly at the command line. Copying and pasting commands from a text editor allows you
to edit and save them, as well as to write longer queries.
2. Use the SHOW FUNCTIONS command to list available Hive Query Language functions.
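For example, SHOW FUNCTIONS lists every built-in function, and DESCRIBE FUNCTION prints a
one-line description of a specific one:
hive> SHOW FUNCTIONS;
hive> DESCRIBE FUNCTION avg;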
3. Type the following SQL data definition in the Hive shell:
CREATE TABLE ebay.auction
(openingBid FLOAT,finalBid FLOAT,itemType STRING,days INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
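Note: this statement assumes a database named ebay already exists. If it does not exist in
your sandbox, create it first; a minimal sketch:
hive> CREATE DATABASE IF NOT EXISTS ebay;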
4. Load the eBay auction data with the following command:
LOAD DATA LOCAL INPATH
'file:///user/user01/DA440-LabFiles/auctiondata.csv'
INTO TABLE ebay.auction;
Submission
5. Try querying this data with SQL commands you know. Do they work as you expect them to?
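For example, a couple of sketches using the columns defined above:
hive> SELECT * FROM ebay.auction LIMIT 10;
hive> SELECT itemType, COUNT(*), AVG(finalBid) FROM ebay.auction GROUP BY itemType;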
Create a Database
1. Use Hive data definition language (DDL) to create a database. Create the database in your home
directory, and name it the same as your user name. For example:
> CREATE DATABASE user01 LOCATION
'/user/user01/hive/user01.db';
2. Use the SHOW DATABASES command to list all the databases available in this Hive instance.
hive> SHOW DATABASES;
OK
default
user01
Time taken: 0.127 seconds, Fetched: 2 row(s)
You should see your new user01 database now.
Note: Hive Query Language (HQL) commands are shown in upper case. This is a
convention, not a requirement. HQL commands are case-insensitive, and may be
written in either upper or lowercase. All HQL statements must end with a semicolon.
3. Quit the Hive shell, and look at the database from your bash shell:
hive> quit;
$ hadoop fs -ls /user/user01/hive
You should see the user01 database.
Create a Simple Table
1. Log back in to the Hive shell:
$ hive
hive>
2. Create a location table inside the user01 database, with the following characteristics
(a sketch of the full statement follows the list):
• A station column, of type string
• A latitude column, of type integer
• A longitude column, of type integer
• A row format of delimited
• Fields terminated by comma
• Lines terminated by the line feed character
• Stored as a text file
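One way to write this (a minimal sketch; replace user01 with your own database name):
hive> CREATE TABLE user01.location (station STRING, latitude INT, longitude INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;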
3. Show the tables in your database:
hive> SHOW TABLES IN user01;
4. Show the characteristics of the table:
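A sketch of one way to do this (DESCRIBE FORMATTED shows additional detail, such as the
storage location):
hive> DESCRIBE user01.location;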
Submission: Command and the result of command (Screen capture)
5. Drop the table:
hive> DROP TABLE user01.location;
6. Recreate the table, but this time name the second column city instead of latitude.
7. Show the table you created.
8. Rename the city column to latitude:
hive> ALTER TABLE user01.location CHANGE COLUMN city latitude INT;
Create Partitioned and External Tables
Partitioning data can speed up queries and optimize results. Create the windspeed table as a partitioned
table with the following characteristics (a sketch follows the list):
• A year column, of type integer
• A month column, of type string
• A knots column, of type float
• A partition using station as the column, of type string
• Delimited row format
• Fields terminated by comma
• Lines terminated by linefeed
• Stored as a text file
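One way to write this (a minimal sketch; note that the partition column goes in the
PARTITIONED BY clause rather than the column list):
hive> CREATE TABLE user01.windspeed (year INT, month STRING, knots FLOAT)
      PARTITIONED BY (station STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;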
Create an External Table
Create an external table called temperature that uses a text file stored in your lab files folder, with the
following characteristics (a sketch follows the list):
• A station column, of type string
• A year column, of type integer
• A month column, of type string
• A celsius column, of type float
• Delimited row format
• Fields terminated by comma
• Lines terminated by linefeed
• Stored as a text file
• A location pointing to the temperature folder in your lab files folder
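One way to write this (a minimal sketch; the LOCATION path below is an assumption and
depends on where you unzipped the lab files):
hive> CREATE EXTERNAL TABLE user01.temperature (station STRING, year INT, month STRING, celsius FLOAT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/user/user01/DA440-LabFiles/temperature';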
Submission
Screen Capture for SELECT * FROM user01.temperature LIMIT 10;
Load Data into Tables
1. Use LOAD DATA to load the location table. Remember to replace user01 with your own
user ID in the file path, and in the database name if necessary.
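A sketch, assuming the location data file in the lab files folder is named location.csv
(check the actual file name under DA440-LabFiles first):
hive> LOAD DATA LOCAL INPATH 'file:///user/user01/DA440-LabFiles/location.csv'
      INTO TABLE user01.location;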
2. Load data into the partitioned table, windspeed. Since this table is partitioned, we’ll have to add
the PARTITION clause to the LOAD DATA command.
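A sketch for one partition, assuming a per-station data file (both the file name and the
station value are assumptions; repeat the command for each station):
hive> LOAD DATA LOCAL INPATH 'file:///user/user01/DA440-LabFiles/windspeed_cleanair.csv'
      INTO TABLE user01.windspeed
      PARTITION (station = 'Clean Air');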
3. You can also explore the warehouse directory using Hadoop FS commands to see
how the partitioned table is laid out. Exit Hive using QUIT;, then enter:
$ hadoop fs -ls /user/user01/hive/user01.db/windspeed
Submission
Screen Capture for $ hadoop fs -ls /user/user01/hive/user01.db/windspeed
Examine Databases and Tables
The location, windspeed and temperature tables should have data in them. If you are
familiar with SQL, run some basic queries on these tables. Here are some queries to try:
Submission
Screen Capture for SELECT * FROM location;
Screen Capture for SELECT count(*) FROM windspeed;
Screen Capture for SELECT * FROM windspeed LIMIT 20;
Screen Capture for SELECT * FROM temperature WHERE year = 2000;
Query Data with SELECT
1. First, explore the temperature table. This table holds the average monthly temperatures, in
degrees Celsius, from eight different weather stations in Antarctica over several decades.
2. Let's look at all the temperatures from January 1970, which is when Unix time began:
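A sketch of the kind of query to write (the value to match in the month column is an
assumption; look at a few rows of the table first to see how months are stored):
hive> SELECT * FROM user01.temperature WHERE year = 1970 AND month = 'Jan';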
Submission
Command and Screen capture for the command
3. Let's try the same query, but for July, when it is winter in Antarctica:
Submission
Command and Screen capture for the command
4. The weather station at the South Pole is called Clean Air, because very little man-made
pollution can be found there. Let's find the temperatures in July at the South Pole:
Submission
Command and Screen capture for the command
5. Find the average temperature in Antarctica in 1970:
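A sketch using the AVG aggregate function on the celsius column:
hive> SELECT AVG(celsius) FROM user01.temperature WHERE year = 1970;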
Submission
Command and Screen capture for the command
6. Run the query from step 5 in the HU Cloud.
Submission
Compare the performance between MapR and HU Cloud.
7. Find the hottest and coldest temperatures recorded in Antarctica:
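A sketch using the MAX and MIN aggregate functions:
hive> SELECT MAX(celsius), MIN(celsius) FROM user01.temperature;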
Submission
Command and Screen capture for the command
8. Run the query from step 7 in the HU Cloud.
Submission
Compare the performance between MapR and HU Cloud.