RDBMS to Hive through Sqoop

Data from an RDBMS can be imported directly into Hive through the Sqoop import command. The Sqoop option --hive-import does this job. A typical Sqoop command that imports from an RDBMS into Hive looks like this:

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table <RDBMS Table Name> \
--hive-import

The --hive-import option creates the Hive table under the Hive warehouse directory.

Sqoop does the following jobs when importing data into Hive:
1. Creates the table metadata.
2. Creates the Hive table if it does not exist. (Sqoop converts each source column type to a Hive-compatible data type; string-based source types such as VARCHAR and CHAR are mapped to Hive's STRING.)
3. If the Hive table already exists, Sqoop imports the data directly into it.

Sqoop import to Hive

Image courtesy: http://www.severalnines.com/blog/archival-analytics-importing-mysql-data-hadoop-cluster-sqoop

Sqoop to Hive – what happens behind the scenes?
We already know that loading data into Hive involves two steps: first create the table, then load the data into it. The Sqoop import to Hive does the same thing.
1. Sqoop imports the data into HDFS at a temporary location.
2. Sqoop generates two queries: one to create the table and another to load the data from the temporary location.
3. To avoid data loading issues, this temporary location should not be the Hive warehouse directory.
4. The temporary location can be specified with the Sqoop option --target-dir, as in the sketch below.
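For example, a minimal sketch that stages the import outside the warehouse directory (the connection details and table name are placeholders, and /tmp/sqoop_staging is just an assumed staging path):

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table <RDBMS Table Name> \
--target-dir /tmp/sqoop_staging/<RDBMS Table Name> \
--hive-import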

--map-column-hive
Sqoop converts the source table column types to Hive-compatible data types. If this default mapping does not suit our needs, we can use the --map-column-hive option to override it.
Example
--map-column-hive emp-id=DECIMAL,status=STRING
Here emp-id is mapped to DECIMAL and the status column is mapped to STRING in Hive.
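As a sketch, the override from the example above can simply be appended to the import command (connection details remain placeholders):

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table <RDBMS Table Name> \
--hive-import \
--map-column-hive emp-id=DECIMAL,status=STRING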

Sqoop to Hive as a regular import
In a real-time environment where imports run on a regular schedule, we should not create a new Hive table each time. Instead, the Hive table already exists with data, and Sqoop appends the newly imported data to it.

However, we can use another very important Sqoop parameter, --hive-overwrite, which truncates the existing Hive table and loads only the newly imported data. This parameter is very helpful when we always want the Hive table to hold a fresh copy of the data.
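A minimal sketch of such a refresh run (placeholders as before) would be:

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table <RDBMS Table Name> \
--hive-import \
--hive-overwrite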

Other useful Sqoop parameters for regular imports are
--hive-partition-key
--hive-partition-value
Here --hive-partition-key is the name of the partition column and --hive-partition-value is its value. For example, if we run a Sqoop import for today's records, we can specify --hive-partition-value "2014-12-10", as in the sketch below.
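As an illustration (the partition column name load_date and the <date-column> filter are assumptions for this example, not Sqoop requirements), a daily import into a date partition could look like this:

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table <RDBMS Table Name> \
--where "<date-column> = '2014-12-10'" \
--hive-import \
--hive-partition-key load_date \
--hive-partition-value "2014-12-10"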

Sqoop to Hive – always create an external table in Hive
We are discussing regular Sqoop imports to Hive, and in this scenario creating an external table in Hive is a step we must follow. Hive has two table types: managed (internal) tables and external tables. I will write a detailed post on this concept later.

During a regular Sqoop import to Hive, we should create the external Hive table first (the data behind an external table remains in HDFS even if the table is dropped) and then import the data through Sqoop, as in the sketch below.
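One rough sketch of this pattern (the table name, columns, delimiter, and HDFS location are all assumptions for illustration): create the external table pointing at an HDFS location, then let the regular Sqoop import append new files into that location.

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS employees (
  emp_id INT,
  name STRING,
  status STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/external/employees'"

sqoop import \
--connect jdbc:mysql://<hostname>/<database-name> \
--username <username> \
--password <password> \
--table employees \
--fields-terminated-by ',' \
--target-dir /user/hive/external/employees \
--append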
