RDBMS to HBase through Sqoop

Sqoop is the Apache framework for transferring data between an RDBMS and HDFS, in both directions. HBase is Hadoop's NoSQL columnar database. We can pull data from an RDBMS and store it directly in HBase with the command below:

sqoop import \
--connect jdbc:mysql://hostname-or-ip/<dbname> \
--username <username> \
--password <password> \
--table <source_table_name> \
--hbase-table <hbase_table_name> \
--column-family <column_family_name> \
--hbase-row-key <attr1,attr2> \
--hbase-create-table

The following parameters are mandatory:
1. The HBase table name
2. A column family name within the table
3. The row key, i.e. the column(s) that identify the row into which you are inserting data
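As a concrete sketch, assume a hypothetical MySQL database `shop` on host `dbserver.example.com` with a `customers` table keyed on `id` (all names and credentials here are illustrative, not from the original post):

```shell
# Hypothetical example: import the MySQL table "customers" from database "shop"
# into an HBase table "customers" under column family "cf", using "id" as the row key.
sqoop import \
  --connect jdbc:mysql://dbserver.example.com/shop \
  --username sqoop_user \
  --password secret \
  --table customers \
  --hbase-table customers \
  --column-family cf \
  --hbase-row-key id \
  --hbase-create-table
```

Note that rows whose row-key column is NULL are skipped by the HBase import, so the row key should be a NOT NULL column such as a primary key.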
 

[Figure: sqoop-import-to-hbase]

Image courtesy : http://www.devx.com/Java/hadoop-sqoop-to-import-data-from-mysql.html

--column-family <column_family_name>
Specifies the column family into which Sqoop imports the data.

--hbase-table <table_name>
Specifies the name of the HBase table into which Sqoop imports the data.

--hbase-create-table
If this flag is specified, Sqoop creates the target HBase table if it does not already exist.

--hbase-row-key <col>
Specifies the input column(s) to use as the row key.

Row key option
By default, Sqoop does not store the column used as the row key as a regular cell value, so the data imported into HBase contains fewer columns than the source table. To keep the row key column in the imported values as well, enable the property below (importing from MySQL to HBase):

sqoop import \
-Dsqoop.hbase.add.row.key=true \
--connect jdbc:mysql://hostname-or-ip/<dbname> \
--username <username> \
--password <password> \
--table <source_table_name> \
--hbase-table <hbase_table_name> \
--column-family <column_family_name>
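To verify that the row key column now also appears as a regular cell, one can scan a few rows from the HBase shell. The table and column names below are hypothetical and follow the earlier example:

```shell
# Hypothetical check: scan 3 rows of the imported table in the HBase shell.
# With sqoop.hbase.add.row.key=true, the row key column (e.g. "id") should
# also show up as a cell under the column family.
echo "scan 'customers', {LIMIT => 3}" | hbase shell
```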

Sqoop import to HBase takes more time
A Sqoop import into HBase takes longer than a Sqoop import into a plain text file on HDFS. To achieve better performance, do the following before running the import:

1. Create the HBase table in advance
2. Pre-create more regions with the NUMREGIONS parameter

A new HBase table has only one region by default. Sqoop imports your data in parallel, but the parallel tasks bottleneck when they all insert into a single region. If more regions exist in HBase before the import, the data load is spread across the entire HBase cluster during the Sqoop import.
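A pre-split table can be created from the HBase shell before the import. The region count and split algorithm below are illustrative choices, and the table and column family names continue the hypothetical example:

```shell
# Hypothetical example: pre-create the target table with 10 regions so the
# parallel Sqoop tasks write to many regions instead of bottlenecking on one.
echo "create 'customers', 'cf', {NUMREGIONS => 10, SPLITALGO => 'HexStringSplit'}" | hbase shell
```

Since the table now already exists, omit the --hbase-create-table flag from the Sqoop command.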
