Incremental Load in Sqoop

Consistency of Data
The Apache Sqoop framework helps transfer data from an RDBMS to HDFS and from HDFS back to an RDBMS. Typically, the data in the RDBMS keeps changing: new rows are appended to the existing data, and existing rows are updated or deleted. After data is imported to HDFS, any modifications made on the database-table side must be carried over so that the copy in HDFS stays consistent.

That means the imported data in Hadoop has to stay in sync with the source data in the RDBMS. When the source data is updated, a complete fresh import is NOT an optimal way to resynchronize the Hadoop side. Sqoop has a facility to cater to exactly this need: the incremental load option of the Sqoop import command.
Delta data imports
The ideal process in a real-world scenario is to synchronize only the delta data (new or modified rows) from the RDBMS to Hadoop. Sqoop's incremental load supports two modes in the import command:
1. Append mode, for tables where rows only get inserted.
2. Last-modified mode, for tables where rows get inserted as well as updated.
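For example, an append-mode import might look like the following sketch. The connection string, credentials, table name (orders), check column (order_id), and last value are all illustrative placeholders, not values from this post:

```shell
# Incremental append: import only rows whose check-column value exceeds
# the last value recorded from the previous run.
# All connection details and names below are illustrative assumptions.
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental append \
  --check-column order_id \
  --last-value 1000
```

After the job finishes, Sqoop prints the new `--last-value` to use for the next run (or tracks it automatically when the import is stored as a saved job).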
Importing incremental data with the last-modified mode option
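A last-modified import can be sketched as below. It requires a timestamp column that the source database updates on every insert or update; the column and connection details here are illustrative assumptions:

```shell
# Incremental lastmodified: import rows inserted or updated since the
# timestamp given as --last-value.
# Connection details, table, and column names are illustrative assumptions.
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value "2015-06-01 00:00:00"
```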
Workaround for delta data import
Sqoop saves imported data in HDFS under a directory named after the RDBMS table. When the last-modified mode imports the delta data, it tries to write to that same directory, which already exists in HDFS from the previous import, and the job fails because HDFS does not allow writing to an existing target directory.

Here is a workaround to get the complete, updated data in HDFS:
1. Move the existing HDFS data to a temporary folder.
2. Run a fresh import in last-modified mode.
3. Merge this fresh import with the old data saved in the temporary folder.
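The three steps above can be sketched with the HDFS shell and the sqoop-merge tool. Paths, the merge key, and the jar/class names (produced by an earlier Sqoop codegen run) are illustrative assumptions:

```shell
# 1. Move the existing HDFS data to a temporary directory.
hdfs dfs -mv /user/hadoop/orders /user/hadoop/orders_old

# 2. Run a fresh last-modified import into the original target directory.
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value "2015-06-01 00:00:00"

# 3. Merge the fresh import with the old data; for each merge-key value,
#    the newer record wins. The jar and class come from Sqoop codegen.
sqoop merge \
  --new-data /user/hadoop/orders \
  --onto /user/hadoop/orders_old \
  --target-dir /user/hadoop/orders_merged \
  --merge-key order_id \
  --jar-file orders.jar \
  --class-name orders
```

If your Sqoop version supports it, passing `--merge-key` directly on the last-modified import performs this merge automatically and avoids the "target directory already exists" error.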

6 thoughts on “Incremental Load in Sqoop”

  1. Not a great idea to merge the data. The incremental lastmodified option should modify the rows under the target directory, but I am not able to update the rows on HDFS. When I try an incremental import with the lastmodified option, I get a "target directory already exists" exception. I don't think merge is a great option. Kindly suggest another mechanism if there is one.

  2. Hi Ganpathy, this post is really very useful. In my case I have a table containing student_id, subject, and marks. There is no auto-increment id, primary key, or date-time field, but I would still like to pull only the delta. Can you give me some idea?


  4. I need your suggestion. I need to export one table's data to Hadoop (Hive), but my data (an Oracle table) changes when issues come in. I want to export the data to Hadoop every minute. It should not override the data last copied to the Hive table, only append to it. Could you please provide your comments?

  5. Need your suggestion on exporting data from Hive to Oracle. Can we do an incremental export? The Sqoop command doesn't allow it. Any other suggestion or idea?
