Consistency of Data
The Apache Sqoop framework transfers data from an RDBMS to HDFS and from HDFS back to an RDBMS. In a typical RDBMS, new rows keep being appended to the existing data, and existing rows are updated or deleted. Once data has been imported into HDFS, any later change to the source table must be reflected on the HDFS side to keep the two consistent.
That is, the imported data in Hadoop has to stay in sync with the source data in the RDBMS. When the source data changes, running a complete fresh import is not an optimal way to bring Hadoop back in sync. Sqoop caters to this need with the incremental load option of its import command.
Delta data imports
The ideal process in a real-world scenario is to synchronize only the delta data (new or updated rows) from the RDBMS to Hadoop. Sqoop's import command provides two incremental modes for this:
Append mode (--incremental append), for tables where rows are only inserted.
Last-modified mode (--incremental lastmodified), for tables where rows are inserted as well as updated.
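As a minimal sketch of an append-mode import, the script below pulls only the rows whose key is greater than the last value already imported. The JDBC URL, user, password file, table name, check column (order_id), directories, and the run helper are all hypothetical placeholders, not values from any real setup.

```shell
#!/bin/sh
# Hypothetical helper: print a command, then execute it only if the tool is installed.
run() {
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "(skipped: $1 not found on PATH)"
  fi
}

# Placeholder values (assumptions); replace with your own connection details.
JDBC_URL="jdbc:mysql://dbhost:3306/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
TARGET_DIR="/data/sales/orders"
LAST_ID="1000"   # highest order_id imported by the previous run

# Append mode: import only rows whose order_id is greater than LAST_ID.
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" \
  --password-file "$PASSWORD_FILE" \
  --table "$TABLE" \
  --target-dir "$TARGET_DIR" \
  --incremental append \
  --check-column order_id \
  --last-value "$LAST_ID"
```

At the end of an incremental import, Sqoop prints the value to pass as --last-value on the next run, so it is worth recording that value between runs.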
Importing incremental data with Last-modified mode option
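A last-modified import might look like the sketch below. It assumes the source table has a timestamp column (here called updated_at) that is set whenever a row is inserted or updated. The connection details, table, column, directory, timestamp, and the run helper are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a command, then execute it only if the tool is installed.
run() {
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "(skipped: $1 not found on PATH)"
  fi
}

# Placeholder values (assumptions); replace with your own connection details.
JDBC_URL="jdbc:mysql://dbhost:3306/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
DELTA_DIR="/data/sales/orders_delta"   # assumed not to exist yet in HDFS
LAST_RUN="2024-01-01 00:00:00"         # timestamp of the previous import

# Last-modified mode: import rows whose updated_at is newer than LAST_RUN.
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" \
  --password-file "$PASSWORD_FILE" \
  --table "$TABLE" \
  --target-dir "$DELTA_DIR" \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "$LAST_RUN"
```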
Workaround for delta data import
By default, Sqoop saves imported data in an HDFS directory named after the RDBMS table. A last-modified import fetches the delta data and tries to write it under that same name, which already exists in HDFS from the earlier import, so the job fails with an error instead of overwriting the existing data.
Here is a workaround to get the complete, updated data on the HDFS side:
1. Move the existing HDFS data to a temporary folder.
2. Run a fresh import in last-modified mode.
3. Merge the fresh import with the old data saved in the temporary folder.
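The three steps above can be sketched with the HDFS shell and the sqoop merge tool. This is an illustration under stated assumptions, not a definitive procedure: every path, the connection details, the columns order_id and updated_at, and the run helper are hypothetical, and the jar file and class name passed to sqoop merge are assumed to come from an earlier sqoop codegen or import run for the same table.

```shell
#!/bin/sh
# Hypothetical helper: print a command, then execute it only if the tool is installed.
run() {
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "(skipped: $1 not found on PATH)"
  fi
}

# Placeholder values (assumptions); replace with your own.
JDBC_URL="jdbc:mysql://dbhost:3306/sales"
DB_USER="etl_user"
PASSWORD_FILE="/user/etl/.db_password"
TABLE="orders"
BASE_DIR="/data/sales/orders"          # directory written by the earlier import
OLD_DIR="/data/sales/orders_old"       # temporary folder for the existing data
MERGED_DIR="/data/sales/orders_merged" # must not exist before the merge
LAST_RUN="2024-01-01 00:00:00"         # timestamp of the previous import
JAR_FILE="/tmp/sqoop-codegen/orders.jar"  # assumed output of a prior codegen run
CLASS_NAME="orders"                       # assumed generated record class

# Step 1: move the existing HDFS data to a temporary folder.
run hdfs dfs -mv "$BASE_DIR" "$OLD_DIR"

# Step 2: run a fresh last-modified import into the now-free directory.
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" \
  --password-file "$PASSWORD_FILE" \
  --table "$TABLE" \
  --target-dir "$BASE_DIR" \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "$LAST_RUN"

# Step 3: merge the fresh delta onto the old data; for each order_id
# the row from the newer dataset wins.
run sqoop merge \
  --new-data "$BASE_DIR" \
  --onto "$OLD_DIR" \
  --target-dir "$MERGED_DIR" \
  --jar-file "$JAR_FILE" \
  --class-name "$CLASS_NAME" \
  --merge-key order_id
```

In this sketch the merged result lands in a separate directory, so the old and delta directories can be checked before they are cleaned up. Depending on the Sqoop version, the import tool may also accept --merge-key together with --incremental lastmodified, which would fold the merge into the import itself; check the documentation for the version in use.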