Boundary Value Query in Sqoop

Sqoop gauges its workload
Sqoop performs imports in parallel. By default it uses 4 mappers, which means the import is divided into four tasks. Sqoop uses a splitting column of the RDBMS table to divide the workload, and by default it picks the table's primary key column. The range between the lowest and highest values of that column is divided evenly among the mappers.

For example, suppose a table has 100 rows identified by the primary key emp_id, with values 1 to 100. With 4 mapper tasks, Sqoop divides the work as follows:
emp_id 1 to 25 – One Mapper Task
emp_id 26 to 50 – One Mapper Task
emp_id 51 to 75 – One Mapper Task
emp_id 76 to 100 – One Mapper Task
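The even division in the example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Sqoop's own code; the function name split_ranges and its parameters are made up for this sketch, and it assumes an integer key.

```python
def split_ranges(low, high, num_mappers):
    """Divide the inclusive key range [low, high] into near-equal chunks.

    Hypothetical helper for illustration only; it is not part of Sqoop.
    """
    total = high - low + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = low
    for i in range(num_mappers):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: nothing left to assign
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(split_ranges(1, 100, 4))
# [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each tuple is the first and last emp_id handled by one mapper task.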

What is Boundary Value Query in Sqoop
Sqoop runs each mapper task by executing a SQL statement of the form SELECT * FROM table WHERE id >= low AND id < high. To find the boundaries for creating the splits, Sqoop first runs a query that selects the minimum and maximum values of the splitting column. This query is known as the Boundary Value Query.
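A rough Python sketch of these two steps follows: first build the query that fetches the minimum and maximum, then build one range query per mapper. The function names are hypothetical, the key is assumed to be an integer, and the exact SQL that Sqoop generates may differ. Because the ranges are half-open, the cut points here fall at 25, 50 and 75, and the last range is closed so that the maximum value is included.

```python
def boundary_query(table, column):
    # Query that returns the lowest and highest value of the splitting column.
    return f"SELECT MIN({column}), MAX({column}) FROM {table}"


def split_queries(table, column, low, high, num_mappers):
    """Build one SELECT per mapper from the boundary values.

    Illustrative only; assumes integer keys and high - low >= num_mappers.
    """
    # Evenly spaced cut points between low and high (integer arithmetic).
    bounds = [low + (high - low) * i // num_mappers
              for i in range(num_mappers + 1)]
    queries = []
    for i in range(num_mappers):
        # The last range uses <= so the maximum key is not dropped.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        queries.append(
            f"SELECT * FROM {table} WHERE {column} >= {bounds[i]} "
            f"AND {column} {upper_op} {bounds[i + 1]}"
        )
    return queries


print(boundary_query("employees", "emp_id"))
for query in split_queries("employees", "emp_id", 1, 100, 4):
    print(query)
```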

Unbalanced Mapper Task
If the values of the table's primary key are not uniformly distributed across its range, Sqoop will create unbalanced mapper tasks, and the resulting files in HDFS may hold unequal amounts of data. In this case we need to explicitly specify a splitting column with the --split-by argument.

sqoop import --connect jdbc:mysql://localhost/sample --username root --password root --table employees --split-by emp_id
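To see why a skewed key leads to unbalanced tasks, the sketch below counts how many rows land in each evenly sized range. The key values are invented for the illustration (90 consecutive ids plus a single outlier), and the helper names are hypothetical; this models the idea only and does not call Sqoop.

```python
from bisect import bisect_right


def split_bounds(low, high, num_mappers):
    # Evenly spaced cut points between the lowest and highest key.
    return [low + (high - low) * i // num_mappers
            for i in range(num_mappers + 1)]


def rows_per_split(keys, num_mappers):
    """Count the rows that fall into each mapper's key range."""
    bounds = split_bounds(min(keys), max(keys), num_mappers)
    counts = [0] * num_mappers
    for key in keys:
        # Find the range whose lower bound is <= key; the maximum key
        # is assigned to the last range.
        index = min(bisect_right(bounds, key) - 1, num_mappers - 1)
        counts[index] += 1
    return counts


# Made-up data: ids 1..90 plus one outlier at 1000.
skewed_keys = list(range(1, 91)) + [1000]
print(rows_per_split(skewed_keys, 4))
# [90, 0, 0, 1]
```

With these values one mapper receives almost every row while two receive none, which is the situation where choosing a better distributed column for --split-by helps.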

At present, Sqoop cannot split on multi-column indices. This means that if the table has no primary key, or has a multi-column key, we need to specify the splitting column manually.
