Apache Sqoop – Only Mappers with No Reducers

Apache Sqoop – The Bridge between RDBMS and Hadoop
Apache Sqoop can import data from an RDBMS into HDFS and export it back. It can be installed on any node in the Hadoop cluster; in production it is typically installed on a separate node. A best practice is to update the .bashrc file after extracting and configuring the Sqoop binaries, so that Sqoop can be run from anywhere on the node's command line interface.
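A minimal .bashrc addition might look like the following sketch; the installation path /usr/local/sqoop is an assumption and should match wherever the Sqoop archive was actually extracted.

export SQOOP_HOME=/usr/local/sqoop   # assumed extraction path
export PATH=$PATH:$SQOOP_HOME/bin    # makes the sqoop command available from any directory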

Sqoop connects to the RDBMS through a JDBC driver and first retrieves the metadata of the database table. After querying the table metadata, Sqoop submits a MapReduce job to whichever Hadoop cluster is connected to the setup. Sqoop submits only mapper tasks; there is no reducer. This job is the actual data transfer: the database table's rows are sliced into splits and handed to the mappers, so the transfer happens in parallel across the mappers in Hadoop.
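Since the import depends on this JDBC metadata step, it can be useful to verify connectivity and table visibility first. One quick check is Sqoop's list-tables tool; the connection URL and credentials here mirror the import example later in this article and are assumptions for your environment:

sqoop list-tables --connect jdbc:mysql://localhost/sample --username root --password root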
Output in HDFS
The typical Sqoop import command looks like the following:

sqoop import --connect <jdbc-connection-url> --table <table-name>

In this command:
sqoop     – the Sqoop executable
import    – the tool, i.e. the operation to be performed
--connect – the connection property, taking the JDBC connection URL as its argument
--table   – the source table name, usually accompanied by further Sqoop arguments

In this command we do not specify where in HDFS the output files should be stored, or what they should be named. Sqoop takes care of this automatically by creating a directory named after the RDBMS table and saving the table data into it. The output is stored as four part files, because Sqoop runs four mappers by default and splits the data between them using a boundary value query. After importing the RDBMS table "departments", for example, Sqoop creates a "departments" directory holding the four part files, as sketched below.
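As a sketch, a default four-mapper import and a listing of the resulting directory would look like this (the connection details are the same assumed values as in the example further below):

sqoop import --connect jdbc:mysql://localhost/sample --username root --password root --table departments
hadoop fs -ls departments   # expect part-m-00000 through part-m-00003 plus a _SUCCESS marker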

Control the output
Sqoop's output location in HDFS can be controlled with the --target-dir argument, as below.

hduser@ubuntu:~/hadoop$ sqoop import --connect jdbc:mysql://localhost/sample --username root --password root --table employees --where "emp_id > 3" --target-dir /user/hduser/todaytest -m 1

-m 1 indicates a single mapper instead of Sqoop's default of four, so Sqoop stores the table content in a single output file. The tail of the Sqoop log can be seen below:

14/12/04 03:59:23 INFO mapred.JobClient: Running job: job_201412040333_0005
14/12/04 03:59:24 INFO mapred.JobClient: map 0% reduce 0%
14/12/04 04:00:19 INFO mapred.JobClient: map 100% reduce 0%
14/12/04 04:00:39 INFO mapred.JobClient: Job complete: job_201412040333_0005
14/12/04 04:00:39 INFO mapred.JobClient: Counters: 18
14/12/04 04:00:39 INFO mapred.JobClient:   Job Counters
14/12/04 04:00:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=65070
14/12/04 04:00:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/04 04:00:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/12/04 04:00:39 INFO mapred.JobClient:     Launched map tasks=1
14/12/04 04:00:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/12/04 04:00:39 INFO mapred.JobClient:   File Output Format Counters
14/12/04 04:00:39 INFO mapred.JobClient:     Bytes Written=729
14/12/04 04:00:39 INFO mapred.JobClient:   FileSystemCounters
14/12/04 04:00:39 INFO mapred.JobClient:     HDFS_BYTES_READ=87
14/12/04 04:00:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=64025
14/12/04 04:00:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=729
14/12/04 04:00:39 INFO mapred.JobClient:   File Input Format Counters
14/12/04 04:00:39 INFO mapred.JobClient:     Bytes Read=0
14/12/04 04:00:39 INFO mapred.JobClient:   Map-Reduce Framework
14/12/04 04:00:39 INFO mapred.JobClient:     Map input records=21
14/12/04 04:00:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=41598976
14/12/04 04:00:39 INFO mapred.JobClient:     Spilled Records=0
14/12/04 04:00:39 INFO mapred.JobClient:     CPU time spent (ms)=1550
14/12/04 04:00:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=16252928
14/12/04 04:00:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=348475392
14/12/04 04:00:39 INFO mapred.JobClient:     Map output records=21
14/12/04 04:00:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
14/12/04 04:00:39 INFO mapreduce.ImportJobBase: Transferred 729 bytes in 79.9055 seconds (9.1233 bytes/sec)
14/12/04 04:00:39 INFO mapreduce.ImportJobBase: Retrieved 21 records.
hduser@ubuntu:~/hadoop$

And the output is stored in a single directory (named as given in the --target-dir argument). Inside that directory, one part file holds all the content.
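To verify the result, that single part file can be viewed directly; the file name below assumes the standard map-only naming convention:

hadoop fs -cat /user/hduser/todaytest/part-m-00000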
