Wednesday, October 4, 2017

Sqoop Import Export

→ Import data from RDBMS to HDFS
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/database_name" \
  --username "root" \
  --password "cloudera" \
  --table customers \
  --target-dir "/user/cloudera/output" \
  -m 1    # or --num-mappers 1
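A quick way to check the result is to list the target directory and peek at one of the part files Sqoop wrote (with one mapper and the default text output, the file is part-m-00000):

hdfs dfs -ls /user/cloudera/output
hdfs dfs -cat /user/cloudera/output/part-m-00000 | head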

→ Import data from RDBMS to Hive

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/database_name" \
  --username "root" \
  --password "cloudera" \
  --table customers \
  --hive-import \
  --hive-database sqoop_import_hive_db \
  --hive-overwrite

Other useful import options (append any of these to an import command; a complete example follows below):

Compress the output, here with the Snappy codec:
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \

Replace NULLs with placeholder strings (string and non-string columns handled separately):
  --null-string "NULL_REPLACE" \
  --null-non-string "NON_STRING_NULL_REPLACE" \

Import only rows whose check column exceeds the last imported value:
  --incremental append \
  --check-column id \
  --last-value 6 \

Append new files to an existing target directory instead of failing:
  --append \
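Putting these together, a sketch of a compressed, incremental import (table, check column, and target directory are illustrative):

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "retail_dba" \
  --password "cloudera" \
  --table customers \
  --target-dir "/user/cloudera/customers_incr" \
  --incremental append \
  --check-column customer_id \
  --last-value 0 \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  -m 1

At the end of each run Sqoop prints the new --last-value to pass next time; saving the command as a Sqoop job (shown below) records it automatically.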

→ Export data from HDFS to RDBMS

sqoop export \
 --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
 --username retail_dba \
 --password cloudera \
 --table temp_null \
 --export-dir "/user/cloudera/sqoop_import_null" \
 --input-fields-terminated-by ',' \
 --input-lines-terminated-by '\n' \
 --input-null-string "NULL_REPLACE" \
 --input-null-non-string "NON_STRING_NULL_REPLACE"
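To confirm the export landed, count the rows on the MySQL side (a quick check; credentials as above):

mysql -u retail_dba -pcloudera retail_db -e "SELECT COUNT(*) FROM temp_null;"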


→ Export data from Hive to RDBMS

sqoop export \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "retail_dba" \
  --password "cloudera" \
  --table hive_export \
  --export-dir "/user/hive/warehouse/sqoop_import.db/departments" \
  -m 1 \
  --input-null-string nvl \
  --input-null-non-string -1 \
  --input-fields-terminated-by '\001' \
  --input-lines-terminated-by '\n'

Note: '\001' (Ctrl-A) is Hive's default field delimiter, so these settings parse data exported straight from a default-delimited Hive warehouse table.

Sqoop job creation, merging data and optimizing Sqoop jobs.

→ Data merging using sqoop

sqoop merge --merge-key department_id \
 --new-data "/user/cloudera/sqoop_merge" \
 --onto "/user/cloudera/sqoop_import/departments" \
 --target-dir "/user/cloudera/sqoop_import/sqoop_merge_target" \
 --class-name departments \
 --jar-file /tmp/sqoop-cloudera/compile/a9643b6f412b3e44e3a84e6f7ba0be1f/departments.jar
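The --class-name and --jar-file values point to the record class Sqoop generated for the table during an earlier import. If you no longer have them, a sqoop codegen run regenerates the jar and prints its path (a sketch; the compile directory hash will differ):

sqoop codegen \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "retail_dba" \
  --password "cloudera" \
  --table departments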

→ Create sqoop job

sqoop job --create sqoop_job \
  -- import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table orders \
  --target-dir "/user/cloudera/sqoop_job_dir" \
  --m 1

sqoop job --list

sqoop job --show sqoop_job

sqoop job --exec sqoop_job
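Note that --exec prompts for the database password on every run. One way to avoid the prompt (a sketch; the file path and job name are illustrative) is to keep the password in a restricted HDFS file and create the job with --password-file instead of --password:

echo -n "cloudera" | hdfs dfs -put - /user/cloudera/.password
hdfs dfs -chmod 400 /user/cloudera/.password

sqoop job --create sqoop_job_pw \
  -- import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password-file /user/cloudera/.password \
  --table orders \
  --target-dir "/user/cloudera/sqoop_job_dir2" \
  -m 1

The -n flag on echo matters: a trailing newline in the file would be read as part of the password.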

→ Sqoop job optimization
- Number of mappers
Increasing the number of mappers speeds up the transfer, because the import is split into chunks that run in parallel.
- Balanced load on mappers
Split on a column whose values are uniformly distributed (an integer key is preferred); each mapper then gets a balanced share of the rows and the transfer finishes faster.
--split-by "customer_id"
- Number of connections to the RDBMS
The number of mappers cannot be increased blindly (to 100 or more, say): each mapper opens its own database connection, so the RDBMS must allow that many concurrent connections or it becomes the bottleneck.
- Use --direct mode
If the RDBMS supports it, import with --direct: Sqoop then bypasses JDBC and streams data through the database's native dump utility (e.g. mysqldump), which is usually faster. A combined example follows this list. Two caveats:
1. --direct is only supported for MySQL and PostgreSQL.
2. Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns.
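A sketch combining these tuning options (the mapper count and split column are illustrative; size them to your data and your database's connection limit):

sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username "retail_dba" \
  --password "cloudera" \
  --table customers \
  --target-dir "/user/cloudera/customers_fast" \
  --split-by "customer_id" \
  --num-mappers 8 \
  --direct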
