Import data from RDBMS to HDFS
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/database_name" \
--username "root" \
--password "cloudera" \
--table customers \
--target-dir "/user/cloudera/output" \
-m 1    (or --num-mappers 1)
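A map-only import writes one part file per mapper under the target directory (a single part-m-00000 here, since only one mapper was requested). A quick sanity check, assuming the import above completed:
hdfs dfs -ls /user/cloudera/output
hdfs dfs -cat /user/cloudera/output/part-m-00000 | head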
Import data from RDBMS to Hive
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/database_name" \
--username "root" \
--password "cloudera" \
--table customers \
--hive-import \
--hive-database sqoop_import_hive_db \
--hive-overwrite \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--null-string "NULL_REPLACE" \
--null-non-string "NON_STRING_NULL_REPLACE" \
--incremental append \
--check-column id \
--last-value 6
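With --incremental append, Sqoop imports only rows whose check column (id here) is greater than --last-value, and it prints the new high-water mark at the end of the run. A minimal sketch of the follow-up run; the target directory and the value 100 are assumptions, use the last value reported by the previous run:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/database_name" \
--username "root" \
--password "cloudera" \
--table customers \
--target-dir "/user/cloudera/customers_inc" \
--incremental append \
--check-column id \
--last-value 100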
Export data from HDFS to RDBMS
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table temp_null \
--export-dir "/user/cloudera/sqoop_import_null" \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n' \
--input-null-string "NULL_REPLACE" \
--input-null-non-string "NON_STRING_NULL_REPLACE"
Export data from Hive to RDBMS
sqoop export \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "retail_dba" \
--password "cloudera" \
--table hive_export \
--export-dir "/user/hive/warehouse/sqoop_import.db/departments" \
-m 1 \
--input-null-string nvl \
--input-null-non-string -1 \
--input-fields-terminated-by '\001' \
--input-lines-terminated-by '\n'
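'\001' (Ctrl-A) is Hive's default field delimiter for text tables, which is why it appears here. If you are unsure of a table's warehouse location or storage details, Hive can report them (a sketch, assuming the departments table from the export above):
hive -e "DESCRIBE FORMATTED sqoop_import.departments;"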
Sqoop job creation, merging data, and optimizing Sqoop jobs
Data merging using Sqoop
sqoop merge \
--merge-key department_id \
--new-data "/user/cloudera/sqoop_merge" \
--onto "/user/cloudera/sqoop_import/departments" \
--target-dir "/user/cloudera/sqoop_import/sqoop_merge_target" \
--class-name departments \
--jar-file /tmp/sqoop-cloudera/compile/a9643b6f412b3e44e3a84e6f7ba0be1f/departments.jar
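--class-name and --jar-file point at the record class Sqoop generated for the table during an earlier import. If that jar is gone, it can be regenerated with the codegen tool, which prints the path of the compiled jar (a sketch using the connection details above; the output path varies per run):
sqoop codegen \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table departments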
Create a Sqoop job
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--target-dir "/user/cloudera/sqoop_job_dir" \
-m 1

sqoop job --list
sqoop job --show sqoop_job
sqoop job --exec sqoop_job
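By default, sqoop job --exec prompts for the database password on every run. For unattended execution, one option is to keep the password in a protected file and pass --password-file instead of --password when creating the job (a sketch; the HDFS path and job name are assumptions):
echo -n "cloudera" > .mysql.pwd
hdfs dfs -put .mysql.pwd /user/cloudera/.mysql.pwd
hdfs dfs -chmod 400 /user/cloudera/.mysql.pwd
sqoop job --create sqoop_job_pw \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password-file /user/cloudera/.mysql.pwd \
--table orders \
--target-dir "/user/cloudera/sqoop_job_pw_dir" \
-m 1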
Sqoop job optimization
- Number of mappers
Increasing the number of mappers speeds up the transfer, because the import is divided into parts that run in parallel.

- Balanced load on mappers
Split on a column whose values are uniformly distributed (an integer column is preferred); this gives every mapper a balanced share of the rows, so the transfer finishes faster.
--split-by "customer_id"

- Number of connections to the RDBMS
You cannot increase the number of mappers blindly (to 100 or more, say). Each mapper opens its own database connection, so the RDBMS must allow that many concurrent connections; otherwise the database itself becomes the bottleneck.

- Use --direct mode
If the RDBMS supports it, import with --direct. Direct mode streams data through the database's native dump utilities instead of generic JDBC, which is usually faster; see the combined example after this list.
1. --direct is only supported for MySQL and PostgreSQL.
2. Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns.
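Putting these flags together, a sketch of a tuned import against the retail_db used earlier (the mapper count of 8 and the target directory are illustrative assumptions):
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table customers \
--target-dir "/user/cloudera/customers_direct" \
--split-by "customer_id" \
--num-mappers 8 \
--direct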