Tuesday, June 4, 2013

Apache Sqoop - Part 4: Sqoop best practices

This is a work in progress and covers Sqoop best practices, gathered from various sites and literature.  The blog will be updated with anything new learned, corrections or contributions from readers.

Best practices:
1.  With imports, use the output line formatting options wherever possible, for accuracy of data transfer - "--enclosed -by", "--fields-terminated-by",  "--escaped-by"
2.  Use your judgement when you provide number of mappers  to ensure you appropriately parallelize the import without increasing overall completion time. (default - 4 tasks are run parallel)
3.  Use direct connectors where available for better performance
4.  With imports, use a boundary query for better performance
5.  When importing into Hive and using dynamic partitions, think through partition criteria and number of files generated...you dont want too many small files on your cluster;  Also, there is a limit on the number of partitions on each node.
6.  Be cognizant of the configuration of concurrent connections allowed to the database; Use fetch size for controlling number of records to be read from the database, and also factor in the number of parallel tasks.
7.  Do not use the same table for import and export.
8.  Use an options file for reusability
9. Be aware of case sensitivity nuances of Sqoop - you might save on time you would spend trouble-shooting issues.
10.  For exports, use a staging table where possible, during development phase, it will help with troubleshooting.
11.  Use the verbose argument (--verbose) for more information during trouble-shooting.


Trouble-shoting tips:
http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.2-cdh4.2.0/SqoopUserGuide.html#_troubleshooting

My blogs on Sqoop:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.



3 comments:

  1. Funtastic......blog..thanks a lot Anagha

    ReplyDelete
  2. Hi Anagha. You mentioned above "Use direct connectors where available for better performance". What does that mean? I know Sqoop has the ability to use vendor specific connectors. Are you recommending to use those or to go outside of Sqoop all together if a connector is available?

    ReplyDelete
  3. thakyou it vry nice blog for beginners
    https://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/

    ReplyDelete