Hooked on Hadoop: Apache Sqoop - Part 4: Sqoop best practices

Tuesday, June 4, 2013

Apache Sqoop - Part 4: Sqoop best practices

This is a work in progress and covers Sqoop best practices, gathered from various sites and literature. The blog will be updated with anything new learned, corrections or contributions from readers.

Best practices:
1. With imports, use the output line formatting options wherever possible, for accuracy of data transfer - "--enclosed -by", "--fields-terminated-by", "--escaped-by"
2. Use your judgement when you provide number of mappers to ensure you appropriately parallelize the import without increasing overall completion time. (default - 4 tasks are run parallel)
3. Use direct connectors where available for better performance
4. With imports, use a boundary query for better performance
5. When importing into Hive and using dynamic partitions, think through partition criteria and number of files generated...you dont want too many small files on your cluster; Also, there is a limit on the number of partitions on each node.
6. Be cognizant of the configuration of concurrent connections allowed to the database; Use fetch size for controlling number of records to be read from the database, and also factor in the number of parallel tasks.
7. Do not use the same table for import and export.
8. Use an options file for reusability
9. Be aware of case sensitivity nuances of Sqoop - you might save on time you would spend trouble-shooting issues.
10. For exports, use a staging table where possible, during development phase, it will help with troubleshooting.
11. Use the verbose argument (--verbose) for more information during trouble-shooting.

Trouble-shoting tips:
http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.2-cdh4.2.0/SqoopUserGuide.html#_troubleshooting

My blogs on Sqoop:

Blog 1: Import from mysql into HDFS
Blog 2: Import from mysql into Hive
Blog 3: Export from HDFS and Hive into mysql
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

4 comments:

developers-spotMarch 19, 2014 at 11:16 AM
Funtastic......blog..thanks a lot Anagha
ReplyDelete
Replies
UnknownSeptember 26, 2016 at 6:40 AM
Hi Anagha. You mentioned above "Use direct connectors where available for better performance". What does that mean? I know Sqoop has the ability to use vendor specific connectors. Are you recommending to use those or to go outside of Sqoop all together if a connector is available?
ReplyDelete
Replies
UnknownJuly 8, 2018 at 5:22 AM
thakyou it vry nice blog for beginners
https://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/
ReplyDelete
Replies
RenuAugust 17, 2018 at 3:09 AM
Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.

https://www.emexotechnologies.com/online-courses/big-data-hadoop-training-in-electronic-city/

ReplyDelete
Replies

Add comment

Hooked on Hadoop

Tuesday, June 4, 2013

Apache Sqoop - Part 4: Sqoop best practices

4 comments:

Search

Blog archive

Popular Posts

Total Pageviews