Best practices:
1. With imports, use the output line formatting options wherever possible, for accuracy of data transfer - "--enclosed -by", "--fields-terminated-by", "--escaped-by"
2. Use your judgement when you provide number of mappers to ensure you appropriately parallelize the import without increasing overall completion time. (default - 4 tasks are run parallel)
3. Use direct connectors where available for better performance
4. With imports, use a boundary query for better performance
5. When importing into Hive and using dynamic partitions, think through partition criteria and number of files generated...you dont want too many small files on your cluster; Also, there is a limit on the number of partitions on each node.
6. Be cognizant of the configuration of concurrent connections allowed to the database; Use fetch size for controlling number of records to be read from the database, and also factor in the number of parallel tasks.
7. Do not use the same table for import and export.
8. Use an options file for reusability
9. Be aware of case sensitivity nuances of Sqoop - you might save on time you would spend trouble-shooting issues.
10. For exports, use a staging table where possible, during development phase, it will help with troubleshooting.
11. Use the verbose argument (--verbose) for more information during trouble-shooting.
Trouble-shoting tips:
http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.2-cdh4.2.0/SqoopUserGuide.html#_troubleshooting
My blogs on Sqoop:
Blog 1: Import from mysql into HDFS
Blog 2: Import from mysql into Hive
Blog 3: Export from HDFS and Hive into mysql
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2
Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.
Funtastic......blog..thanks a lot Anagha
ReplyDeleteHi Anagha. You mentioned above "Use direct connectors where available for better performance". What does that mean? I know Sqoop has the ability to use vendor specific connectors. Are you recommending to use those or to go outside of Sqoop all together if a connector is available?
ReplyDeletethakyou it vry nice blog for beginners
ReplyDeletehttps://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/
Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
ReplyDeletehttps://www.emexotechnologies.com/online-courses/big-data-hadoop-training-in-electronic-city/