1. With imports, use the output line formatting options (--fields-terminated-by, --enclosed-by, --escaped-by) wherever possible to ensure accurate data transfer.
2. Use your judgement when setting the number of mappers so that the import is appropriately parallelized without increasing overall completion time (by default, 4 map tasks run in parallel).
3. Use direct connectors (--direct) where available for better performance.
4. With imports, use a boundary query (--boundary-query) for better performance.
5. When importing into Hive with dynamic partitions, think through the partition criteria and the number of files generated; you don't want too many small files on your cluster, and there is a limit on the number of partitions on each node.
6. Be cognizant of the number of concurrent connections the database allows; use --fetch-size to control the number of records read from the database at a time, and factor in the number of parallel tasks.
7. Do not use the same table for import and export.
8. Use an options file (--options-file) for reusability.
9. Be aware of Sqoop's case-sensitivity nuances; it can save you time you would otherwise spend troubleshooting issues.
10. For exports, use a staging table (--staging-table) where possible; during the development phase, it will help with troubleshooting.
11. Use the verbose argument (--verbose) for more detail when troubleshooting.
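To illustrate tips 1 and 2, here is a sketch of an import that sets the output formatting options explicitly and raises the mapper count; the connect string, credentials, table, and target directory are all hypothetical placeholders:

```shell
# Hypothetical import: explicit delimiters plus 8 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoopuser -P \
  --table orders \
  --fields-terminated-by ',' \
  --enclosed-by '"' \
  --escaped-by '\\' \
  --num-mappers 8 \
  --target-dir /data/orders
```

Raising --num-mappers beyond the default of 4 only helps if the database and cluster can sustain the extra concurrent connections.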
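For tip 4, a boundary query overrides the min/max query Sqoop would otherwise run to compute split ranges, which can help when the default full-table min/max is expensive. The column and table names below are illustrative:

```shell
# Hypothetical boundary query: supply the split range ourselves
# instead of letting Sqoop scan for MIN/MAX on its own.
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoopuser -P \
  --table orders \
  --split-by order_id \
  --boundary-query 'SELECT MIN(order_id), MAX(order_id) FROM orders'
```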
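For tip 6, the fetch size and mapper count together determine how hard the import hits the database; a rough sketch, again with placeholder connection details:

```shell
# Hypothetical import tuned for database load: 10,000 rows fetched
# per round trip, across 4 concurrent map tasks (= 4 connections).
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoopuser -P \
  --table orders \
  --fetch-size 10000 \
  --num-mappers 4
```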
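For tip 8, an options file holds arguments you repeat across jobs, one token per line, with #-comments allowed. The file contents below are a made-up example:

```shell
# Create a reusable options file (hypothetical values).
cat > import-common.txt <<'EOF'
# Shared Sqoop connection options
import
--connect
jdbc:mysql://dbhost/salesdb
--username
sqoopuser
EOF

# Reuse it; job-specific arguments stay on the command line.
sqoop --options-file import-common.txt --table orders -P
```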
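For tip 10, an export through a staging table commits to the real target only after all map tasks succeed, so a partial failure never leaves the target half-written. The table and directory names here are assumptions; note the staging table must already exist with the same schema as the target:

```shell
# Hypothetical export via a staging table; --clear-staging-table
# empties the staging table before the job runs.
sqoop export \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoopuser -P \
  --table orders_summary \
  --staging-table orders_summary_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/orders_summary \
  --input-fields-terminated-by ','
```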
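Related to tip 5, a Hive import that targets a specific partition looks like the sketch below (all names hypothetical). Note that --hive-partition-key/--hive-partition-value load a single static partition per run; keeping each run scoped to one partition is one way to stay in control of how many files and partitions you create:

```shell
# Hypothetical Hive import into one static partition.
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoopuser -P \
  --table orders \
  --hive-import \
  --hive-table sales.orders \
  --hive-partition-key order_month \
  --hive-partition-value '2014-06'
```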
My blogs on Sqoop:
Blog 1: Import from MySQL into HDFS
Blog 2: Import from MySQL into Hive
Blog 3: Export from HDFS and Hive into MySQL
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2
If you want to share your thoughts/updates, email me at firstname.lastname@example.org.