Tuesday, June 4, 2013

Apache Sqoop - Part 3: Export from HDFS/Hive into mysql

What's in the blog?

My notes on exporting data out of HDFS and Hive into mysql, with examples you can try out.  My first blog on Apache Sqoop covers mysql installation and sample data setup; some of the examples below reference that sample data.

Versions covered:
Sqoop (1.4.2) with mysql (5.1.69)

Topics covered:

A. Exporting out of HDFS into mysql
    A1. Sample data prep
    A2.1. Export in insert mode, using staging table
    A2.2. Export in update mode
    A2.3. Export in upsert mode

B. Exporting out of Hive into mysql - in insert mode
    B1. Sample data prep
    B2. Exporting non-partitioned Hive table into mysql
    B3. Exporting partitioned Hive table into mysql

C. Exporting out of Hive into mysql in update mode
    C1. Sample data prep
    C2. Sqoop export command for updates

D. Exporting out of Hive into mysql in upsert mode

My blogs on Sqoop:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

Apache Sqoop documentation on the "export" tool

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.


Exports may fail for a number of reasons:
  • Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
  • Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
  • Attempting to parse an incomplete or malformed record from the HDFS source data
  • Attempting to parse records using incorrect delimiters
  • Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
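Because a failed export can leave the target table partially populated, the insert-mode examples below use a staging table so that rows only land in the final table once the whole export has succeeded. For reference, a minimal sketch of that pattern (same placeholder connection details as the rest of this post; the -D properties are Sqoop's standard export batching knobs and are an optional tuning step, so verify them against your Sqoop version):

sqoop export \
-Dsqoop.export.records.per.statement=100 \
-Dsqoop.export.statements.per.transaction=100 \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export \
--staging-table employees_exp_stg \
--clear-staging-table \
--export-dir /user/airawat/sqoop-mysql/Employees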


A. Exporting out of HDFS into mysql

A1. Prep work

A1.1. Create a table in mysql:


mysql> CREATE TABLE employees_export (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

A1.2. Create a stage table in mysql:

mysql> CREATE TABLE employees_exp_stg (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

A1.3 Import some data into HDFS:

sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,birth_date,first_name,last_name,gender,hire_date from employees where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/Employees
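The import above relies on an options file, SqoopImportOptions.txt, to hold the tool name and connection arguments so they don't have to be repeated on every command. The actual file isn't shown in this post; a plausible minimal version (one option or value per line, with the same placeholder connection details used elsewhere here) would look like:

$ cat SqoopImportOptions.txt
# options file for sqoop import
import
--connect
jdbc:mysql://airawat-mysqlserver-node/employees
--username
myUID
--password
myPWD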


A2. Export functionality

A2.1. Export in insert mode, using staging table

Pretty straightforward...as you can see.

A2.1.1. Sqoop command

sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--staging-table employees_exp_stg \
--clear-staging-table \
-m 4 \
--export-dir /user/airawat/sqoop-mysql/Employees
.
.
.
13/06/04 09:54:18 INFO manager.SqlManager: Migrated 300024 records from `employees_exp_stg` to `employees_export`


A2.1.2. Results

mysql> select * from employees_export limit 1;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
| 200000 | 1960-01-11 | Selwyn     | Koshiba   | M      | 1987-06-05 |
+--------+------------+------------+-----------+--------+------------+


mysql> select count(*) from employees_export;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+

mysql> select * from employees_exp_stg;
Empty set (0.00 sec)

Note: Even without the --clear-staging-table argument, I found the staging table empty after the export; the command output, however, clearly indicates that the staging table was used (the records were first loaded into employees_exp_stg and then migrated to employees_export).
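Related: Sqoop expects the staging table to be empty at the start of the export unless --clear-staging-table is specified, so if an earlier run left rows behind you can clear it yourself first. A minimal sketch, assuming the mysql client is available on the node you run sqoop from:

$ mysql -u myUID -p -e "truncate table employees.employees_exp_stg"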


A2.2. Export in update mode

A2.2.1. Prep:
To try this functionality out, I am going to set the hire date to null for some records.

mysql>  update employees_export set hire_date = null where emp_no >400000;
Query OK, 99999 rows affected, 65535 warnings (1.26 sec)
Rows matched: 99999  Changed: 99999  Warnings: 99999

A2.2.2. Sqoop command:
Next, we will export the same data to the same table, and see if the hire date is updated.

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--direct \
--update-key emp_no \
--update-mode updateonly \
--export-dir /user/airawat/sqoop-mysql/Employees


A2.2.3. Results:
mysql> select count(*) from employees_export where  hire_date is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.22 sec)
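Worth noting: in updateonly mode Sqoop generates only UPDATE statements, so HDFS rows whose emp_no has no match in the target table are simply skipped rather than inserted. A quick way to convince yourself (a sketch, not part of the run above):

$ mysql -u myUID -p -e "delete from employees.employees_export where emp_no > 400000"
# re-run the sqoop export from A2.2.2 (updateonly), then:
$ mysql -u myUID -p -e "select count(*) from employees.employees_export"
# the count stays at 200025 rather than returning to 300024; the deleted rows are not re-inserted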


A2.3. Export in upsert mode

Upsert = insert if the row does not exist, update if it does.
Note: Upsert does not work with the mysql direct connector, so --direct is omitted in the command below.

A2.3.1. Prep:
I will set the hire date to null for a few records in mysql, delete a few records, and then run an upsert to try this functionality out.


mysql> update employees_export set hire_date = null where emp_no >200000;

mysql> delete from employees_export where emp_no >400000;

mysql> select 'Number of records with hire date blank (should become zero)' Note,count(*) Counts from employees_export where hire_date is null
    -> union 
    -> select 'Number of records (should get to 300024)' Note,count(*) from employees_export;
+-------------------------------------------------------------+--------+
| Note                                                        | Counts |
+-------------------------------------------------------------+--------+
| Number of records with hire date blank (should become zero) | 100000 |
| Number of records (should get to 300024)                    | 200025 |
+-------------------------------------------------------------+--------+


A2.3.2. Sqoop command:

sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--update-key emp_no \
--update-mode allowinsert \
--export-dir /user/airawat/sqoop-mysql/Employees

A2.3.3. Results:
mysql> select 'Number of records with hire date blank (should become zero)' Note,count(*) Counts from employees_export where hire_date is null union  select 'Number of records (should get to 300024)' Note,count(*) from employees_export;

+-------------------------------------------------------------+--------+
| Note                                                        | Counts |
+-------------------------------------------------------------+--------+
| Number of records with hire date blank (should become zero) |      0 |
| Number of records (should get to 300024)                    | 300024 |
+-------------------------------------------------------------+--------+
2 rows in set (0.22 sec)

B. Exporting out of Hive into mysql in insert mode

B1. Prep work

B1.1. Create a table in the mysql employees database that we will export a partitioned Hive table into

mysql> CREATE TABLE employees_export_hive (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  hire_date date NOT NULL,
  gender enum('M','F') NOT NULL,
  PRIMARY KEY (emp_no)
);

B1.2. Create a table in the mysql employees database that we will export a non-partitioned Hive table into

mysql> create table departments_export_hive as select * from departments;

mysql> delete from departments_export_hive;
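A small aside: CREATE TABLE ... AS SELECT in mysql copies the column definitions and data but not the primary key of departments. If you want the export target to keep the same keys (useful if you later rely on a unique key on dept_no), CREATE TABLE ... LIKE is an alternative; a sketch:

mysql> create table departments_export_hive like departments;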

B1.3. Hive table without partitions to use for the export

I'll run an import from mysql into Hive, which we will then use to export back into mysql.
This is a bit circular, but the intention is to learn the export, so bear with me... :)


$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments \
--direct \
-m 1 \
--hive-import \
--create-hive-table \
--hive-table departments_mysql \
--target-dir /user/hive/warehouse/employees \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\


This creates a table called departments_mysql with 9 records.
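To sanity-check the import (not shown in the original run), a count from the Hive side should come back with 9; this assumes the hive CLI is available on the path:

$ hive -e "select count(*) from departments_mysql;"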

B1.4. Hive table with partitions to use for the export

Partition 1:

$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="M" AND $CONDITIONS'  \
--direct \
--split-by emp_no \
--hive-import \
--create-hive-table \
--hive-table employees_import_parts \
--hive-partition-key gender \
--hive-partition-value 'M' \
--optionally-enclosed-by '\"' \
--target-dir /user/hive/warehouse/employee-parts 


Partition 2:

$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="F" AND $CONDITIONS'  \
--direct \
-m 6 \
--split-by emp_no \
--hive-import \
--hive-overwrite \
--hive-table employees_import_parts \
--hive-partition-key gender \
--hive-partition-value 'F' \
--optionally-enclosed-by '\"' \
--target-dir /user/hive/warehouse/employee-parts_F 

Note: The --optionally-enclosed-by '\"' argument is a must; without it, the emp_no field was showing up as null in Hive.

Files generated:

$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'

/user/hive/warehouse/employees_import_parts/gender=F/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00005
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005

Record count by gender:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part* | wc -l
120051

$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/* | wc -l
179973

Record count for employees in total:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/*/part* | wc -l
300024

B2. Exporting non-partitioned Hive table into mysql

Source:  hive-table departments_mysql
Destination: mysql-table departments_export_hive

B2.1. Source data:
hive> select * from departments_mysql;
OK
"d009" "Customer Service"
"d005" "Development"
"d002" "Finance"
"d003" "Human Resources"
"d001" "Marketing"
"d004" "Production"
"d006" "Quality Management"
"d008" "Research"
"d007" "Sales"
Time taken: 2.959 seconds


B2.2. sqoop command:
$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/departments_mysql
.
.
.
13/06/04 11:25:27 INFO mapreduce.ExportJobBase: Transferred 1.0869 KB in 69.1858 seconds (16.0871 bytes/sec)
13/06/04 11:25:27 INFO mapreduce.ExportJobBase: Exported 9 records.


B2.3. Results:
mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d008    | Research           |
| d007    | Sales              |
| d004    | Production         |
| d006    | Quality Management |
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d009    | Customer Service   |
| d005    | Development        |
+---------+--------------------+
9 rows in set (0.00 sec)

Note: Without the --enclosed-by argument, I found that the dept_no values were getting mangled (the enclosing quote character was being picked up as part of the value).
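You can see why by looking at the export source directly: the files backing departments_mysql hold the quoted, comma-separated values, so Sqoop has to be told about the enclosing character. A quick check (the part-file name follows the usual convention and may differ in your warehouse):

$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | head -3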

B3. Exporting partitioned Hive table into mysql

Note 1: With Sqoop 1.4.2, we need to issue a separate sqoop export statement for each partition (a scripted version is sketched at the end of this section).
Note 2: The partition key (gender) is not stored in the exported files, so the export will not populate it; you have to issue an update statement for it afterwards.

Source:  hive-table employees_import_parts
Destination: mysql-table employees_export_hive


B3.1. Sqoop command - export partition where gender is M:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/employees_import_parts/gender=M

B3.2. Execute partition key update:

mysql> update employees_export_hive set gender='M' where (gender="" or gender is null);

Query OK, 179973 rows affected (1.01 sec)

B3.3. Export partition where gender is F:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/employees_import_parts/gender=F

B3.4. Execute partition key update:

mysql> update employees_export_hive set gender='F' where (gender="" or gender is null);

Query OK, 120051 rows affected (1.02 sec)
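Since every partition needs its own export followed by an update of the partition key, the two runs above lend themselves to a small script. A minimal sketch, using the same placeholder credentials as the rest of this post and assuming the mysql client is available on the machine running sqoop:

for g in M F; do
  sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export_hive \
  --direct \
  --enclosed-by '\"' \
  --export-dir /user/hive/warehouse/employees_import_parts/gender=$g

  mysql -u myUID -pmyPWD employees \
  -e "update employees_export_hive set gender='$g' where (gender='' or gender is null);"
done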


C. Exporting out of Hive into mysql in update mode

C1. Prep work

C1.1. Issue the following update in mysql against departments_export_hive, to try out the update functionality

mysql> update departments_export_hive set dept_name="Procrastrinating" where dept_no="d001";
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Procrastrinating   |
| d008    | Research           |
| d007    | Sales              |
| d009    | Customer Service   |
| d005    | Development        |
| d004    | Production         |
| d006    | Quality Management |
+---------+--------------------+
9 rows in set (0.00 sec)

C2. Sqoop export command:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--enclosed-by '\"' \
--update-key "dept_no" \
--update-mode updateonly \
--export-dir /user/hive/warehouse/departments_mysql

C3. Results:

mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d008    | Research           |
| d007    | Sales              |
| d009    | Customer Service   |
| d005    | Development        |
| d004    | Production         |
| d006    | Quality Management |
+---------+--------------------+
9 rows in set (0.00 sec)

D. Exporting out of Hive into mysql in upsert mode

This command did not work for me. With Sqoop 1.4.2, I could not get the upsert to go through; the documentation I found indicates that this functionality is not supported for mysql, though I have read that it works for Oracle.

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--enclosed-by '\"' \
--update-key "dept_no" \
--update-mode allowinsert \
--export-dir /user/hive/warehouse/departments_mysql
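One possible workaround (my own sketch, not something from the original post): export in plain insert mode into an empty work table and then merge it into the target with mysql's INSERT ... ON DUPLICATE KEY UPDATE. It assumes a table departments_export_stg with the same columns as departments_export_hive, and a primary or unique key on dept_no in the target:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_stg \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/departments_mysql

$ mysql -u myUID -p employees -e "
insert into departments_export_hive (dept_no, dept_name)
select dept_no, dept_name from departments_export_stg
on duplicate key update dept_name = values(dept_name);"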


This concludes this blog.
The next blog covers some best practices.



Comments:

  1. How to transfer data from HDFS into mysql? I am using Hadoop 1.0.4 and Sqoop 1.4.2. Please help me transfer the data.
    I have done moving the data from mysql to HDFS... now I need HDFS into mysql.

  2. Hi,
    I have covered the export in this blog.
    It's just that in the first step, I have done an import only to use the data to demonstrate the export. :)
    Confusing I guess.

    But it's here...
    Section A2 onwards.

    Anagha

  3. Is there any way to get desired number of column data from HDFS to sql table

    Replies
    1. yes, you can use sqoop's --query option to export selected columns

    2. sqoop import \
      --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
      --split-by a.id --target-dir /user/foo/joinresults

  4. Hi,

    Could you send me the export query for exporting one Hive table into multiple SQL Server tables using sqoop?

    Thanks
    Naveen

  5. Hi Anagha,

    I'm facing the following issue on exporting data from HDFS to MySQL


    java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer

    I'm not sure what I'm missing here. Using the following sqoop command:

    sqoop export \
    --connect jdbc:mysql://hostname/dbname \
    --username uname \
    --password password \
    --table table_name \
    --staging-table stg \
    --clear-staging-table \
    -m 4 \
    --export-dir /user/hive/warehouse/dir

    Thanks for all your help !!

    --Sanjeev

  6. Hi Anagha,

    I want to truncate the data in the target table before I export the new data into it;
    is that possible using sqoop commands?

    Thanks,
    Ghanendra

  7. Hi, can you please tell me why the staging table is used and what it does?

    Thanks
    Hareesh

  8. Hello,
    How are you? You have a nice name.
    Using hive query:
    insert overwrite directory '/user/hadoop/tempdata1'
    > select * from actors;

    This put the data into the HDFS directory tempdata1 with field delimiter 0001 and line terminator \n, which are the Hive defaults.
    I need to upload this file into a MySQL table. Can you please expand your thoughts on that?

    Does this statement work?

    ./sqoop export --connect jdbc:mysql://192.168.56.1/sakila --username biadmin -password biadmin --table filmacted_byactor --export-dir /user/hadoop/tempdata1 --input-fields-terminated-by '/0001' --input-lines-terminated-by '\n' -m 1

  9. How do I export a table from Hive into a new mysql table (without creating the table schema in mysql first)?