Tuesday, June 4, 2013

Apache Sqoop - Part 3: Export from HDFS/Hive into mysql

What's in the blog?

My notes on exporting data out of HDFS and Hive into mysql, with examples you can try out.  My first blog on Apache Sqoop covers mysql installation and sample data setup; some of the examples below reference that sample data.

Versions covered:
Sqoop (1.4.2) with mysql (5.1.69)

Topics covered:

A. Exporting out of HDFS into mysql
    A1. Sample data prep
    A2.1. Export in insert mode, using staging table
    A2.2. Export in update mode
    A2.3. Export in upsert mode

B. Exporting out of Hive into mysql - in insert mode
    B1. Sample data prep
    B2. Exporting non-partitioned Hive table into mysql
    B3. Exporting partitioned Hive table into mysql

C. Exporting out of Hive into mysql in update mode
    C1. Sample data prep
    C2. Sqoop export command for updates

D. Exporting out of Hive into mysql in upsert mode

My blogs on Sqoop:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

Apache Sqoop documentation on the "export" tool

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.


Exports may fail for a number of reasons:
  • Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
  • Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
  • Attempting to parse an incomplete or malformed record from the HDFS source data
  • Attempting to parse records using incorrect delimiters
  • Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
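Because a failed export can leave the target table partially populated, the insert-mode examples below use a staging table so that rows only land in the final table once the whole export has succeeded. For reference, a minimal sketch of that pattern (same placeholder connection details as the rest of this post; the -D properties are Sqoop's standard export batching knobs and are an optional tuning step, so verify them against your Sqoop version):

sqoop export \
-Dsqoop.export.records.per.statement=100 \
-Dsqoop.export.statements.per.transaction=100 \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export \
--staging-table employees_exp_stg \
--clear-staging-table \
--export-dir /user/airawat/sqoop-mysql/Employees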


A. Exporting out of HDFS into mysql

A1. Prep work

A1.1. Create a table in mysql:


mysql> CREATE TABLE employees_export (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

A1.2. Create a stage table in mysql:

mysql> CREATE TABLE employees_exp_stg (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  gender enum('M','F') NOT NULL,
  hire_date date NOT NULL,
  PRIMARY KEY (emp_no)
);

A1.3 Import some data into HDFS:

sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,birth_date,first_name,last_name,gender,hire_date from employees where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/Employees
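The import above relies on an options file, SqoopImportOptions.txt, to hold the tool name and connection arguments so they don't have to be repeated on every command. The actual file isn't shown in this post; a plausible minimal version (one option or value per line, with the same placeholder connection details used elsewhere here) would look like:

$ cat SqoopImportOptions.txt
# options file for sqoop import
import
--connect
jdbc:mysql://airawat-mysqlserver-node/employees
--username
myUID
--password
myPWD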


A2. Export functionality

A2.1. Export in insert mode, using staging table

Pretty straightforward...as you can see.

A2.1.1. Sqoop command

sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--staging-table employees_exp_stg \
--clear-staging-table \
-m 4 \
--export-dir /user/airawat/sqoop-mysql/Employees
.
.
.
13/06/04 09:54:18 INFO manager.SqlManager: Migrated 300024 records from `employees_exp_stg` to `employees_export`


A2.1.2. Results

mysql> select * from employees_export limit 1;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
| 200000 | 1960-01-11 | Selwyn     | Koshiba   | M      | 1987-06-05 |
+--------+------------+------------+-----------+--------+------------+


mysql> select count(*) from employees_export;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+

mysql> select * from employees_exp_stg;
Empty set (0.00 sec)

Note: Even without the --clear-staging-table argument, I found the staging table empty after the export; the command output, however, clearly indicates that the staging table was used (the records were first loaded into employees_exp_stg and then migrated to employees_export).
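Related: Sqoop expects the staging table to be empty at the start of the export unless --clear-staging-table is specified, so if an earlier run left rows behind you can clear it yourself first. A minimal sketch, assuming the mysql client is available on the node you run sqoop from:

$ mysql -u myUID -p -e "truncate table employees.employees_exp_stg"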


A2.2. Export in update mode

A2.2.1. Prep:
To try this functionality out, I am going to set the hire date to null for some records.

mysql>  update employees_export set hire_date = null where emp_no >400000;
Query OK, 99999 rows affected, 65535 warnings (1.26 sec)
Rows matched: 99999  Changed: 99999  Warnings: 99999

A2.2.2. Sqoop command:
Next, we will export the same data to the same table, and see if the hire date is updated.

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--direct \
--update-key emp_no \
--update-mode updateonly \
--export-dir /user/airawat/sqoop-mysql/Employees


A2.2.3. Results:
mysql> select count(*) from employees_export where  hire_date is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.22 sec)
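Worth noting: in updateonly mode Sqoop generates only UPDATE statements, so HDFS rows whose emp_no has no match in the target table are simply skipped rather than inserted. A quick way to convince yourself (a sketch, not part of the run above):

$ mysql -u myUID -p -e "delete from employees.employees_export where emp_no > 400000"
# re-run the sqoop export from A2.2.2 (updateonly), then:
$ mysql -u myUID -p -e "select count(*) from employees.employees_export"
# the count stays at 200025 rather than returning to 300024; the deleted rows are not re-inserted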


A2.3. Export in upsert mode

Upsert = insert if the row does not exist, update if it does.
Note: Upsert does not work with the mysql direct connector, so --direct is omitted in the command below.

A2.3.1. Prep:
I will set the hire date to null for a few records in mysql, delete a few records, and then run an upsert to try this functionality out.


mysql> update employees_export set hire_date = null where emp_no >200000;

mysql> delete from employees_export where emp_no >400000;

mysql> select 'Number of records with hire date blank (should become zero)' Note,count(*) Counts from employees_export where hire_date is null
    -> union 
    -> select 'Number of records (should get to 300024)' Note,count(*) from employees_export;
+-------------------------------------------------------------+--------+
| Note                                                        | Counts |
+-------------------------------------------------------------+--------+
| Number of records with hire date blank (should become zero) | 100000 |
| Number of records (should get to 300024)                    | 200025 |
+-------------------------------------------------------------+--------+


A2.3.2. Sqoop command:

sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export  \
--update-key emp_no \
--update-mode allowinsert \
--export-dir /user/airawat/sqoop-mysql/Employees

A2.3.3. Results:
mysql> select 'Number of records with hire date blank (should become zero)' Note,count(*) Counts from employees_export where hire_date is null union  select 'Number of records (should get to 300024)' Note,count(*) from employees_export;

+-------------------------------------------------------------+--------+
| Note                                                        | Counts |
+-------------------------------------------------------------+--------+
| Number of records with hire date blank (should become zero) |      0 |
| Number of records (should get to 300024)                    | 300024 |
+-------------------------------------------------------------+--------+
2 rows in set (0.22 sec)

B. Exporting out of Hive into mysql in insert mode

B1. Prep work

B1.1. Create a table in the mysql employees database that we will export a partitioned Hive table into

mysql> CREATE TABLE employees_export_hive (
  emp_no int(11) NOT NULL,
  birth_date date NOT NULL,
  first_name varchar(14) NOT NULL,
  last_name varchar(16) NOT NULL,
  hire_date date NOT NULL,
  gender enum('M','F') NOT NULL,
  PRIMARY KEY (emp_no)
);

B1.2. Create a table in the mysql employees database that we will export a non-partitioned Hive table into

mysql> create table departments_export_hive as select * from departments;

mysql> delete from departments_export_hive;
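A small aside: CREATE TABLE ... AS SELECT in mysql copies the column definitions and data but not the primary key of departments. If you want the export target to keep the same keys (useful if you later rely on a unique key on dept_no), CREATE TABLE ... LIKE is an alternative; a sketch:

mysql> create table departments_export_hive like departments;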

B1.3. Hive table without partitions to use for the export

I'll run an import from mysql into Hive, which we will then use to export back into mysql.
This is a bit circular, but the intention is to learn the export, so bear with me... :)


$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments \
--direct \
-m 1 \
--hive-import \
--create-hive-table \
--hive-table departments_mysql \
--target-dir /user/hive/warehouse/employees \
--enclosed-by '\"' \
--fields-terminated-by , \
--escaped-by \\


This creates a table called departments_mysql with 9 records.
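To sanity-check the import (not shown in the original run), a count from the Hive side should come back with 9; this assumes the hive CLI is available on the path:

$ hive -e "select count(*) from departments_mysql;"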

B1.4. Hive table with partitions to use for the export

Partition 1:

$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="M" AND $CONDITIONS'  \
--direct \
--split-by emp_no \
--hive-import \
--create-hive-table \
--hive-table employees_import_parts \
--hive-partition-key gender \
--hive-partition-value 'M' \
--optionally-enclosed-by '\"' \
--target-dir /user/hive/warehouse/employee-parts 


Partition 2:

$ sqoop import \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--query 'select emp_no,birth_date,first_name,last_name,hire_date from employees where gender="F" AND $CONDITIONS'  \
--direct \
-m 6 \
--split-by emp_no \
--hive-import \
--hive-overwrite \
--hive-table employees_import_parts \
--hive-partition-key gender \
--hive-partition-value 'F' \
--optionally-enclosed-by '\"' \
--target-dir /user/hive/warehouse/employee-parts_F 

Note: The --optionally-enclosed-by '\"' argument is a must; without it, the emp_no field was showing up as null in Hive.

Files generated:

$ hadoop fs -ls -R /user/hive/warehouse/employees_import_parts | grep /part* | awk '{print $8}'

/user/hive/warehouse/employees_import_parts/gender=F/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=F/part-m-00005
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00000
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00001
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00002
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00003
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00004
/user/hive/warehouse/employees_import_parts/gender=M/part-m-00005

Record count by gender:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=F/part* | wc -l
120051

$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/gender=M/* | wc -l
179973

Record count for employees in total:
$ hadoop fs -cat /user/hive/warehouse/employees_import_parts/*/part* | wc -l
300024

B2. Exporting non-partitioned Hive table into mysql

Source:  hive-table departments_mysql
Destination: mysql-table departments_export_hive

B2.1. Source data:
hive> select * from departments_mysql;
OK
"d009" "Customer Service"
"d005" "Development"
"d002" "Finance"
"d003" "Human Resources"
"d001" "Marketing"
"d004" "Production"
"d006" "Quality Management"
"d008" "Research"
"d007" "Sales"
Time taken: 2.959 seconds


B2.2. sqoop command:
$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/departments_mysql
.
.
.
13/06/04 11:25:27 INFO mapreduce.ExportJobBase: Transferred 1.0869 KB in 69.1858 seconds (16.0871 bytes/sec)
13/06/04 11:25:27 INFO mapreduce.ExportJobBase: Exported 9 records.


B2.3. Results:
mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d008    | Research           |
| d007    | Sales              |
| d004    | Production         |
| d006    | Quality Management |
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d009    | Customer Service   |
| d005    | Development        |
+---------+--------------------+
9 rows in set (0.00 sec)

Note: Without the --enclosed-by argument, I found that the dept_no values were getting mangled (the enclosing quote character was being picked up as part of the value).
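You can see why by looking at the export source directly: the files backing departments_mysql hold the quoted, comma-separated values, so Sqoop has to be told about the enclosing character. A quick check (the part-file name follows the usual convention and may differ in your warehouse):

$ hadoop fs -cat /user/hive/warehouse/departments_mysql/part-m-00000 | head -3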

B3. Exporting partitioned Hive table into mysql

Note 1: With Sqoop 1.4.2, we need to issue a separate sqoop export statement for each partition (a scripted version is sketched at the end of this section).
Note 2: The partition key (gender) is not stored in the exported files, so the export will not populate it; you have to issue an update statement for it afterwards.

Source:  hive-table employees_import_parts
Destination: mysql-table employees_export_hive


B3.1. Sqoop command - export partition where gender is M:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/employees_import_parts/gender=M

B3.2. Execute partition key update:

mysql> update employees_export_hive set gender='M' where (gender="" or gender is null);

Query OK, 179973 rows affected (1.01 sec)

B3.3. Export partition where gender is F:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table employees_export_hive  \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/employees_import_parts/gender=F

B3.4. Execute partition key update:

mysql> update employees_export_hive set gender='F' where (gender="" or gender is null);

Query OK, 120051 rows affected (1.02 sec)
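Since every partition needs its own export followed by an update of the partition key, the two runs above lend themselves to a small script. A minimal sketch, using the same placeholder credentials as the rest of this post and assuming the mysql client is available on the machine running sqoop:

for g in M F; do
  sqoop export \
  --connect jdbc:mysql://airawat-mysqlserver-node/employees \
  --username myUID \
  --password myPWD \
  --table employees_export_hive \
  --direct \
  --enclosed-by '\"' \
  --export-dir /user/hive/warehouse/employees_import_parts/gender=$g

  mysql -u myUID -pmyPWD employees \
  -e "update employees_export_hive set gender='$g' where (gender='' or gender is null);"
done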


C. Exporting out of Hive into mysql in update mode

C1. Prep work

C1.1. Issue the following update in mysql against departments_export_hive, to try out the update functionality

mysql> update departments_export_hive set dept_name="Procrastrinating" where dept_no="d001";
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Procrastrinating   |
| d008    | Research           |
| d007    | Sales              |
| d009    | Customer Service   |
| d005    | Development        |
| d004    | Production         |
| d006    | Quality Management |
+---------+--------------------+
9 rows in set (0.00 sec)

C2. Sqoop export command:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--enclosed-by '\"' \
--update-key "dept_no" \
--update-mode updateonly \
--export-dir /user/hive/warehouse/departments_mysql

C3. Results:

mysql> select * from departments_export_hive;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d008    | Research           |
| d007    | Sales              |
| d009    | Customer Service   |
| d005    | Development        |
| d004    | Production         |
| d006    | Quality Management |
+---------+--------------------+
9 rows in set (0.00 sec)

D. Exporting out of Hive into mysql in upsert mode

This command did not work for me. With Sqoop 1.4.2, I could not get the upsert to go through; the documentation I found indicates that this functionality is not supported for mysql, though I have read that it works for Oracle.

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_hive  \
--enclosed-by '\"' \
--update-key "dept_no" \
--update-mode allowinsert \
--export-dir /user/hive/warehouse/departments_mysql
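One possible workaround (my own sketch, not something from the original post): export in plain insert mode into an empty work table and then merge it into the target with mysql's INSERT ... ON DUPLICATE KEY UPDATE. It assumes a table departments_export_stg with the same columns as departments_export_hive, and a primary or unique key on dept_no in the target:

$ sqoop export \
--connect jdbc:mysql://airawat-mysqlserver-node/employees \
--username myUID \
--password myPWD \
--table departments_export_stg \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/departments_mysql

$ mysql -u myUID -p employees -e "
insert into departments_export_hive (dept_no, dept_name)
select dept_no, dept_name from departments_export_stg
on duplicate key update dept_name = values(dept_name);"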


This concludes this blog.
The next blog covers some best practices.



Comments:

  1. How to transfer data from HDFS into mysql? I am using Hadoop 1.0.4 and Sqoop 1.4.2. Please help me transfer the data.
    I have done moving the data from mysql to HDFS... now I need HDFS into mysql.

  2. Hi,
    I have covered the export in this blog.
    It's just that in the first step, I have done an import only to use the data to demonstrate the export. :)
    Confusing I guess.

    But it's here...
    Section A2 onwards.

    Anagha

  3. Is there any way to get desired number of column data from HDFS to sql table

    Replies
    1. yes, you can use sqoop's --query option to export selected columns

    2. sqoop import \
      --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
      --split-by a.id --target-dir /user/foo/joinresults

  4. Hi,

    Could you send me the export query for exporting one Hive table into multiple SQL Server tables using sqoop?

    Thanks
    Naveen

  5. Hi Anagha,

    I'm facing the following issue on exporting data from HDFS to MySQL


    java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer

    I'm not sure what I'm missing here. Using the following sqoop command:

    sqoop export \
    --connect jdbc:mysql://hostname/dbname \
    --username uname \
    --password password \
    --table table_name \
    --staging-table stg \
    --clear-staging-table \
    -m 4 \
    --export-dir /user/hive/warehouse/dir

    Thanks for all your help !!

    --Sanjeev

  6. Hi Anagha,

    I want to truncate the data in the target table before I export the new data into it;
    is that possible using sqoop commands?

    Thanks,
    Ghanendra

  7. Hi, can you please tell me why the staging table is used and what it does?

    Thanks
    Hareesh

  8. Hello,
    How are you? You have a nice name.
    Using hive query:
    insert overwrite directory '/user/hadoop/tempdata1'
    > select * from actors;

    This put the data into the HDFS directory tempdata1 with field delimiter 0001 and line terminator \n, which are the Hive defaults.
    I need to upload this file into a MySQL table. Can you please expand your thoughts on that?

    Does this statement work?

    ./sqoop export --connect jdbc:mysql://192.168.56.1/sakila --username biadmin -password biadmin --table filmacted_byactor --export-dir /user/hadoop/tempdata1 --input-fields-terminated-by '/0001' --input-lines-terminated-by '\n' -m 1

  9. How do I export a table from Hive into a new mysql table (without creating the table schema in mysql first)?