Friday, May 31, 2013

Apache Sqoop - Part 1: Import data from mysql into HDFS

Apache Sqoop


Apache Sqoop is a tool designed for efficiently transferring bulk data, in a distributed manner, between Apache Hadoop and structured datastores such as relational databases, enterprise data warehouses, and NoSQL systems.  Sqoop can import data into HDFS, Hive, and HBase, and export data back out to an RDBMS, in an automated fashion, optionally leveraging Oozie for scheduling.  It has a connector-based architecture that supports plugins providing connectivity to new external systems.

Behind the scenes, the dataset being transferred is split into partitions, and a map-only job is launched, with each mapper responsible for transferring the slice of the dataset assigned to it.  Sqoop uses the database metadata to infer column types and handles the data in a type-safe manner.


The Apache documentation includes diagrams of the import process and the export process; they are not reproduced here, so refer to the Apache Sqoop documentation for the visuals.

Supported databases:

Database      Version    --direct support?    Connect string matches
HSQLDB        1.8.0+     No                   jdbc:hsqldb:*//
MySQL         5.0+       Yes                  jdbc:mysql://
Oracle        10.2.0+    No                   jdbc:oracle:*//
PostgreSQL    8.3+       Yes (import only)    jdbc:postgresql://

A) What's covered in this post:

This post is mostly notes to myself from what I have learned, with examples that can be tried out.

I used the documentation at the link below for my self-study:
http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.2-cdh4.2.0/SqoopUserGuide.html

Versions covered:
Sqoop 1.4.2 with MySQL 5.1.69

Topics covered:
B) MySql database setup and sample data
C) Installing Sqoop
D) Download and save the MySQL driver
E) sqoop list commands
F) Importing data into HDFS
    F1. Importing a table into HDFS - basic import
    F2. Executing imports with an options file for static information
    F3. Import all rows of a table in mySQL, but specific columns of the table
    F4. Import all columns, filter rows using where clause
    F5. Import with a free form query without where clause
    F6. Import with a free form query with where clause
    F7. Controlling parallelism and what's with the $CONDITIONS in the where clause?
    F8. Direct connector
    F9. Import formats
    F10. Split by
    F11. Boundary query
    F12. Fetch size
    F13. Compression
    F14. Incremental imports
    F15. Output line formatting options
    F16. Import all tables
G) Other functionality
    G1. Direct and quick queries or inserts and updates with sqoop eval
    G2. Sqoop code-gen
    G3. Sqoop merge


Subsequent blogs will cover importing into Hive and HBase, and export out of HDFS into mysql.


My blogs on Sqoop:

Blog 1: Import from mysql into HDFS
Blog 2: Import from mysql into Hive
Blog 3: Export from HDFS and Hive into mysql
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.


B) MySql Setup:

Step 1: Installing MySql and logging onto MySql

1. Install MySql server
sudo yum install mysql-server

2. Start MySQL
sudo service mysqld start

3. Set root password
Run the command below and follow steps:
sudo /usr/bin/mysql_secure_installation

4. Install MySql client
sudo yum install mysql

5. Check what got installed
sudo rpm -qa|grep -i mysql

Here is the output from my installation:
----------------------------------------------------
mysql-server-5.1.69-1.el6_4.x86_64
mysql-5.1.69-1.el6_4.x86_64
perl-DBD-MySQL-4.013-3.el6.x86_64
mysql-libs-5.1.69-1.el6_4.x86_64
----------------------------------------------------

6. Verify that MySQL is running
sudo service mysqld status

7. Configure the MySQL service to start at boot time
sudo chkconfig --levels 235 mysqld on

8. Login to MySql
To log in as root, enter the command below at the Unix prompt and supply the password when prompted:
mysql -u root -p

9. Create a user
create user 'airawat'@'localhost' identified by 'myPassword';

10. Login as user
mysql -u airawat -p

Step 2: Load sample data

a) Load the sample data available at the link below, following instructions detailed at the site.
http://dev.mysql.com/doc/employee/en/index.html

b) While logged in as root, use the following command to load the employees database:
mysql -u root -p -t < employees.sql

c) Validate the install by running through the following commands in the MySql shell-

show databases;  -- should list employees
use employees;   -- switch to the employees database
show tables;     -- lists the tables, which can be validated against the instructions for the sample data setup

d) Next, still logged in as root, grant the user access to the database.
grant ALL on employees.* to airawat@localhost identified by 'myPassword';

e) Exit out of root, log in as the user you set up, and get familiar with the tables.
mysql -u airawat -p employees
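
Once logged in, a few quick queries help with getting familiar with the schema - a minimal sketch; the expected count is based on the sample data used throughout this post:

mysql> show tables;                          -- the sample tables (departments, employees, salaries, ...)
mysql> describe employees;                   -- emp_no, birth_date, first_name, last_name, gender, hire_date
mysql> select count(*) from employees;       -- should return 300024
mysql> select * from departments limit 5;    -- small table, easy to eyeball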

Now we have the data available to start playing with Sqoop.


C) Installing Sqoop

Refer to the Apache Sqoop website for installation instructions.
Sqoop is a client-side tool, so it needs to be installed only on the node from which you wish to run Sqoop.
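
To verify the installation, run the following on that node - a quick sanity check; the exact output depends on your distribution:

$ sqoop version    # prints the Sqoop version (1.4.2-cdh4.2.0 in the setup used for this post)
$ sqoop help       # lists the available tools: import, export, eval, codegen, merge, etc.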


D) Download and save the MySQL driver

Download the MySQL JDBC driver (mysql-connector-java-5.1.25-bin.jar) from
http://dev.mysql.com/downloads/connector/j/
and place it in $SQOOP_HOME/lib on the node running Sqoop.

Note: If you installed CDH, copy it to
/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/
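
In practice the steps look something like the sketch below; the download URL and version are illustrative (grab the actual link from the Connector/J page above), and the destination should be $SQOOP_HOME/lib if you are not on a CDH parcel install:

# download and unpack Connector/J (file name is an example)
$ wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.25.tar.gz
$ tar -xzf mysql-connector-java-5.1.25.tar.gz

# copy the jar into Sqoop's lib directory
$ sudo cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar \
  /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/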


E) sqoop list commands

Run these commands at the Unix prompt, on the node where Sqoop is installed.

E.1. List databases

Lists the databases on your MySQL server.

$ sqoop list-databases --connect jdbc:mysql://<<mysql-server>>/employees --username airawat --password myPassword
.
.
.
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
employees
test

E.2. List tables

Lists the tables in your MySQL database.

$ sqoop list-tables --connect jdbc:mysql://<<mysql-server>>/employees --username airawat --password myPassword
.
.
.
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
departments
dept_emp
dept_manager
employees
employees_exp_stg
employees_export
salaries
titles

F) Importing data from MySQL into HDFS

Replace "airawat-mySqlServer-node" with the host name of the node running mySQL server, replace login credentials and target directory.

F1. Importing a table into HDFS - basic import

$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employees
.
.
.
.9139 KB/sec)
13/05/31 22:32:25 INFO mapreduce.ImportJobBase: Retrieved 300024 records

F2. Executing imports with an options file for static information

Rather than repeating the connection-related arguments with every import command, you can pass an options file as an argument to sqoop.  Create a text file as follows and save it locally on the node where you run the Sqoop client.

Note: This post does not cover it, but you can also create a password file and pass it with the --password-file <<filename>> argument; a rough sketch follows.
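
A sketch of the password-file approach (not tested in this post, and it may require a newer Sqoop release than 1.4.2; the file paths are illustrative) - the point is to keep the password out of the command line and out of process listings:

# store the password in a file readable only by you
$ echo -n "myPWD" > /home/airawat/.mysql-pass
$ chmod 400 /home/airawat/.mysql-pass

# reference the file instead of using --password
$ sqoop import \
--connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password-file file:///home/airawat/.mysql-pass \
--table departments \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/departmentsPwdFile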

F2.1. Sample Options file:
___________________________________________________________________________
$ vi SqoopImportOptions.txt 
#
#Options file for sqoop import
#

import
--connect
jdbc:mysql://airawat-mySqlServer-node/employees
--username
myUID
--password
myPwd

#
#All other commands should be specified in the command line
___________________________________________________________________________

F2.2. Command to execute import, using an options file:


Note: Provide the proper path for the options file

F2.2.1. The command

$ sqoop --options-file SqoopImportOptions.txt \
--table departments \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/departments
.
.
.
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Transferred 153 bytes in 26.2453 seconds (5.8296 bytes/sec)
13/05/31 22:48:55 INFO mapreduce.ImportJobBase: Retrieved 9 records.


The -m argument specifies the number of mappers.  The departments table has only a handful of records, so I set it to 1.


F2.2.2. Files created in HDFS:

$ hadoop fs -ls -R sqoop-mysql/
drwxr-xr-x   - airawat airawat          0 2013-05-31 22:48 sqoop-mysql/departments
-rw-r--r--   3 airawat airawat          0 2013-05-31 22:48 sqoop-mysql/departments/_SUCCESS
drwxr-xr-x   - airawat airawat          0 2013-05-31 22:48 sqoop-mysql/departments/_logs
drwxr-xr-x   - airawat airawat          0 2013-05-31 22:48 sqoop-mysql/departments/_logs/history
-rw-r--r--   3 airawat airawat      79467 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/cdh-jt01_1369839495962_job_201305290958_0062_conf.xml
-rw-r--r--   3 airawat airawat      12441 2013-05-31 22:48 sqoop-mysql/departments/_logs/history/job_201305290958_0062_1370058514473_ airawat_departments.jar
-rw-r--r--   3 airawat airawat        153 2013-05-31 22:48 sqoop-mysql/departments/part-m-00000


F2.2.3. Data file contents:

$ hadoop fs -cat sqoop-mysql/departments/part-m-00000 | more

d009,Customer Service
d005,Development
d002,Finance
d003,Human Resources
d001,Marketing
d004,Production
d006,Quality Management
d008,Research
d007,Sales


F3. Import all rows of a table in mySQL, but specific columns of the table

Note: This did not work for me; I am providing the statement as-is.  I need to check whether this is a defect in my syntax or in Sqoop - it may be related to the case of the column names as defined in the database versus in the sqoop import statement (a hedged retry is sketched after the error output below).


$ sqoop --options-file SqoopImportOptions.txt \
--table  dept_emp \
--columns "EMP_NO,DEPT_NO,FROM_DATE,TO_DATE" \
--as-textfile \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/DeptEmp

Error: 
13/05/31 23:01:53 ERROR util.SqlTypeMap: It seems like you are looking up a column that does not
13/05/31 23:01:53 ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
13/05/31 23:01:53 ERROR util.SqlTypeMap: correct column names in Sqoop options.
13/05/31 23:01:53 ERROR tool.ImportTool: Imported Failed: column not found: ‘EMP_NO'
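
A hedged retry, assuming the failure comes from the column-name case (the sample employees schema defines these columns in lowercase) and/or from curly quotes sneaking into the --columns value when the command was pasted:

$ sqoop --options-file SqoopImportOptions.txt \
--table dept_emp \
--columns "emp_no,dept_no,from_date,to_date" \
--as-textfile \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/DeptEmp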


F4. Import all columns, filter rows using where clause

 $ sqoop --options-file SqoopImportOptions.txt \
--table employees  \
--where "emp_no > 499948" \
--as-textfile \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeGtTest

F5. Import with a free form query without where clause

 $ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where  $CONDITIONS' \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeFrfrmQrySmpl2

(The case of the column names needs to match the case used to create the table, or the import fails.)

F6. Import with a free form query with where clause

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 20000 AND $CONDITIONS' \
-m 1 \
--target-dir /user/airawat/sqoop-mysql/employeeFrfrmQry1

F7. Controlling parallelism and what's with the $CONDITIONS in the where clause?

This section is straight from the Apache User Guide.

$CONDITIONS

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

Controlling parallelism
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. In that case, you should explicitly choose a different column with the --split-by argument, for example, --split-by employee_id.

Note: Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
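
Putting this together - a minimal sketch (not from the original run) that imports the employees table with four parallel mappers, splitting on emp_no; the target directory is arbitrary:

$ sqoop --options-file SqoopImportOptions.txt \
--table employees \
--split-by emp_no \
--num-mappers 4 \
--target-dir /user/airawat/sqoop-mysql/employeesParallel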

F8. Direct connector

By default, the import process uses JDBC.  Performance can be improved by using database-specific, native data movement tools; for example, MySQL provides the mysqldump tool, which can export data from MySQL very quickly.  By providing the --direct argument, we specify that Sqoop should attempt the direct import channel.

Note: Currently, direct mode does not support imports of large object columns.

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 20000 AND $CONDITIONS' \
-m 1 \
--direct \
--target-dir /user/airawat/sqoop-mysql/employeeUsingDirect

F9. Import formats

With MySQL, text file is the only format supported; Avro and SequenceFile formatted imports are feasible with other RDBMSs - refer to the Apache Sqoop documentation for more information.

Text file is the default format.

F10. Split by

Refer to the section on controlling parallelism (F7).

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/SplitByExampleImport

F11. Boundary query

Again, this relates to controlling parallelism.


By default, Sqoop uses the query select min(<split-by>), max(<split-by>) from <table name> to find the boundaries for creating splits. In some cases this query is not optimal, so you can specify an arbitrary query returning two numeric columns using the --boundary-query argument.
(Note: This did not work for me - I got an error.  I need to try again to see whether it is an error on my end or Sqoop-related.)

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--boundary-query "SELECT MIN(EMP_NO), MAX(EMP_NO) from employees" \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/BoundaryQuerySample


F12. Fetch size

This argument tells Sqoop how many entries to read from the database at a time.
$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--fetch-size=50000 \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/FetchSize

F13. Compression

Use the --compress argument to enable compression.  If you don't specify a compression codec (--compression-codec), the default, gzip, is used.

The command:
$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
-z \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/CompressedSample

The output:
$ hadoop fs -ls -R sqoop-mysql/CompressedSample | grep part*

-rw-r--r--   3 airawat airawat     896377 2013-05-31 23:49 sqoop-mysql/CompressedSample/part-m-00000.gz
-rw-r--r--   3 airawat airawat     499564 2013-05-31 23:49 sqoop-mysql/CompressedSample/part-m-00001.gz
-rw-r--r--   3 airawat airawat     409199 2013-05-31 23:49 sqoop-mysql/CompressedSample/part-m-00002.gz
-rw-r--r--   3 airawat airawat     907330 2013-05-31 23:49 sqoop-mysql/CompressedSample/part-m-00003.gz


F14. Incremental imports

F14.1. Prep

Import part of the employees table ahead of time:

The command:

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where EMP_NO < 15000 AND $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/IncrementalImports

The number of records imported:
$ hadoop fs -ls -R sqoop-mysql/IncrementalImports |grep part* | awk '{print $8}' |xargs hadoop fs -cat | wc -l

4999

F14.2. Run the incremental import

The three arguments listed below need to be specified.

Arguments        
--check-column (col): Specifies the column to be examined when determining which rows to import.
--incremental (mode): Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value): Specifies the maximum value of the check column from the previous import.

Command

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--check-column EMP_NO \
--incremental append \
--last-value 14999 \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/IncrementalImports

Record count in mysql:

mysql> select count(*) from employees;

+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.12 sec)

Record count in HDFS:

$ hadoop fs -ls -R sqoop-mysql/IncrementalImports |grep part* | awk '{print $8}' |xargs hadoop fs -cat | wc -l

300024

F15. Output line formatting options


Refer to the Apache Sqoop documentation for more on this topic.


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--fields-terminated-by , \
--escaped-by \\ \
--enclosed-by '\"' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/LineFormattingOptions

There is also an argument called --mysql-delimiters.  It uses MySQL's default delimiter set: fields terminated by , ; lines terminated by \n ; escaped by \ ; optionally enclosed by '.  A sketch of its use follows.
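
A sketch of using it (a variation I did not run; the target directory is arbitrary) - --mysql-delimiters takes the place of the individual line formatting arguments used above:

$ sqoop --options-file SqoopImportOptions.txt \
--query 'select EMP_NO,FIRST_NAME,LAST_NAME from employees where $CONDITIONS' \
--mysql-delimiters \
--split-by EMP_NO \
--target-dir /user/airawat/sqoop-mysql/MysqlDelimiters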

F16. Import all tables


Command:
$ sqoop --options-file SqoopImportAllTablesOptions.txt \
--direct \
--warehouse-dir sqoop-mysql/EmployeeDatabase

Options file content:
$ more SqoopImportAllTablesOptions.txt
______________________________________________

#
#Options file for sqoop import
#

import-all-tables
--connect
jdbc:mysql://airawat-mySqlServer-node/employees
--username
myUID
--password
myPWD

#
#All other commands should be specified in the command line
______________________________________________


Files generated:
$ hadoop fs -ls -R sqoop-mysql/EmployeeDatabase/*/part* | awk '{print $8}'


sqoop-mysql/EmployeeDatabase/departments/part-m-00000
sqoop-mysql/EmployeeDatabase/departments/part-m-00001
sqoop-mysql/EmployeeDatabase/departments/part-m-00002
sqoop-mysql/EmployeeDatabase/departments/part-m-00003
sqoop-mysql/EmployeeDatabase/dept_emp/part-m-00000
sqoop-mysql/EmployeeDatabase/dept_emp/part-m-00001
sqoop-mysql/EmployeeDatabase/dept_emp/part-m-00002
sqoop-mysql/EmployeeDatabase/dept_emp/part-m-00003
sqoop-mysql/EmployeeDatabase/dept_manager/part-m-00000
sqoop-mysql/EmployeeDa...........

G1. Direct and quick queries or inserts and updates with sqoop eval

From the Apache Sqoop user guide - "The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console. This allows users to preview their import queries to ensure they import the data they expect."


G1.1. Query:

$ sqoop eval --connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--query "select * from employees limit 2"

---------------------------------------------------------------------------------
| emp_no      | birth_date | first_name     | last_name        | gender | hire_date  |
---------------------------------------------------------------------------------
| 10001       | 1953-09-02 | Georgi         | Facello          | M | 1986-06-26 |
| 10002       | 1964-06-02 | Bezalel        | Simmel           | F | 1985-11-21 | 

G1.2. Insert:

sqoop eval --connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
-e "insert into employees_export values(550000,'1977-08-08','Mouse','Mickey','M','1999-04-12')"

G2. Sqoop code-gen

The codegen tool generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. For example, if Java source is lost, it can be recreated. New versions of a class can be created which use different delimiters between fields, and so on.


$ hadoop fs -mkdir sqoop-mysql/jars

$ sqoop codegen --connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
--outdir /user/airawat/sqoop-mysql/jars

Note: Sqoop could not create the output directory; I created it (above) and tried again, and it failed yet again.  I need to look into this - a possible workaround is sketched after the output below.

13/05/31 16:19:24 ERROR orm.CompilationManager: Could not make directory: /user/airawat/sqoop-mysql/jars
13/05/31 16:19:24 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-airawat/compile/c88afaebb5744b4b2e45b6119540834a/employees.jar
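
A possible workaround, assuming --outdir expects a local filesystem path rather than an HDFS path (which would explain why the hadoop fs -mkdir above did not help) - untested here, and the local directory is arbitrary:

# create a local directory for the generated .java/.class/.jar files
$ mkdir -p /home/airawat/sqoop-jars

$ sqoop codegen --connect jdbc:mysql://airawat-mySqlServer-node/employees \
--username myUID \
--password myPWD \
--table employees \
--outdir /home/airawat/sqoop-jars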


G3. Sqoop merge

The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

G3.1. Create test data in Mysql

G3.1.1 - Initial dataset

mysql> create table initial_emp as
select emp_no,birth_date, first_name,last_name,gender,hire_date  from employees where emp_no <= 100000
union
select emp_no,birth_date, 'null' as first_name,last_name,gender,hire_date  from employees where emp_no > 100000 and emp_no <= 300000;

I created a table with fewer records than the employees table, and with 'null' as first_name for some rows.  The next import will include the records from the employees table with emp_no > 100000 and <= 300000, as well as the records with emp_no > 300000.  After the merge, we should see a final dataset matching the full employees table.

G3.1.2 - Incremental dataset

mysql> create table final_emp as
select emp_no,birth_date,first_name,last_name,gender,hire_date  from employees where emp_no > 100000 and emp_no <= 300000
union
select emp_no,birth_date,first_name,last_name,gender,hire_date  from employees where emp_no > 300000;
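
A quick sanity check on the two staging tables before importing; the expected counts below are taken from the HDFS record counts later in this section:

mysql> select count(*) from initial_emp;                            -- expect 200024 (emp_no <= 300000)
mysql> select count(*) from final_emp;                              -- expect 210024 (emp_no > 100000)
mysql> select count(*) from initial_emp where first_name = 'null';  -- expect 110024 placeholder rows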

G3.2. Import the first dataset into HDFS


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select emp_no,birth_date, first_name,last_name,gender,hire_date  from initial_emp where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/InitialDataSet

$ hadoop fs -ls -R sqoop-mysql/InitialDataSet |grep part* | awk '{print $8}' |xargs hadoop fs -cat | wc -l

200024

$ hadoop fs -ls -R sqoop-mysql/InitialDataSet |grep part* | awk '{print $8}' | xargs hadoop fs -cat | grep 'null' | wc -l

110024

G3.3. Import the last dataset into HDFS


$ sqoop --options-file SqoopImportOptions.txt \
--query 'select emp_no,birth_date,first_name,last_name,gender, hire_date  from final_emp where $CONDITIONS' \
--split-by EMP_NO \
--direct \
--target-dir /user/airawat/sqoop-mysql/FinalDataSet


$ hadoop fs -ls -R sqoop-mysql/FinalDataSet |grep part* | awk '{print $8}' |xargs hadoop fs -cat | wc -l

210024


G3.4. Generate jar and class file for employee table


$ sqoop codegen --connect jdbc:mysql://cdh-dev01/employees \
--username myUID \
--password myPWD \
--table employees \
--outdir /user/airawat/sqoop-mysql/jars
.
.
.
13/06/03 10:29:18 ERROR orm.CompilationManager: Could not make directory: /user/airawat/sqoop-mysql/jars
13/06/03 10:29:18 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-airawat/compile/879394521045bc924ad9321fe46374bc/employees.jar

Files created:
$ ls /tmp/sqoop-airawat/compile/879394521045bc924ad9321fe46374bc/
employees.class  employees.jar  employees.java

Copy files to your home directory:
cp /tmp/sqoop-airawat/compile/879394521045bc924ad9321fe46374bc/* .

G3.5. Execute the merge


$ sqoop merge --new-data sqoop-mysql/FinalDataSet --onto sqoop-mysql/InitialDataSet --target-dir sqoop-mysql/MergedDataSet \
    --jar-file employees.jar --class-name employees --merge-key emp_no

Note: When I tried running this command with "EMP_NO" as the merge key instead of "emp_no", I got errors.

$ hadoop fs -ls -R sqoop-mysql/MergedDataSet |grep part* | awk '{print $8}' |xargs hadoop fs -cat | wc -l

300024

We have the expected number of records (300024).  Now let's check whether any records have 'null' in them - they should not.

$ hadoop fs -ls -R sqoop-mysql/MergedDataSet |grep part* | awk '{print $8}' | xargs hadoop fs -cat | grep 'null' | wc -l

0

A quick look at the data...

$ hadoop fs -cat sqoop-mysql/MergedDataSet/part-r-00000 | more

100000,1956-01-11,Hiroyasu,Emden,M,1991-07-02
100002,1957-03-04,Claudi,Kolinko,F,1988-02-20
100004,1960-04-16,Avishai,Nitsch,M,1986-01-03
100006,1956-07-13,Janalee,Himler,F,1986-01-15
100008,1953-05-14,Otmar,Selvestrel,M,1987-05-05
100011,1956-07-20,Shmuel,Birge,M,1989-11-23

Looks good.


H) Other database-specific Sqoop arguments/functionality

Available at:
http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.2-cdh4.2.0/SqoopUserGuide.html#_compatibility_notes


That's it for this post.  In my next post, I will cover Hive.  Subsequent posts will cover exports, integration with Oozie, and finally integration with HBase.  Hope this blog has been helpful. 
