Friday, June 14, 2013

Apache Sqoop - Part 5: Scheduling Sqoop jobs in Oozie

What's covered in the blog?

1. Documentation on the Oozie sqoop action
2. A sample workflow (against syslog-generated logs) that includes an oozie sqoop action (export from Hive to mysql). Instructions on loading the sample data and running the workflow are provided, along with some notes based on my learnings.

To make an Oozie workflow containing a Sqoop action event driven - triggered by time or data availability - read my blog on Oozie coordinator jobs.

Versions covered:
Oozie 3.3.0; Sqoop 1.4.2; Mysql 5.1.69

My blogs on Sqoop:

Blog 1: Import from mysql into HDFS
Blog 2: Import from mysql into Hive
Blog 3: Export from HDFS and Hive into mysql
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2

My blogs on Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.


About the oozie sqoop action

Apache Oozie documentation on Sqoop action:
http://archive.cloudera.com/cdh/3/oozie/DG_SqoopActionExtension.html

Salient features of the sqoop action:
Excerpt from the Apache documentation:
- The sqoop action runs a Sqoop job synchronously.
- The information to be included in the oozie sqoop action are the job-tracker, the name-node, and Sqoop command or arg elements, as well as configuration.
- A prepare node can be included to do any prep work, including hdfs actions; it is executed prior to the sqoop job.
- Sqoop configuration can be specified with a file, using the job-xml element, and inline, using the configuration elements.
- Oozie EL expressions can be used in the inline configuration. Property values specified in the configuration element override values specified in the job-xml file.
- Note that the Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
- As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Sqoop job.


Sqoop command:
The Sqoop command can be specified either using the command element or multiple arg elements.
- When using the command element, Oozie will split the command on every space into multiple arguments.
- When using the arg elements, Oozie will pass each argument value as an argument to Sqoop. The arg variant should be used when there are spaces within a single argument.
- All of the above elements can be parameterized (templatized) using EL expressions.
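To illustrate the difference, here is a minimal sketch (not part of the sample application below) of the same hypothetical Sqoop import expressed both ways; the connect string, table name and paths are placeholder values.

<!-- command variant: Oozie splits the string on every space -->
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <command>import --connect jdbc:mysql://myHost/myDb --table myTable --target-dir /user/myUser/myTable -m 1</command>
</sqoop>

<!-- arg variant: each value is passed to Sqoop as-is, so a single argument may contain spaces (e.g. a free-form query) -->
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <arg>import</arg>
  <arg>--connect</arg>
  <arg>jdbc:mysql://myHost/myDb</arg>
  <arg>--query</arg>
  <arg>select * from myTable where $CONDITIONS</arg>
  <arg>--target-dir</arg>
  <arg>/user/myUser/myTable</arg>
  <arg>-m</arg>
  <arg>1</arg>
</sqoop>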


Components of a workflow with sqoop action:

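Below is a minimal sketch of a workflow.xml containing a sqoop action, pulling together the elements described above (prepare, job-xml, inline configuration, command) with the start, kill and end controls; the connect string, table name and paths are placeholders and not part of the sample application that follows.

<workflow-app name="SqoopActionSkeleton" xmlns="uri:oozie:workflow:0.1">
  <start to="sqoopAction"/>
  <action name="sqoopAction">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- optional prep work (hdfs actions), executed before the Sqoop job -->
      <prepare>
        <delete path="${nameNode}/user/myUser/staging/myTable"/>
      </prepare>
      <!-- optional: Sqoop/Hadoop configuration from a file... -->
      <job-xml>conf/sqoop-site.xml</job-xml>
      <!-- ...and/or inline; inline values override those in the job-xml file -->
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <!-- connect string, table and target directory below are placeholder values -->
      <command>import --connect jdbc:mysql://myHost/myDb --table myTable --target-dir /user/myUser/staging/myTable -m 1</command>
    </sqoop>
    <ok to="end"/>
    <error to="killJob"/>
  </action>
  <kill name="killJob">
    <message>Sqoop action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>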
Sample application

Highlights:
For this exercise, I have loaded some syslog-generated logs into HDFS and created a Hive table over them.
I have also created a table in mysql that will be the destination of a report (a Hive query) we will run.


Pictorial representation of the workflow:

Sample program:

This gist includes components of a simple workflow application (oozie 3.3.0) that
pipes data from a Hive table to mysql.
The sample application includes:
--------------------------------
1. Oozie actions: sqoop action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Prep tasks in Hive
6. Commands to deploy workflow, submit and run workflow
7. Oozie web console - screenshots from sample program execution
Pictorial overview of workflow:
-------------------------------
Available at:
http://hadooped.blogspot.com/2013/06/apache-sqoop-part-5-scheduling-sqoop.html
Includes:
---------
01-WorkflowComponents
02a-DataAndCodeDownload
02b-DataSampleAndStructure
03-HdfsLoadCommands
04a-HiveSetupTasks
04b-HiveReport-SourceForSqoop
05-mySqlSetUp
06-SqoopTaskStandAloneTest
07-JobPropertiesFile
08-WorkflowXMLFile
09-Oozie commands
10-OutputInMysql
11-Oozie web console screenshots
01. Workflow Components:
------------------------
1. job.properties
File containing:
a) parameter and value declarations that are referenced in the workflows, and
b) environment information referenced by Oozie to run the workflow including name node, job tracker, workflow application path etc
2. workflow.xml
Workflow definition file
02a.1. Download location:
-------------------------
Github-
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you have access issues.
02a.2. Directory structure applicable for this gist:
----------------------------------------------------
oozieProject
    data
        airawat-syslog
            <<node>>
                <<year>>
                    <<month>>
                        messages
    workflowSqoopAction
        job.properties
        workflow.xml
        hive-site.xml
02b.1. Sample data
--------------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
02b.2. Structure
-----------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
03. Hdfs commands
-------------------
1) Load the data and workflow application to hadoop
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
2) Validate load
$ hadoop fs -ls -R oozieProject |awk '{print $8}'
...should match the directory listing in section 02a.2, above.
************************
**04a. Hive setup tasks
************************
a) Create table:
hive>
CREATE EXTERNAL TABLE SysLogEvents(
month_name STRING,
day STRING,
time STRING,
host STRING,
event STRING,
log STRING)
PARTITIONED BY(node string,year int, month int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)"
)
stored as textfile;
b) Create and load partitions:
Note: Replace my user ID "akhanolk" with yours
hive >
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dev01",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dev01/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dev01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dev01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn02",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn02/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn02",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn02/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn03",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn03/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn03",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn03/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-jt01",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-jt01/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-jt01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-jt01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-nn01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-nn01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-vms",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-vms/2013/05/';
c) Hive QL to test the data loaded:
hive>
--Print headers
set hive.cli.print.header=true;
--Need to add this jar for MR to work; your environment may not need it
add jar hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
--Sample query
hive> select * from SysLogEvents limit 2;
OK
month_name day time host event log node year month
Apr 23 16:14:10 cdh-dev01 spice-vdagent[5657]: Missing virtio device '/dev/virtio-ports/com.redhat.spice.0': No such file or directory cdh-dev01 2013 04
Apr 23 16:14:12 cdh-dev01 pulseaudio[5705]: pid.c: Daemon already running. cdh-dev01 2013 04
Time taken: 13.241
*********************************
**04b-HiveReport-SourceForSqoop
*********************************
--Run the following in Hive; it creates and populates the source table for the sqoop action.
use default;
drop table if exists eventsgranularreport;
CREATE TABLE IF NOT EXISTS eventsgranularreport(
year int,
month int,
day int,
event STRING,
occurrence int)
ROW FORMAT DELIMITED
FIELDS TERMINATED by ','
LINES TERMINATED by '\n';
INSERT OVERWRITE TABLE eventsgranularreport
select Year,Month,Day,Event,Count(*) Occurrence from SysLogEvents group by year,month,day,event order by event asc,year,month,day desc;
****************************
**Prep tasks - mysql
****************************
mysql> create database airawat;
mysql> use airawat;
mysql> CREATE TABLE IF NOT EXISTS eventsgranularreport(
year INTEGER,
month INTEGER,
day INTEGER,
event VARCHAR(100),
occurrence INTEGER);
mysql> create user 'devUser'@'machineName' identified by 'myPwd';
GRANT ALL PRIVILEGES ON airawat.* TO 'devUser'@'machineName' WITH GRANT OPTION;
**************************************
**Sqoop command - standalone test
**************************************
a) Command run on Sqoop client:
sqoop export \
--connect jdbc:mysql://cdh-dev01/airawat \
--username devUser \
--password myPwd \
--table eventsgranularreport \
--direct \
--enclosed-by '\"' \
--export-dir /user/hive/warehouse/eventsgranularreport
b) Output in mysql:
mysql> select * from eventsgranularreport;
+------+-------+------+------------------------------+------------+
| year | month | day  | event                        | occurrence |
+------+-------+------+------------------------------+------------+
| 2013 |     5 | NULL | NULL                         |         25 |
| 2013 |     5 |    3 | NetworkManager[1232]:        |          1 |
| 2013 |     5 |    7 | NetworkManager[1243]:        |          1 |
| 2013 |     5 |    7 | NetworkManager[1284]:        |          1 |
.....
| 2013 |     5 |    7 | pulseaudio[2074]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2076]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2106]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2116]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2135]:            |          1 |
+------+-------+------+------------------------------+------------+
104 rows in set (0.08 sec)
mysql> delete from eventsgranularreport;
Query OK, 104 rows affected (0.02 sec)
#*****************************
# 07: job.properties
#*****************************
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}/oozieProject
appPath=${oozieProjectRoot}/workflowSqoopAction
oozie.wf.application.path=${appPath}
emailToAddress=akhanolk@cdh-dev01
#**********end****************************
- Note: The way the name node and job tracker are specified in job.properties should match the oozie-site.xml properties oozie.service.HadoopAccessorService.nameNode.whitelist and oozie.service.HadoopAccessorService.jobTracker.whitelist, respectively.
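For illustration, a sketch of what the corresponding oozie-site.xml entries might look like, assuming the name node and job tracker values used in this job.properties (verify the exact values against your own oozie-site.xml):

<!-- illustrative oozie-site.xml entries; values must list the exact authorities used in job.properties -->
<property>
  <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
  <value>cdh-nn01.chuntikhadoop.com:8020</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
  <value>cdh-jt01:8021</value>
</property>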
Read up on the Oozie share lib and install it if you don't already have it:
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/
<!--******************************************-->
<!--08. workflow.xml -->
<!--******************************************-->
<workflow-app name="WorkflowWithSqoopAction" xmlns="uri:oozie:workflow:0.1">
<start to="sqoopAction"/>
<action name="sqoopAction">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>export --connect jdbc:mysql://cdh-dev01/airawat --username devUser --password myPwd --table eventsgranularreport --direct --enclosed-by " --export-dir /user/hive/warehouse/eventsgranularreport</command>
</sqoop>
<ok to="end"/>
<error to="killJob"/>
</action>
<kill name="killJob">
<message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end" />
</workflow-app>
09. Oozie commands
-------------------
Note: Replace the oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowSqoopAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000012-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000012-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000012-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000012-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowSqoopAction/job.properties -rerun 0000012-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000012-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000012-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
*******************
***Output in mysql
*******************
mysql> select * from eventsgranularreport;
+------+-------+------+------------------------------+------------+
| year | month | day  | event                        | occurrence |
+------+-------+------+------------------------------+------------+
| 2013 |     5 | NULL | NULL                         |         25 |
| 2013 |     5 |    3 | NetworkManager[1232]:        |          1 |
| 2013 |     5 |    7 | NetworkManager[1243]:        |          1 |
| 2013 |     5 |    7 | NetworkManager[1284]:        |          1 |
.....
| 2013 |     5 |    7 | pulseaudio[2074]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2076]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2106]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2116]:            |          1 |
| 2013 |     5 |    7 | pulseaudio[2135]:            |          1 |
+------+-------+------+------------------------------+------------+
104 rows in set (0.08 sec)
11. Available at:
-----------------
http://hadooped.blogspot.com/2013/06/apache-sqoop-part-5-scheduling-sqoop.html

Oozie web console:

Screenshots..
