Monday, June 10, 2013

Apache Oozie - Part 2: Workflow - hive action

What's covered in the blog?

1. Documentation on the Oozie hive action
2. A sample workflow that includes an email action and a hive action (a query against some syslog-generated log files).

Version: 
Oozie 3.3.0

My other blogs on Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another
Blog 13: Oozie workflow - SSH action


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

About the Hive action

http://archive.cloudera.com/cdh4/cdh/4/oozie/DG_HiveActionExtension.html

Salient features of the hive action:

- Runs the specified Hive job synchronously (the workflow job waits until the Hive job completes before continuing to the next action).
- Can be configured to create or delete HDFS directories before starting the Hive job.
- Supports Hive scripts with parameter variables; their syntax is ${VARIABLES}.
- Hive configuration needs to be specified as part of the job submission. Oozie EL expressions can be used in the inline configuration, and property values specified in the configuration element override values specified in the job-xml file.
- Note that the Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
- As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Hive job. A sketch illustrating these features follows.
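
To make the features above concrete, here is a minimal, hypothetical hive action sketch; it is not part of the sample application below, and the prepare path, queue property, script name, UDF jar, and TARGET_TABLE parameter are illustrative only:

<action name="hiveActionSketch">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Optionally create/delete HDFS directories before the Hive job starts -->
        <prepare>
            <delete path="${nameNode}/user/${wf:user()}/reportStagingDir"/>
        </prepare>
        <job-xml>hive-site.xml</job-xml>
        <!-- Inline configuration; overrides values from job-xml.
             mapred.job.tracker and fs.default.name must NOT be set here. -->
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>someQuery.hql</script>
        <!-- Resolves ${TARGET_TABLE} inside someQuery.hql -->
        <param>TARGET_TABLE=eventsgranularreport</param>
        <!-- Ship a file to make it available to the Hive job -->
        <file>customUdf.jar</file>
    </hive>
    <ok to="nextAction"/>
    <error to="errorHandler"/>
</action>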

Components of a workflow with hive action:

For a workflow with (just a) hive action, the following are required:
1.  workflow.xml
2.  job.properties
3.  Any files, archives, or jars you want to add
4.  hive-site.xml
5.  Hive query scripts

Refer to the sample program below.

Sample program

Highlights:

The workflow application runs a report on data in Hive. The input is syslog-generated log data in a Hive table; the output is a Hive table containing the report results.

Pictorial overview of application:

[Diagram: pictorial overview of the workflow application]
Application:

This gist includes the components of a simple workflow application (Oozie 3.3.0) that generates
a report off of data in a Hive table; emails are sent out to notify designated users of the
success/failure of the workflow. The data consists of syslog-generated log files, and the Hive
table's row format SerDe is a regex SerDe.
The sample application includes:
--------------------------------
1. Oozie actions: hive action and email action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Commands to deploy workflow, submit and run workflow
6. Oozie web console - screenshots from sample program execution
Pictorial overview of workflow:
-------------------------------
Available at:
http://hadooped.blogspot.com/2013/06/apache-oozie-part-2-workflow-hive-action.html
01. Workflow Components:
------------------------
1. job.properties
File containing:
a) parameter and value declarations that are referenced in the workflows, and
b) environment information referenced by Oozie to run the workflow, including the name node, job tracker, workflow application path, etc.
2. workflow.xml
Workflow definition file
3. hive-site.xml
Hive configuration file
4. Report query file in hive query language
02a.1. Download location:
-----------------------
Github:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you have access issues.
02a.2. Directory structure applicable for this gist/blog/post:
---------------------------------------------------------------
oozieProject
    data
        airawat-syslog
            <<node>>
                <<year>>
                    <<month>>
                        messages
    workflowHiveAction
        job.properties
        workflow.xml
        hive-site.xml
        runHiveLogReport.hql
02b.1. Sample data
-------------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
02b.2. Structure
-----------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
03. Oozie SMTP configuration
----------------------------
Add the following to oozie-site.xml, and restart Oozie.
Replace the values with ones specific to your environment.
<!-- SMTP params -->
<property>
    <name>oozie.email.smtp.host</name>
    <value>cdh-dev01</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@cdh-dev01</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>false</value>
</property>
<property>
    <name>oozie.email.smtp.username</name>
    <value></value>
</property>
<property>
    <name>oozie.email.smtp.password</name>
    <value></value>
</property>
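
Before wiring this into Oozie, you may want to confirm that the Oozie server can reach the SMTP host; a quick, hypothetical sanity check from the Oozie server (assumes telnet is available):

$ telnet cdh-dev01 25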
#*****************************
# 04. job.properties
#*****************************
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}/oozieProject
appPath=${oozieProjectRoot}/workflowHiveAction
oozie.wf.application.path=${appPath}
emailToAddress=akhanolk@cdh-dev01
#*******End************************
Note: The line "oozie.wf.rerun.failnodes=true" is needed if you want to re-run the workflow; there is another config that can be used instead, which specifies the nodes to skip on re-run (see the sketch below). Review the Apache Oozie documentation for the same.
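
For instance, a hypothetical re-run configuration using the alternative property (the node name is from the workflow below; the two re-run properties are mutually exclusive, so use one or the other):

# alternative to oozie.wf.rerun.failnodes=true:
# comma-separated list of nodes to skip on re-run
oozie.wf.rerun.skip.nodes=hiveAction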
<!--******************************************-->
<!--05. workflow.xml -->
<!--******************************************-->
<workflow-app name="WorkflowWithHiveAction" xmlns="uri:oozie:workflow:0.1">
    <start to="hiveAction"/>
    <action name="hiveAction">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${appPath}/hive-site.xml</job-xml>
            <script>${appPath}/runHiveLogReport.hql</script>
        </hive>
        <ok to="sendEmailSuccess"/>
        <error to="sendEmailKillHive"/>
    </action>
    <action name="sendEmailSuccess">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>${emailToAddress}</to>
            <subject>Status of workflow ${wf:id()}</subject>
            <body>The workflow ${wf:id()} completed successfully</body>
        </email>
        <ok to="end"/>
        <error to="end"/>
    </action>
    <action name="sendEmailKillHive">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>${emailToAddress}</to>
            <subject>Status of workflow ${wf:id()}</subject>
            <body>The workflow ${wf:id()} had issues with the hive action and was killed. The error logged is: ${wf:errorMessage(wf:lastErrorNode())}</body>
        </email>
        <ok to="killJobAction"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error in hive action"</message>
    </kill>
    <end name="end" />
</workflow-app>
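
Before deploying, the workflow definition can be validated against the Oozie schema using the Oozie client:

$ oozie validate oozieProject/workflowHiveAction/workflow.xml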
06a. HDFS commands
-------------------
1) Load the data and workflow application to HDFS
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
2) Validate the load
$ hadoop fs -ls -R oozieProject | awk '{print $8}'
...should match the directory listing in section 02a.2 above.
************************
**06b. Hive setup tasks
************************
a) Create table:
hive>
CREATE EXTERNAL TABLE SysLogEvents(
month_name STRING,
day STRING,
time STRING,
host STRING,
event STRING,
log STRING)
PARTITIONED BY(node string,year int, month int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)"
)
stored as textfile;
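For reference, here is how the regex groups in the SerDe map to the table columns; this is just an annotated restatement of the "input.regex" above, not additional setup:

-- (\\w+)               -> month_name  (e.g. "May")
-- (\\d+)               -> day         (e.g. "3")
-- (\\d+:\\d+:\\d+)     -> time        (e.g. "11:52:54")
-- (\\w+\\W*\\w*)       -> host        (e.g. "cdh-dn03")
-- (.*?\\:)             -> event       (e.g. "init:")
-- (.*$)                -> log         (the rest of the line)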
b) Create and load partitions:
Note: Replace my user ID "akhanolk" with yours
hive >
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dev01",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dev01/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dev01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dev01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn02",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn02/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn02",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn02/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn03",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn03/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-dn03",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-dn03/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-jt01",year=2013, month=04)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-jt01/2013/04/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-jt01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-jt01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-nn01",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-nn01/2013/05/';
Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-vms",year=2013, month=05)
location '/user/akhanolk/oozieProject/data/airawat-syslog/cdh-vms/2013/05/';
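To confirm the partitions registered, a quick check:

hive> show partitions SysLogEvents;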
c) Hive QL to test the data loaded:
hive>
--Print headers
set hive.cli.print.header=true;
--Need to add this jar for MR to work...your env may not need it
add jar hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
--Sample query
hive> select * from SysLogEvents limit 2;
OK
month_name day time host event log node year month
Apr 23 16:14:10 cdh-dev01 spice-vdagent[5657]: Missing virtio device '/dev/virtio-ports/com.redhat.spice.0': No such file or directory cdh-dev01 2013 04
Apr 23 16:14:12 cdh-dev01 pulseaudio[5705]: pid.c: Daemon already running. cdh-dev01 2013 04
Time taken: 13.241 seconds
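Optionally, to verify that each partition actually contains data (this query is not part of the sample application, just a sanity check):

hive> select node, year, month, count(*) as cnt from SysLogEvents group by node, year, month;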
************************************************
**06c. Hive QL - runHiveLogReport.hql
************************************************
use default;
drop table if exists eventsgranularreport;
CREATE TABLE IF NOT EXISTS eventsgranularreport(
year int,
month int,
day int,
event STRING,
occurrence int)
ROW FORMAT DELIMITED
FIELDS TERMINATED by ','
LINES TERMINATED by '\n';
INSERT OVERWRITE TABLE eventsgranularreport
SELECT year, month, day, event, COUNT(*) AS occurrence
FROM SysLogEvents
GROUP BY year, month, day, event
ORDER BY event ASC, year, month, day DESC;
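
Tying back to the ${VARIABLES} feature noted at the start of this post: the report could be scoped with parameters instead of being hard-coded. A hypothetical variant (REPORT_YEAR is an illustrative name, not part of the sample application):

-- parameterized sketch; ${REPORT_YEAR} would be resolved by Oozie
INSERT OVERWRITE TABLE eventsgranularreport
SELECT year, month, day, event, COUNT(*) AS occurrence
FROM SysLogEvents
WHERE year = ${REPORT_YEAR}
GROUP BY year, month, day, event
ORDER BY event ASC, year, month, day DESC;

The value would be supplied in workflow.xml via a param element in the hive action, e.g. <param>REPORT_YEAR=2013</param>.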
07. Oozie commands
-------------------
Note: Replace the Oozie server and port with those specific to your cluster.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowHiveAction/job.properties -submit
job: 0000007-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000007-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000007-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000007-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000007-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowHiveAction/job.properties -rerun 0000007-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000007-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000007-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
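If you prefer to watch the job from the shell rather than the Oozie web console, a simple hypothetical polling loop (same server and job ID as above):

$ while true; do oozie job -oozie http://cdh-dev01:11000/oozie -info 0000007-130712212133144-oozie-oozi-W | grep -i status; sleep 30; done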
08. Output
-----------
hive> select * from eventsgranularreport;
OK
year month day event occurrence
2013 5 NULL NULL 25
2013 5 3 NetworkManager[1232]: 1
2013 5 7 NetworkManager[1243]: 1
2013 5 7 NetworkManager[1284]: 1
2013 5 7 NetworkManager[1292]: 1
2013 5 7 NetworkManager[1300]: 1
2013 5 3 NetworkManager[1342]: 1
2013 5 7 NetworkManager[1422]: 1
2013 5 3 console-kit-daemon[1580]: 1
2013 5 3 console-kit-daemon[1779]: 4
2013 5 7 console-kit-daemon[1861]: 1
2013 5 7 console-kit-daemon[1964]: 1
2013 5 7 gnome-session[1902]: 1
2013 5 7 gnome-session[1980]: 3
2013 5 7 gnome-session[2009]: 3
2013 5 7 gnome-session[2010]: 3
2013 5 7 gnome-session[2033]: 1
2013 5 7 init: 104
2013 5 6 init: 12
2013 5 3 init: 50
2013 5 7 kernel: 653
2013 5 6
...
....
.....
09. Sample email
---------------
From akhanolk@cdh-dev01.localdomain Mon Jul 15 00:34:39 2013
Return-Path: <akhanolk@cdh-dev01.localdomain>
X-Original-To: akhanolk@cdh-dev01
Delivered-To: akhanolk@cdh-dev01.localdomain
From: akhanolk@cdh-dev01.localdomain
To: akhanolk@cdh-dev01.localdomain
Subject: Status of workflow 0000011-130712212133144-oozie-oozi-W
Content-Type: text/plain; charset=us-ascii
Date: Mon, 15 Jul 2013 00:34:39 -0500 (CDT)
Status: R
The workflow 0000011-130712212133144-oozie-oozi-W completed successfully
10. Available at:
-----------------
http://hadooped.blogspot.com/2013/06/apache-oozie-part-2-workflow-hive-action.html

Oozie web console:

Screenshots of application execution:

[Screenshots: Oozie web console views of the sample program execution]
Comments:

  1. Hi,
    Can you please show me how to move files from a local directory to HDFS every hour using an Oozie workflow?

    I have tried the Oozie fs action, but it only works if I'm moving a file from one folder to another within HDFS.

    I want to move a file from a local directory to HDFS. Please help!

    Replies:
    1. Hi Khatutshelo-
      Check out my blog 13 on Oozie ssh action.

      Cheers,
      Anagha

    2. Hi Anagha, what is the difference between the SSH action and the shell action? When should I go for the SSH action over the shell action? Could you please give some idea on the same.
  2. I also have the same problem; could you please post the process?

  3. There are several avenues for ingesting data into HDFS.
    Here is a good discussion on this topic to help arrive at the best solution based on your requirements:
    http://www.linkedin.com/groups/Data-Ingestion-Into-Hadoop-3638279.S.199779955

  4. Hive + Oozie workflow problem.

    My task is to create an Oozie workflow to load data into Hive tables every hour.

    I am using Hue 2.3.0.

    When I run the command: LOAD DATA INPATH '/user/username1/data/data3.txt' INTO TABLE raw_data; it works perfectly; the data gets loaded into the Hive table.

    When I run the same command in the Oozie workflow, the job gets killed at 66% and the error message is: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]

    But when I replace "LOAD DATA INPATH '/user/username1/data/data3.txt' INTO TABLE raw_data;" with "create external table table_data (json string) LOCATION '/user/username_k/20130925_data';" the Oozie workflow works properly.
    May you please help.

  5. Khathutshelo, can you send me your scripts/code, directory structure, sample files, etc. so I can try to replicate? Thanks.
  6. Hi Anagha,

    I am using Hue 2.3.0.
    -----------------------------------
    What I did is:

    Create an external table1 where the location is the HDFS directory with data; it has 4 columns.

    Create another external table2 with 3 columns and a partition column.

    Write a script to insert into table2 with the partition from table1.
    The problem is the insert script works properly if I put it in the query editor, but throws a 'Table not found' error.

    table1 and table2 are both in the same database (adhoc).

    hive-script.q
    ---------------------
    INSERT OVERWRITE TABLE production1 PARTITION(dt, hour)
    SELECT a, b, datastring, hour FROM staging2;
  7. Since the partition name will be created automatically, I decided not to use the following method:
    Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-vms",year=2013, month=05)

    However, I make sure that the first table has one more column compared to the second table that has the partition; the first column of the first table becomes the partition name in the second table.

    E.g., table1 has 3 columns and table2 has 2 columns and a partition column.
  8. Hi Anagha,

    Thanks for the blog. Would you mind giving an example of creating a workflow wherein you can pass parameters, and also how to run it?

    Thanks for the wonderful blog and helping us out!

  9. Hi,

    I'm planning to pass the Oozie workflow ID as a parameter and input value to a Hive script; how do I do this? Please advise.

    create table wfl_tbl(date string,wf_id string);
    insert into wfl_tbl select '${wf:id()}', '2014-015-15' from dummy;

    I'm submitting the above from the Hue Oozie editor and it is taking ${wf:id()}. Please advise.

    Regards,
    R

  10. Hi Anagha,

    I am trying to create a directory based on the job_id created after running the Oozie job, using a shell script. Can you help me? Can we pass the JOB_ID as a parameter while executing the script? If so, how to do it?