Wednesday, July 3, 2013

Apache Oozie - Part 5: Oozie workflow with streaming map reduce (python) action

1.0. What's covered in the blog?

1. Documentation on the Oozie mapreduce streaming action
2. A sample Oozie workflow that includes a map-reduce streaming action to process syslog-generated log files using Python regular expressions. Instructions on loading the sample data and running the workflow are provided, along with some notes based on my learnings.

Version:
Oozie 3.3.0

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

2.0. About the Oozie map-reduce streaming action

Apache documentation at: http://archive.cloudera.com/cdh4/cdh/4/oozie/WorkflowFunctionalSpec.html#a3.2.2.2_Streaming


Excerpts from Apache documentation....

2.0.1. Map-Reduce Action

The map-reduce action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or streaming jobs.

A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry without cleanup of the job output directory would fail).
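For example, the workflow later in this post (section 07) uses a prepare block to delete the output directory so that the action can be retried or re-run cleanly:

<prepare>
    <delete path="${outputDir}"/>
</prepare>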

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.

The counters of the Hadoop job and the job exit status (FAILED, KILLED, or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used from within decision nodes and other actions' configurations.
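For instance, the hadoop:counters() EL function exposes a finished action's counters to later nodes. A decision node predicate could reference a value along these lines (the action name and counter group/name here are illustrative, not taken from this post's workflow):

${hadoop:counters('streamingAction')['org.apache.hadoop.mapred.Task$Counter']['REDUCE_OUTPUT_RECORDS']}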

The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.

Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.

The configuration properties are loaded in the following order: streaming, job-xml, and configuration; later values override earlier values.

Streaming and inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.


2.0.2. Adding Files and Archives for the Job

The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.

Files specified with the file element will be symbolic links in the home directory of the task.

If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running directory, and thus available to the task JVM.

To force a symlink for a file on the task running directory, use a '#' followed by the symlink name. For example 'mycat.sh#cat'.
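In the workflow XML this looks like the following (path hypothetical); the task can then invoke the script simply as ./cat:

<file>tools/mycat.sh#cat</file>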

Refer to the Hadoop distributed cache documentation for more details on files and archives.


2.0.3. Streaming

Streaming information can be specified in the streaming element.

The mapper and reducer elements are used to specify the executable/script to be used as mapper and reducer.

User-defined scripts must be bundled with the workflow application and declared in the files element of the streaming configuration. If they are not declared in the files element of the configuration, it is assumed they are already available (and on the command PATH) on the Hadoop slave machines.

Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.

The mapper/reducer can be overridden by the mapred.mapper.class or mapred.reducer.class properties in the job-xml file or configuration elements.
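Tying this together, the workflow in this post (full listing in section 07) declares the streaming scripts as follows, bundling them via file elements so they are symlinked into each task's working directory:

<streaming>
    <mapper>python LogParserMapper.py</mapper>
    <reducer>python LogParserReducer.py</reducer>
</streaming>
...
<file>${appPath}/LogParserMapper.py#LogParserMapper.py</file>
<file>${appPath}/LogParserReducer.py#LogParserReducer.py</file>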


3.0. Sample workflow application

This gist includes the Oozie workflow components (streaming map-reduce action) to execute
Python mapper and reducer scripts that parse syslog-generated log files using regular expressions.
Use case: count the number of occurrences of logged processes, by month and process.
Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Python mapper script: 04A-PythonMapperScript
Python reducer script: 04B-PythonReducerScript
Python script test commands: 05-PythonScriptTest
Oozie job properties: 06-JobProperties
Oozie workflow: 07-OozieWorkflowXML
Oozie job execution commands: 08-OozieCommands
Oozie job output: 09-Output
Oozie web GUI: 10-OozieWebGUIScreenshots
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
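The mapper script (section 04A) extracts these fields with a single regular expression. Here is a quick standalone sketch, runnable outside Hadoop, showing how the pattern decomposes the first sample line:

#!/usr/bin/env python
import re

data_pattern = r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)"
line = "May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal"
m = re.search(data_pattern, line)
if m:
    # Prints: May | 3 | 11:52:54 | cdh-dn03 | init:
    print ' | '.join(m.groups()[:5])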
Data download
-------------
Github:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you have access issues.
Directory structure applicable for this blog:
---------------------------------------------
oozieProject
    data
        airawat-syslog
            <<node>>
                <<year>>
                    <<month>>
                        messages
    workflowStreamingMRActionPy
        workflow.xml
        job.properties
        LogParserMapper.py
        LogParserReducer.py
Load downloaded files to HDFS
------------------------------
--Create project directory
$ hadoop fs -mkdir oozieProject
--Deploy data and workflow
$ hadoop fs -put oozieProject/* oozieProject/
Directory structure on HDFS
----------------------------
$ hadoop fs -ls -R oozieProject/workflowStreamingMRActionPy | awk '{print $8}'
/* **************************************** */
/* Python mapper script: LogParserMapper.py */
/* **************************************** */
#!/usr/bin/env python
import sys
import re

sys.path.append('.')

# Pattern captures: month, day, time, node, process, log message
data_pattern = r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)"
regex_obj = re.compile(data_pattern)

# Read all lines from stdin
for strLineRead in sys.stdin:
    # Remove leading and trailing whitespace
    strLineRead = strLineRead.strip()
    # Match the line against the pattern
    parsed_log = regex_obj.search(strLineRead)
    if parsed_log:
        # Emit key-value pair: "<month>-<process>" and a count of 1
        print '%s\t%s' % (parsed_log.group(1) + "-" + parsed_log.group(5), "1")
        # For reference: group(1)=month, group(2)=day, group(3)=time,
        # group(4)=node, group(5)=process, group(6)=log message
/* ****************************************** */
/* Python reducer script: LogParserReducer.py */
/* ****************************************** */
#!/usr/bin/env python
import sys

sys.path.append('.')

eventCountArray = {}

# Input comes from STDIN (the sorted mapper output)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Parse the key-value pair emitted by the mapper
    event, count = line.split('\t', 1)
    # Cast count to int; skip the line if the count is not a number
    try:
        count = int(count)
    except ValueError:
        continue
    # Accumulate the count for this event
    try:
        eventCountArray[event] = eventCountArray[event] + count
    except KeyError:
        eventCountArray[event] = count

# Write the results (unsorted) to stdout
for event in eventCountArray.keys():
    print '%s\t%s' % (event, eventCountArray[event])
#---------------------------------------------
# Testing the python scripts outside of oozie
#---------------------------------------------
#Test the mapper from the directory where the data is located:
cat oozieProject/data/*/*/*/*/* | python oozieProject/workflowStreamingMRActionPy/LogParserMapper.py
#Test mapper and reducer
cat oozieProject/data/*/*/*/*/* | python oozieProject/workflowStreamingMRActionPy/LogParserMapper.py | sort | python oozieProject/workflowStreamingMRActionPy/LogParserReducer.py | sort
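#For the first sample line in section 01, the mapper emits the tab-separated pair "May-init:" and "1";
#after the sort, the reducer folds all such pairs into per-month, per-process totals like those under 09-Output.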
#Delete prior copy of scripts
hadoop fs -rm -R oozieProject/workflowStreamingMRActionPy/
#Load application, if not already done..
hadoop fs -put ~/oozieProject/workflowStreamingMRActionPy/ oozieProject/
#Run on cluster (update paths as needed)
hadoop jar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar \
  -jobconf mapred.reduce.tasks=1 \
  -file oozieProject/workflowStreamingMRActionPy/LogParserMapper.py \
  -mapper oozieProject/workflowStreamingMRActionPy/LogParserMapper.py \
  -file oozieProject/workflowStreamingMRActionPy/LogParserReducer.py \
  -reducer oozieProject/workflowStreamingMRActionPy/LogParserReducer.py \
  -input oozieProject/data/*/*/*/*/* \
  -output oozieProject/workflowStreamingMRActionPy/output-streaming-manualRun
#View output
$ hadoop fs -ls -R oozieProject/workflowStreamingMRActionPy/output-streaming-manualRun/part* | awk '{print $8}' | xargs hadoop fs -cat
May-spice-vdagent[2020]: 1
May-ntpd_initres[997]: 3
May-nm-dispatcher.action: 4
May-NetworkManager[1232]: 1
May-init: 166
............
# -------------------------------------------------
# This is the job properties file - job.properties
# -------------------------------------------------
# Replace name node and job tracker information with that specific to your cluster
nameNode=hdfs://cdh-nn01.hadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}/oozieProject
appPath=${oozieProjectRoot}/workflowStreamingMRActionPy
oozie.wf.application.path=${appPath}
oozieLibPath=${oozie.libpath}
inputDir=${oozieProjectRoot}/data/*/*/*/*/*
outputDir=${appPath}/output
<!-- ==================================== -->
<!-- Oozie workflow file: workflow.xml    -->
<!-- ==================================== -->
<workflow-app name="WorkflowStreamingMRAction-Python" xmlns="uri:oozie:workflow:0.1">
    <start to="streamingAction"/>
    <action name="streamingAction">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <streaming>
                <mapper>python LogParserMapper.py</mapper>
                <reducer>python LogParserReducer.py</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>oozie.libpath</name>
                    <value>${oozieLibPath}/mapreduce-streaming</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>1</value>
                </property>
            </configuration>
            <file>${appPath}/LogParserMapper.py#LogParserMapper.py</file>
            <file>${appPath}/LogParserReducer.py#LogParserReducer.py</file>
        </map-reduce>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>
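Note how the pieces fit together: the file elements ship the two scripts into each task's working directory under their '#' symlink names, which is why the mapper and reducer commands can refer to them simply as "python LogParserMapper.py" and "python LogParserReducer.py". The prepare block deletes ${outputDir} up front so that re-runs do not fail on a pre-existing output directory.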
08. Oozie commands
-------------------
Note: Replace the Oozie server and port with those specific to your cluster.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowStreamingMRActionPy/job.properties -submit
job: 0000017-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000017-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000017-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000017-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000017-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowStreamingMRActionPy/job.properties -rerun 0000017-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000017-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000017-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
Output
-------
$ hadoop fs -ls -R oozieProject/workflowStreamingMRActionPy/output/part-* | awk '{print $8}' | xargs hadoop fs -cat
May-gnome-session[2010]: 3
May-ntpd_initres[1872]: 24
May-ntpd_initres[1705]: 792
May-spice-vdagent[2114]: 1
May-pulseaudio[2135]: 1
May-pulseaudio[2076]: 1
May-spice-vdagent[1974]: 1
May-ntpd_initres[1720]: 792
May-ntpd_initres[1084]: 6
May-spice-vdagent[2109]: 1
May-pulseaudio[2257]: 1
May-NetworkManager[1292]: 1
May-pulseaudio[2032]: 1
May-kernel: 810
May-ntpd_initres[1592]: 798
May-NetworkManager[1342]: 1
May-polkit-agent-helper-1[2036]: 8
May-spice-vdagent[1955]: 1
May-console-kit-daemon[1779]: 4
May-spice-vdagent[2016]: 1
May-ntpd_initres[1026]: 9
May-pulseaudio[2039]: 1
May-gnome-session[2009]:
...........
Oozie web GUI screenshots
--------------------------
Available at:
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html