1.0. What's covered in the blog?
1. Documentation on the Oozie mapreduce streaming action
2. A sample Oozie workflow that includes a mapreduce streaming action to process syslog-generated log files using Python regular expressions. Instructions on loading sample data and running the workflow are provided, along with some notes based on my learnings.
Versions: Oozie 3.3.0; Pig 0.10.0
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows
If you want to share your thoughts/updates, email me at firstname.lastname@example.org.
2.0. About the Oozie map-reduce streaming action
Apache documentation at: http://archive.cloudera.com/cdh4/cdh/4/oozie/WorkflowFunctionalSpec.html#a188.8.131.52_Streaming
Excerpts from Apache documentation....
2.0.1. Map-Reduce Action
The map-reduce action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or streaming jobs.
A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry without cleanup of the job output directory would fail).
The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.
The counters of the Hadoop job and the job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used from within decision nodes and other action configurations.
The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.
Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.
The configuration properties are loaded in the following order: streaming, job-xml and configuration. Later values override earlier values.
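The override order can be pictured as layered dictionaries. This is only a rough model of the rule quoted above, not Oozie's actual loading code; the property values (and the `example.only.in.job.xml` name) are made up for illustration.

```python
def effective_properties(streaming, job_xml, configuration):
    """Merge the three sources in load order; later layers override earlier ones."""
    merged = {}
    for layer in (streaming, job_xml, configuration):
        merged.update(layer)
    return merged

# Hypothetical values, chosen only to show which layer wins.
streaming_props = {"mapred.reduce.tasks": "1"}
job_xml_props = {"mapred.reduce.tasks": "2", "example.only.in.job.xml": "kept"}
inline_props = {"mapred.reduce.tasks": "4"}

resolved = effective_properties(streaming_props, job_xml_props, inline_props)
print(resolved["mapred.reduce.tasks"])      # the inline configuration value wins
print(resolved["example.only.in.job.xml"])  # untouched by later layers
```

In this model, a value set in the inline configuration element always takes precedence over the same property in the job-xml file.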
Streaming and inline property values can be parameterized (templatized) using EL expressions.
The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.
2.0.2. Adding Files and Archives for the Job
The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.
Files specified with the file element will be symbolic links in the home directory of the task.
If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running directory, thus available to the task JVM.
To force a symlink for a file on the task running directory, use a '#' followed by the symlink name. For example 'mycat.sh#cat'.
Refer to the Hadoop distributed cache documentation for more details on files and archives.
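To make the path and symlink rules concrete, here is a small sketch of how a file element value could be interpreted. It is my own illustration, not Oozie code: the application directory is hypothetical, and I am assuming that a value without a '#' is linked under the file's own name.

```python
import posixpath

def resolve_path(path, app_dir):
    """Relative paths are taken to live under the application directory."""
    # Simplification: only a leading '/' is treated as absolute here.
    if path.startswith("/"):
        return path
    return posixpath.join(app_dir, path)

def parse_file_spec(spec, app_dir):
    """Split 'path#link' into (resolved_path, link_name)."""
    path, _, fragment = spec.partition("#")
    # Assumption: without a '#', the link is named after the file itself.
    link = fragment or posixpath.basename(path)
    return resolve_path(path, app_dir), link

APP_DIR = "/user/someuser/streaming-app"  # hypothetical application directory

print(parse_file_spec("mycat.sh#cat", APP_DIR))
print(parse_file_spec("/shared/lib/parser.py", APP_DIR))
```

With 'mycat.sh#cat', the task would see the bundled mycat.sh under the name cat in its working directory.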
2.0.3. Streaming
Streaming information can be specified in the streaming element.
The mapper and reducer elements are used to specify the executable/script to be used as mapper and reducer.
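For a feel of what such scripts do, below is a minimal sketch of a mapper and reducer pair for syslog lines, written as plain functions so it can be tried without a cluster. It is not the code from the sample application. The syslog layout in the regex, the 'month-process' key and the sample lines are all assumptions of mine.

```python
import re
from itertools import groupby

# Assumed syslog layout: "May  3 11:52:54 host process[pid]: message"
SYSLOG_RE = re.compile(
    r"^(\w{3})\s+\d{1,2}\s+[\d:]{8}\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:"
)

def map_lines(lines):
    """Mapper: emit 'month-process<TAB>1' for every line that matches."""
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match:
            month, process = match.groups()
            yield "%s-%s\t1" % (month, process)

def reduce_lines(lines):
    """Reducer: sum the counts per key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))

# Made-up sample lines; sorted() stands in for the shuffle/sort phase.
sample = [
    "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process ended",
    "May  3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May  3 11:53:35 cdh-dn03 sshd[1234]: Accepted password",
    "May  3 11:54:02 cdh-dn03 init: tty (/dev/tty5) main process ended",
    "this line is not syslog and is skipped",
]
for record in reduce_lines(sorted(map_lines(sample))):
    print(record)
```

In an actual streaming job, each function would live in its own script that reads sys.stdin and prints to stdout, since Hadoop streaming passes records to the mapper and reducer through standard input and output.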
User-defined scripts must be bundled with the workflow application and they must be declared in the files element of the streaming configuration. If they are not declared in the files element of the configuration, it is assumed they will be available on the Hadoop slave machines (and in the command PATH).
Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.
The Mapper/Reducer can be overridden by the mapred.mapper.class or mapred.reducer.class properties in the job-xml file or the configuration element.
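To show how these pieces fit together, here is a rough sketch that assembles a streaming map-reduce action with Python's ElementTree. The element names follow the Oozie workflow schema as I understand it; the action name, script names, transition targets and the ${...} parameters are placeholders I chose for the example, not values from the sample application.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def streaming_action(name, mapper, reducer, files, properties):
    """Build an <action> element wrapping a streaming map-reduce job."""
    action = ET.Element("action", name=name)
    mr = ET.SubElement(action, "map-reduce")
    # Job tracker and name node get their own elements, not properties.
    ET.SubElement(mr, "job-tracker").text = "${jobTracker}"
    ET.SubElement(mr, "name-node").text = "${nameNode}"
    streaming = ET.SubElement(mr, "streaming")
    ET.SubElement(streaming, "mapper").text = mapper
    ET.SubElement(streaming, "reducer").text = reducer
    config = ET.SubElement(mr, "configuration")
    for key, value in sorted(properties.items()):
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    # Scripts bundled with the application are declared one per file element.
    for path in files:
        ET.SubElement(mr, "file").text = path
    ET.SubElement(action, "ok", to="end")      # hypothetical transition targets
    ET.SubElement(action, "error", to="fail")
    return action

action = streaming_action(
    name="streaming-node",                      # hypothetical names throughout
    mapper="python mapper.py",
    reducer="python reducer.py",
    files=["mapper.py", "reducer.py"],
    properties={
        "mapred.input.dir": "${inputDir}",
        "mapred.output.dir": "${outputDir}",
    },
)
xml_text = ET.tostring(action, encoding="unicode")
print(minidom.parseString(xml_text).toprettyxml(indent="  "))
```

Generating the XML this way is just a convenience for the example; in practice the action is usually written by hand in workflow.xml.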
3.0. Sample workflow application
Screenshots from application execution