What's covered in the blog?
1. Documentation on the Oozie java action
2. A sample workflow that includes an Oozie java action to process some syslog-generated log files. Instructions on loading the sample data and running the workflow are provided, along with some notes based on my experience.
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with Oozie workflows
Blog 11b: Oozie Web Service API for interfacing with Oozie workflows
If you want to share your thoughts/updates, email me at email@example.com.
About the Oozie java main action
Excerpt from Apache Oozie documentation...
The java action will execute the public static void main(String[] args) method of the specified main Java class. Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task. The workflow job will wait until the Java application completes its execution before continuing to the next action. The java action has to be configured with the job-tracker, name-node, main Java class, JVM options, and arguments.
To indicate an ok action transition, the main Java class must complete the main method invocation gracefully. To indicate an error action transition, the main Java class must throw an exception. The main Java class must not call System.exit(int n), as this will make the java action do an error transition regardless of the exit code used.
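The ok/error contract above can be sketched with a minimal main class. This is a hypothetical example (the class name and argument check are illustrative, not from the Oozie docs): returning normally from main signals an ok transition, while throwing signals an error transition.

```java
// Hypothetical main class for an Oozie java action.
// Returning normally from main() -> ok transition.
// Throwing an exception -> error transition.
// Never call System.exit(): it forces an error transition regardless of code.
public class SampleOozieMain {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            // Throwing signals an error transition to Oozie
            throw new IllegalArgumentException("expected at least one argument");
        }
        System.out.println("processing " + args[0]);
        // Falling off the end of main() signals an ok transition
    }
}
```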
A java action can be configured to perform HDFS files/directories cleanup before starting the Java application. This capability enables Oozie to retry a Java application after a transient or non-transient failure (it can be used to clean up any temporary data which may have been created by the Java application in case of failure).
A java action can create a Hadoop configuration. The Hadoop configuration is made available as a local file to the Java application in its running directory; the file name is oozie-action.conf.xml. Similar to map-reduce and pig actions, it is possible to refer to a job-xml file and to use inline configuration properties. For repeated configuration properties, later values override earlier ones.
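As a rough illustration of what that local file contains, the sketch below reads the name/value pairs from a Hadoop-style configuration file using only the JDK. In a real action you would normally load oozie-action.conf.xml with Hadoop's Configuration class; this standalone reader (class and method names are my own, not from Oozie) just shows the structure, including the later-values-override-earlier-ones behavior.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Minimal sketch: read name/value pairs from a Hadoop-style configuration
// file (e.g. oozie-action.conf.xml) with the JDK's DOM parser only.
public class ActionConfReader {

    public static Map<String, String> read(File confFile) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(confFile);
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0)
                    .getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0)
                    .getTextContent().trim();
            props.put(name, value); // later values override earlier ones
        }
        return props;
    }
}
```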
Inline property values can be parameterized (templatized) using EL expressions.
The Hadoop mapred.job.tracker (job-tracker) and fs.default.name (name-node) properties must not be present in the job-xml or in the inline configuration.
As with map-reduce and pig actions, it is possible to add files and archives to be available to the Java application. Refer to the section "Adding Files and Archives for the Job".
The capture-output element can be used to propagate values back into the Oozie context, which can then be accessed via EL functions. The values must be written out as a Java properties format file. The file name is obtained via a System property specified by the constant JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE.
IMPORTANT: Because the Java application is run from within a map-reduce job, from Hadoop 0.20 onwards a queue must be assigned to it. The queue name must be specified as a configuration property.
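A minimal sketch of writing capture-output values follows. It assumes the system property name "oozie.action.output.properties" (the value behind the JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE constant); the key and value being propagated are illustrative.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

// Minimal sketch of writing capture-output values from a java action.
// Oozie's launcher passes the target file name through a system property
// (assumed here to be "oozie.action.output.properties").
public class CaptureOutputMain {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("recordCount", "42"); // hypothetical value to propagate

        // File name comes from the system property set by Oozie's launcher
        String fileName = System.getProperty("oozie.action.output.properties");
        try (OutputStream os = new FileOutputStream(new File(fileName))) {
            props.store(os, ""); // Java properties format, as Oozie expects
        }
    }
}
```

A downstream action could then read the value with an EL expression such as ${wf:actionData('java-node')['recordCount']}, where 'java-node' is the (hypothetical) name of the java action node.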
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <java>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[JOB-XML]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <main-class>[MAIN-CLASS]</main-class>
            <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
            <arg>ARGUMENT</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
            <capture-output/>
        </java>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
The prepare element, if present, indicates a list of paths to delete before starting the Java application. This should be used exclusively for directory cleanup before the Java application is executed.
The java-opts element, if present, contains the command-line parameters used to start the JVM that will execute the Java application. Using this element is equivalent to using the mapred.child.java.opts configuration property.
The arg elements, if present, contain arguments for the main method. The value of each arg element is considered a single argument, and they are passed to the main method in the same order.
All the above elements can be parameterized (templatized) using EL expressions.
Apache Oozie documentation:
Sample workflow application
For this exercise, I have loaded some syslog-generated logs to HDFS and am running a Java map-reduce program through Oozie as a java action (not a map-reduce action) to run a report on the logs.
Pictorial representation of the workflow:
Oozie web console - screenshots:
How to use a sharelib in Oozie
Oozie workflow use cases