Monday, June 17, 2013

Apache Oozie - Part 6: Oozie workflow with java main action


What's covered in the blog?

1. Documentation on the Oozie java action
2. A sample workflow that includes an Oozie java action to process some syslog-generated log files. Instructions on loading sample data and running the workflow are provided, along with some notes based on my experience.

Versions covered:
Oozie 3.3.0

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.



About the Oozie java main action

Excerpt from Apache Oozie documentation...
The java action will execute the public static void main(String[] args) method of the specified main Java class. Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task. The workflow job will wait until the Java application completes its execution before continuing to the next action. The java action has to be configured with the job-tracker, name-node, main Java class, JVM options and arguments.

To indicate an ok action transition, the main Java class must complete the main method invocation gracefully. To indicate an error action transition, the main Java class must throw an exception. The main Java class must not call System.exit(int n), as this will cause the java action to take an error transition regardless of the exit code used.
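For illustration, here is a minimal sketch of a main class that follows these rules; the class name and the argument check are hypothetical, not part of the Oozie spec:

public class MyJavaMainAction {

    public static void main(String[] args) throws Exception {
        // Returning normally from main() signals an ok transition to Oozie;
        // throwing any exception signals an error transition.
        // Never call System.exit(int) here - Oozie treats that as an error
        // transition regardless of the exit code.
        if (args.length < 1) {
            throw new IllegalArgumentException("Expected at least one argument");
        }
        // ... application logic goes here ...
    }
}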

A java action can be configured to perform HDFS file/directory cleanup before starting the Java application. This capability enables Oozie to retry a Java application after a transient or non-transient failure (it can be used to clean up any temporary data the Java application may have created before failing).

A java action can create a Hadoop configuration. The Hadoop configuration is made available to the Java application as a local file, named oozie-action.conf.xml, in its running directory. As with map-reduce and pig actions, it is possible to reference a job.xml file and to use inline configuration properties. For repeated configuration properties, later values override earlier ones.
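As a sketch of how a Java application might pick up this configuration (assuming the Hadoop client classes are on the classpath; the property name below is a hypothetical one set in the action's configuration block):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class LoadActionConf {

    public static void main(String[] args) throws Exception {
        // Start from an empty Configuration and add the action conf file
        // that Oozie materializes in the running directory.
        Configuration conf = new Configuration(false);
        File actionConf = new File("oozie-action.conf.xml");
        if (actionConf.exists()) {
            conf.addResource(new Path(actionConf.toURI()));
        }
        // "myWorkflow.inputDir" is a hypothetical inline property.
        System.out.println("input dir: " + conf.get("myWorkflow.inputDir"));
    }
}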

Inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker (job-tracker) and fs.default.name (name-node) properties must not be present in the job-xml or in the inline configuration.

As with map-reduce and pig actions, it is possible to add files and archives to be available to the Java application. Refer to the "Adding Files and Archives for the Job" section of the Oozie workflow specification.

The capture-output element can be used to propagate values back into the Oozie context, which can then be accessed via EL functions. The values need to be written out as a Java properties file; the file name is obtained via a system property specified by the constant JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE.
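A sketch of the Java side of capture-output follows; the system property name oozie.action.output.properties is what this constant resolves to in Oozie 3.x, but treat that as an assumption to verify against your version:

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class CaptureOutputMain {

    public static void main(String[] args) throws Exception {
        // Oozie passes the location of the capture-output file as a
        // system property (assumed here to be oozie.action.output.properties).
        String fileName = System.getProperty("oozie.action.output.properties");

        Properties props = new Properties();
        props.setProperty("recordCount", "42"); // illustrative value to propagate

        OutputStream os = new FileOutputStream(fileName);
        try {
            props.store(os, "");
        } finally {
            os.close();
        }
    }
}

A downstream action can then read the value with an EL function such as ${wf:actionData('javaMainAction')['recordCount']}, where javaMainAction is the name of this action node.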

IMPORTANT: Because the Java application is run from within a map-reduce job, from Hadoop 0.20 onwards a queue must be assigned to it. The queue name must be specified as a configuration property (mapred.job.queue.name).

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <java>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[JOB-XML]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <main-class>[MAIN-CLASS]</main-class>
            <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
            <arg>ARGUMENT</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
            <capture-output />
        </java>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The prepare element, if present, indicates a list of paths to delete before starting the Java application. It should be used exclusively for directory cleanup prior to executing the Java application.

The java-opts element, if present, contains the command-line parameters used to start the JVM that will execute the Java application. Using this element is equivalent to using the mapred.child.java.opts configuration property.

The arg elements, if present, contain arguments for the main function. The value of each arg element is treated as a single argument, and they are passed to the main method in the order in which they appear.

All the above elements can be parameterized (templatized) using EL expressions.


Apache Oozie documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.7_Java_Action


Sample workflow application

Highlights:
For this exercise, I have loaded some syslog-generated logs to HDFS and am running a Java map-reduce program through Oozie as a java action (not a map-reduce action) to produce a report on the logs.
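Below is a minimal sketch of the kind of driver used in this pattern; the class names, syslog field positions, and parsing logic are illustrative assumptions on my part, not the exact code from this workflow. The key point is the main method: it throws on failure instead of calling System.exit(), which is what makes it safe to run as an Oozie java action.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SyslogEventCountDriver extends Configured implements Tool {

    // Hypothetical mapper: emits (event-name, 1) for each syslog line.
    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text event = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed syslog layout: "May  3 11:52:54 host event[pid]: message"
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 4) {
                // Strip the "[pid]:" suffix, e.g. "NetworkManager[1459]:" -> "NetworkManager"
                event.set(fields[4].replaceAll("\\[\\d+\\]:?$|:$", ""));
                context.write(event, ONE);
            }
        }
    }

    // Sums the counts per event name.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "syslog-event-count");
        job.setJarByClass(SyslogEventCountDriver.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int rc = ToolRunner.run(new Configuration(), new SyslogEventCountDriver(), args);
        // Throw rather than call System.exit() so the Oozie java action
        // takes its error transition when the job fails.
        if (rc != 0) {
            throw new RuntimeException("Map-reduce job failed, return code=" + rc);
        }
    }
}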

Pictorial representation of the workflow:

[workflow diagram]

Components of workflow application:

[listing of workflow application components]

Workflow application:

[workflow application files]

Oozie web console - screenshots:

[Oozie web console screenshots]
References


Map reduce cookbook
https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

How to use a sharelib in Oozie
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/

Everything you wanted to know but were afraid to ask about Oozie
http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

Oozie workflow use cases
https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases 





Comments:

  1. This tutorial is GREEEAAAAT. Excellent layout, data sample and educational instructions. I have adapted your code to run with Hortonworks HDP 2.1 (Hadoop 2.4.0), with a few changes from your implementation: 1) gzip all the syslog files into a single .gz file for the mapper input (instead of the 37 files and 20 folders in the original uncompressed sample); 2) simplify the map output key (2013 can no longer be extracted from the FileSplit, as the input is now a single .gz file).

    BTW, in the map method, to remove the [nnnn] from the event name (example: "NetworkManager[1459]" becomes "NetworkManager"), I think this code is shorter than testing for "[" and taking a substring: objPatternMatcher.group(5).replaceAll("\\[\\d+\\]", "")
  2. Hi Anagha,
    Thanks a lot for the wonderful post! Your blogs are easy to understand and the code examples are very helpful. Great stuff.

    I need one suggestion regarding logging in the map-reduce driver class: when I call the driver class as a java action, everything runs fine, but the logger statements are not visible in the Oozie Job Log tab. Is there anything to configure in order to see these logs? Any suggestions/hints?

    Thanks