Tuesday, June 18, 2013

Apache Oozie - Part 11: Java API for interfacing with Oozie workflows


What's covered in the blog?

1. Documentation on the Oozie Java API
2. A sample Java program that runs an Oozie workflow (with a java action) to process some syslog-generated log files.  Instructions on loading sample data and workflow files, and on running the workflow, are provided, along with some notes based on my learnings.

Version:
Oozie 3.3.0;

Related blogs:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

1.0. About the Oozie java API

Oozie provides a Java Client API that simplifies integrating Oozie with Java applications; it is a convenience wrapper around the Oozie Web Services API.
The following code snippet shows how to submit an Oozie job using the Java Client API.

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

import java.util.Properties;

    ...

    // get an OozieClient for the local Oozie server
    OozieClient wc = new OozieClient("http://bar:11000/oozie");

    // create a workflow job configuration and set the workflow application path
    Properties conf = wc.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://foo:8020/usr/tucu/my-wf-app");

    // set workflow parameters
    conf.setProperty("jobTracker", "foo:8021");
    conf.setProperty("inputDir", "/usr/tucu/inputdir");
    conf.setProperty("outputDir", "/usr/tucu/outputdir");
    ...

    // submit and start the workflow job
    String jobId = wc.run(conf);
    System.out.println("Workflow job submitted");

    // wait until the workflow job finishes, printing the status every 10 seconds
    while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
        System.out.println("Workflow job running ...");
        Thread.sleep(10 * 1000);
    }

    // print the final status of the workflow job
    System.out.println("Workflow job completed ...");
    System.out.println(wc.getJobInfo(jobId));
    ...

Source of the documentation above:
http://archive.cloudera.com/cdh/3/oozie/DG_Examples.html#Java_API_Example


2.0. Exercise

The Java program below calls the workflow built in my blog 6 (Oozie workflow with java main action), which includes the Java code, workflow-related files, and sample data.

2.0.1. Sample data and sample workflow


2.0.2. Sample Java program to call workflow

Note: Ensure you replace the cluster-specific configuration values (server host names, ports, and HDFS paths) in the program below with your own cluster's configuration.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class myOozieWorkflowJavaAPICall {

    public static void main(String[] args) {
        // get an OozieClient pointed at the Oozie server
        OozieClient wc = new OozieClient("http://cdh-dev01:11000/oozie");

        // create a workflow job configuration and set the workflow application path
        Properties conf = wc.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://cdh-nn01.hadoop.com:8020/user/airawat/oozieProject/javaApplication/workflow.xml");

        // workflow parameters and Oozie settings
        conf.setProperty("jobTracker", "cdh-jt01:8021");
        conf.setProperty("nameNode", "hdfs://cdh-nn01.hadoop.com:8020");
        conf.setProperty("queueName", "default");
        conf.setProperty("airawatOozieRoot", "hdfs://cdh-nn01.hadoop.com:8020/user/airawat/oozieProject/javaApplication");
        conf.setProperty("oozie.libpath", "hdfs://cdh-nn01.hadoop.com:8020/user/oozie/share/lib");
        conf.setProperty("oozie.use.system.libpath", "true");
        conf.setProperty("oozie.wf.rerun.failnodes", "true");

        try {
            // submit and start the workflow job
            String jobId = wc.run(conf);
            System.out.println("Workflow job, " + jobId + " submitted");

            // poll every 10 seconds until the workflow job finishes
            while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
                System.out.println("Workflow job running ...");
                Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow job completed ...");
            System.out.println(wc.getJobInfo(jobId));
        } catch (Exception r) {
            System.out.println("Errors " + r.getLocalizedMessage());
        }
    }
}
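
Note: To compile and run this program, the Oozie client library (and its dependencies) must be on the classpath. A minimal sketch of the Maven dependency, assuming the client version is kept in line with the Oozie server (3.3.0 here):

<!-- Maven dependency for the Oozie Java client (version assumed to match the server) -->
<dependency>
    <groupId>org.apache.oozie</groupId>
    <artifactId>oozie-client</artifactId>
    <version>3.3.0</version>
</dependency>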

2.0.3. Program output

Workflow job, 0000081-130613112811513-oozie-oozi-W submitted
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job running ...
Workflow job completed ...
Workflow id[0000081-130613112811513-oozie-oozi-W] status[SUCCEEDED]

2.0.4. Oozie web console

http://YourOozieServer:TypicallyPort11000/oozie

Monday, June 17, 2013

Apache Oozie - Part 6: Oozie workflow with java main action


What's covered in the blog?

1. Documentation on the Oozie java action
2. A sample workflow that includes an Oozie java action to process some syslog-generated log files.  Instructions on loading sample data and running the workflow are provided, along with some notes based on my learnings.

Versions covered:
Oozie 3.3.0;

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.



About the Oozie java main action

Excerpt from Apache Oozie documentation...
The java action will execute the public static void main(String[] args) method of the specified main Java class.  Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task.  The workflow job will wait until the Java application completes its execution before continuing to the next action.  The java action has to be configured with the job-tracker, name-node, main Java class, JVM options, and arguments.

To indicate an ok action transition, the main Java class must complete the main method invocation gracefully.  To indicate an error action transition, the main Java class must throw an exception.  The main Java class must not call System.exit(int n), as this will make the java action perform an error transition regardless of the exit code used.

A java action can be configured to perform HDFS files/directories cleanup before starting the Java application. This capability enables Oozie to retry a Java application in the situation of a transient or non-transient failure (This can be used to cleanup any temporary data which may have been created by the Java application in case of failure).

A java action can create a Hadoop configuration. The Hadoop configuration is made available as a local file to the Java application in its running directory; the file name is oozie-action.conf.xml. Similar to map-reduce and pig actions, it is possible to refer to a job.xml file and to use inline configuration properties. For repeated configuration properties, later values override earlier ones.

Inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker (job-tracker) and fs.default.name (name-node) properties must not be present in the job-xml or in the inline configuration.

As with map-reduce and pig actions, it is possible to add files and archives to be made available to the Java application. Refer to the "Adding Files and Archives for the Job" section of the Oozie documentation.

The capture-output element can be used to propagate values back into the Oozie context, which can then be accessed via EL functions. The values need to be written out as a Java properties file; the file name is obtained via a system property specified by the constant JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE.
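
As an illustrative sketch of capture-output (the node names, main class, and property key below are placeholders I made up, not part of the Oozie documentation or of the sample application in this post):

<action name="parseLogs">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <main-class>com.example.LogParserMain</main-class>
        <capture-output/>
    </java>
    <ok to="useParserOutput"/>
    <error to="killJobAction"/>
</action>
<!-- The main class writes a Java properties file to the path given by the
     capture-output system property; a downstream node can then read a value
     through EL, e.g. ${wf:actionData('parseLogs')['someKey']} -->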

IMPORTANT: Because the Java application is run from within a map-reduce job, from Hadoop 0.20 onwards a queue must be assigned to it. The queue name must be specified as a configuration property.

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <java>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[JOB-XML]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <main-class>[MAIN-CLASS]</main-class>
            <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
            <arg>ARGUMENT</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
            <capture-output />
        </java>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The prepare element, if present, indicates a list of paths to delete before starting the Java application. This should be used exclusively for directory cleanup for the Java application to be executed.

The java-opts element, if present, contains the command line parameters to be used to start the JVM that will execute the Java application. Using this element is equivalent to using the mapred.child.java.opts configuration property.

The arg elements, if present, contain arguments for the main function. The value of each arg element is treated as a single argument, and they are passed to the main method in the order given.

All the above elements can be parameterized (templatized) using EL expressions.
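
To make the syntax concrete, here is a minimal, illustrative java action; the class name, paths, and node names are placeholders, not the workflow application used later in this post:

<action name="javaMainAction">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- clean up the output directory so a retry starts fresh -->
            <delete path="${nameNode}/user/${wf:user()}/output"/>
        </prepare>
        <configuration>
            <!-- a queue is required because the Java class runs inside a map-reduce job -->
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <main-class>com.example.LogReportMain</main-class>
        <java-opts>-Xmx512m</java-opts>
        <arg>${inputDir}</arg>
        <arg>${outputDir}</arg>
    </java>
    <ok to="end"/>
    <error to="killJobAction"/>
</action>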


Apache Oozie documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.7_Java_Action


Sample workflow application

Highlights:
For this exercise, I have loaded some syslog-generated logs into HDFS and am running a Java map-reduce program through Oozie as a java action (not a map-reduce action) to produce a report on the logs.

Pictorial representation of the workflow:

Components of workflow application:

Workflow application:




Oozie web console - screenshots:


References


Map reduce cookbook
https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

How to use a sharelib in Oozie
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/

Everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie
http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

Oozie workflow use cases
https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases 





Apache Oozie - Part 4: Oozie workflow with java map-reduce action


What's covered in the blog?

1. Documentation on the Oozie map-reduce action
2. A sample workflow that includes an Oozie map-reduce action to process some syslog-generated log files.  Instructions on loading sample data and running the workflow are provided, along with some notes based on my learnings.

Versions covered:
Oozie 3.3.0; Map reduce new API

Related blogs:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another
Blog 13: Oozie workflow - SSH action

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

About the Oozie MapReduce action
Excerpt from Apache Oozie documentation...

The map-reduce action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or streaming jobs.

A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry without cleanup of the job output directory would fail).

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.

The counters of the Hadoop job and the job exit status (FAILED, KILLED, or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used from within decision nodes and other action configurations.

The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.

Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.

The configuration properties are loaded in the following order: streaming, job-xml, and configuration; later values override earlier ones.

Streaming and inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.
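
As an illustrative sketch of the above (the mapper/reducer class names and paths are placeholders; the new-api flags are the commonly used way to run new-API jobs from Oozie):

<action name="javaMapReduceAction">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- delete the output directory so a retried job does not fail -->
            <delete path="${nameNode}${outputDir}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.mapper.new-api</name>
                <value>true</value>
            </property>
            <property>
                <name>mapred.reducer.new-api</name>
                <value>true</value>
            </property>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
            <property>
                <name>mapreduce.map.class</name>
                <value>com.example.LogEventCountMapper</value>
            </property>
            <property>
                <name>mapreduce.reduce.class</name>
                <value>com.example.LogEventCountReducer</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
            <!-- output key/value classes and input/output formats would also be set here -->
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="killJobAction"/>
</action>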


Apache Oozie documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.2_Map-Reduce_Action


Components of a workflow with java map reduce action:



Sample workflow

Highlights

The sample workflow application runs a Java map-reduce program that parses syslog-generated log files in HDFS and generates a report on them.

The following is a pictorial representation of the workflow.


Workflow application details


Oozie web GUI - screenshots

http://YourOozieServer:TypicallyPort11000/oozie/






Do share if you have any additional insights that can be added to the blog.

References


Map reduce cookbook
https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

How to use a sharelib in Oozie
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/

Everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

Oozie workflow use cases

https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases 






Apache Oozie - Part 3: Workflow with sqoop action (hive to mysql)

What's covered in the blog?

I have covered this topic in my blog-
Apache Sqoop - Part 5: Scheduling Sqoop jobs in Oozie
[Versions: Oozie 3.3.0, Sqoop (1.4.2) with Mysql (5.1.69)]

It includes:
1. Documentation on the Oozie sqoop action
2. A sample workflow (against syslog-generated logs) that includes an Oozie sqoop action (export from Hive to MySQL).  Instructions on loading sample data and running the workflow are provided, along with some notes based on my learnings.


Related blogs

Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another

Sqoop:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.




Friday, June 14, 2013

Apache Sqoop - Part 5: Scheduling Sqoop jobs in Oozie

What's covered in the blog?

1. Documentation on the Oozie sqoop action
2. A sample workflow (against syslog-generated logs) that includes an Oozie sqoop action (export from Hive to MySQL).  Instructions on loading sample data and running the workflow are provided, along with some notes based on my learnings.

For scheduling an Oozie workflow containing a Sqoop action to be event driven (time or data availability triggered), read my blog on Oozie coordinator jobs.

Versions covered:
Oozie 3.3.0; Sqoop (1.4.2) with MySQL (5.1.69)

My blogs on Sqoop:

Blog 1: Import from mysql into HDFS
Blog 2: Import from mysql into Hive
Blog 3: Export from HDFS and Hive into mysql
Blog 4: Sqoop best practices
Blog 5: Scheduling of Sqoop tasks using Oozie
Blog 6: Sqoop2

My blogs on Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.


About the oozie sqoop action

Apache Oozie documentation on Sqoop action:
http://archive.cloudera.com/cdh/3/oozie/DG_SqoopActionExtension.html

Salient features of the sqoop action:
Excerpt from Apache documentation..
- The sqoop action runs a Sqoop job synchronously.
- The information to be included in the Oozie sqoop action are the job-tracker, the name-node, and the Sqoop command or arg elements, as well as configuration.
- A prepare node can be included to do any prep work, including HDFS actions; it will be executed prior to execution of the Sqoop job.
- Sqoop configuration can be specified with a file, using the job-xml element, and inline, using the configuration elements.
- Oozie EL expressions can be used in the inline configuration. Property values specified in the configuration element override values specified in the job-xml file.
- Note that the Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
- As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Sqoop job.


Sqoop command:
The Sqoop command can be specified either using the command element or multiple arg elements.
- When using the command element, Oozie will split the command on every space into multiple arguments.
- When using the arg elements, Oozie will pass each argument value as an argument to Sqoop.  The arg variant should be used when there are spaces within a single argument.
- All of the above elements can be parameterized (templatized) using EL expressions.
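
A sketch of a sqoop export action using the command element (the connection string, credentials, table, and export directory are placeholders, not the sample application below):

<action name="sqoopExportAction">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <!-- Oozie splits this command on spaces into individual Sqoop arguments -->
        <command>export --connect jdbc:mysql://mysqlhost/reportdb --username rptuser --password rptpwd --table log_report --export-dir /user/airawat/sqoopExportDir</command>
    </sqoop>
    <ok to="end"/>
    <error to="killJobAction"/>
</action>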


Components of a workflow with sqoop action:

Sample application

Highlights:
For this exercise, I have loaded some syslog-generated logs into HDFS and created a Hive table.
I have also created a table in MySQL that will be the destination of a report (Hive query) we will run.


Pictorial representation of the workflow:

Sample program:

Oozie web console:

Screenshots..

Monday, June 10, 2013

Apache Oozie - Part 2: Workflow - hive action

What's covered in the blog?

1. Documentation on the Oozie hive action
2. A sample workflow that includes an fs action, an email action, and a hive action (a query against some syslog-generated log files).

Version: 
Oozie 3.3.0

My other blogs on Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another
Blog 13: Oozie workflow - SSH action


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

About the Hive action

http://archive.cloudera.com/cdh4/cdh/4/oozie/DG_HiveActionExtension.html

Salient features of the hive action:

- Runs the specified hive job synchronously (the workflow job will wait until the Hive job completes before continuing to the next action).
- Can be configured to create or delete HDFS directories before starting the Hive job.
- Supports Hive scripts with parameter variables; their syntax is ${VARIABLES}.
- Hive configuration needs to be specified as part of the job submission.
- Oozie EL expressions can be used in the inline configuration. Property values specified in the configuration element override values specified in the job-xml file.
- Note that the Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
- As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Hive job.
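
A sketch of a hive action reflecting the points above (the script name, parameter, and node names are placeholders, not the sample program below):

<action name="hiveReportAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- hive-site.xml bundled with the workflow application -->
        <job-xml>hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>runLogReport.hql</script>
        <!-- passed to the script as ${LOG_DIR} -->
        <param>LOG_DIR=${inputDir}</param>
    </hive>
    <ok to="sendEmailSuccess"/>
    <error to="killJobAction"/>
</action>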

Components of a workflow with hive action:

For a workflow with (just a) hive action, the following are required:
1.  workflow.xml
2.  job.properties
3.  Any files, archives, jars you want to add
4.  hive-site.xml
5.  Hive query scripts

Refer to the sample program below.

Sample program

Highlights:

The workflow application runs a report on data in Hive.  The input is syslog-generated log data in Hive; the output is a Hive table containing the report results.

Pictorial overview of application:

Application:


Oozie web console:

Screenshots of application execution:




Sunday, June 9, 2013

Apache Oozie - Part 1: Workflow with hdfs and email actions

What's covered in this blog?

Apache Oozie documentation (version 3.3.0) on the workflow, hdfs action, and email action, plus a sample application that moves files in HDFS (move and delete operations) and sends emails notifying the status of workflow execution.  Sample data, commands, and output are also detailed.

My other blogs on Oozie:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

1.0. About Apache Oozie

1.0.1. What is Apache Oozie?  

It is an extensible, scalable, and data-aware service to orchestrate Hadoop jobs, manage job dependencies, and execute jobs based on event triggers such as time and data availability.  

There are three types of jobs in Oozie:
1.  Oozie workflow jobs
    DAGs of actions, which are jobs such as shell scripts, MapReduce, Sqoop, Streaming, Pipes, Pig, Hive, etc.
2.  Oozie coordinator jobs
    Invoke Oozie workflow jobs based on specified event triggers - date/time, data availability.
3.  Oozie bundle jobs
    Related Oozie coordinator jobs managed as a single job.

- An Oozie bundle job can have one to many coordinator jobs
- An Oozie coordinator job can have one to many workflow jobs
- An Oozie workflow can have one to many actions
- An Oozie workflow can have zero to many sub-workflows
   

1.0.2. Glossary of Oozie terminology

(From Apache Oozie documentation)
Action
An execution/computation task (a Map-Reduce job, a Pig job, a shell command). It can also be referred to as a task or 'action node'.
Workflow 
A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition 
A programmatic description of a workflow that can be executed.
Workflow Definition Language
The language used to define a Workflow Definition.
Workflow Job 
An executable instance of a workflow definition.
Workflow Engine
A system that executes workflow jobs. It can also be referred to as a DAG engine.


1.0.3. Oozie Architecture

Oozie is a Java Web-Application that runs in a Java servlet-container (Tomcat) and uses a database to store:
1.  Definitions of Oozie jobs - workflow/coordinator/bundle
2.  Currently running workflow instances, including instance states and variables

Oozie works with HSQL, Derby, MySQL, Oracle or PostgreSQL databases.  By default, Oozie is configured to use Embedded Derby.  Oozie bundles the JDBC drivers for HSQL, Embedded Derby and PostgreSQL.

For information about the different kinds of configuration such as User authentication, logging etc, refer:
http://oozie.apache.org/docs/3.3.0/AG_Install.html#Oozie_Configuration  


This diagram is from a Yahoo deck on Oozie..


2.0. Oozie Workflow

2.0.1 What is an Oozie workflow?

An Oozie workflow is a DAG of Hadoop computation/processing tasks (referred to as "actions") and flow "controls" to coordinate the tasks and manage dependencies of actions and their results.

2.0.1.1. Actions:
Oozie workflow actions start jobs on remote nodes, and upon completion, the processes executing the jobs call back to Oozie to notify completion, in response to which Oozie will start the next action.  Actions can be hadoop fs, ssh, map-reduce, hive, pig, sqoop, distcp, http, or email commands, or custom actions.

2.0.1.2. Controls:
Controls manage the execution path of actions and include start, fork, join, decision and end.

2.0.1.3. Parameterizing actions and decisions:
Actions and decisions can be parameterized with job properties, actions output (i.e. Hadoop counters) and file information (file exists, file size, etc). Formal parameters are expressed in the workflow definition as ${VAR} variables.

2.0.1.4. Workflow application:
A workflow application is an instance of a workflow, and is essentially a zip file containing everything needed to execute the actions within the workflows -  the workflow definition (an XML file), JARs for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for distributed cache and other resource files.

2.0.1.5. Workflow definition:
A workflow definition is a DAG with control flow nodes and action nodes expressed in the XML based workflow definition language called hPDL (Hadoop Process Definition Language).

2.0.1.6. Workflow nodes:
Nodes encompassing actions in hPDL are called action nodes, nodes encompassing controls are called control flow nodes, and together they are referred to as workflow nodes.


2.0.2. Oozie control flow functionality

[Straight from Apache Oozie documentation]

2.0.2.1. Start control node
The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to.
When a workflow is started, it automatically transitions to the node specified in the start node.
A workflow definition must have one start node.

2.0.2.2. End control node
The end node is the end for a workflow job; it indicates that the workflow job has completed successfully.  When a workflow job reaches the end it finishes successfully (SUCCEEDED).  If one or more actions started by the workflow job are executing when the end node is reached, the actions will be killed. In this scenario the workflow job is still considered successfully run.  A workflow definition must have one end node.

2.0.2.3. Kill control node
The kill node allows a workflow job to kill itself.  When a workflow job reaches the kill it finishes in error (KILLED).  If one or more actions started by the workflow job are executing when the kill node is reached, the actions will be killed.  A workflow definition may have zero or more kill nodes.

2.0.2.4. Decision node
A decision node enables a workflow to make a selection on the execution path to follow.  The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken. If none of the predicates evaluates to true, the default transition is taken.  Predicates are JSP Expression Language (EL) expressions that resolve to a boolean value, true or false.  The default element in the decision node indicates the transition to take if none of the predicates evaluates to true.  All decision nodes must have a default element to avoid bringing the workflow into an error state if none of the predicates evaluates to true.
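
A minimal hPDL sketch of a decision node (the node names and the predicate are placeholders):

<decision name="checkDataAvailable">
    <switch>
        <!-- take this path only if the input directory exists -->
        <case to="runReportAction">${fs:exists(concat(nameNode, inputDir))}</case>
        <default to="end"/>
    </switch>
</decision>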

2.0.2.5. Fork/join control nodes
A fork node splits one path of execution into multiple concurrent paths of execution.  A join node waits until every concurrent execution path of a previous fork node arrives to it.  The fork and join nodes must be used in pairs. The join node assumes concurrent execution paths are children of the same fork node.
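
A minimal hPDL sketch of a fork/join pair (the action names are placeholders):

<fork name="forkReports">
    <path start="hourlyReportAction"/>
    <path start="dailyReportAction"/>
</fork>

<!-- ... both report actions transition to the join on their ok path ... -->

<join name="joinReports" to="sendEmailAction"/>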


3.0. Oozie actions

Only two action types are covered in this blog.  More in subsequent blogs on oozie.

3.0.1. About the FS (hdfs) action

"The fs action allows to manipulate files and directories in HDFS from a workflow application. The supported commands are move , delete , mkdir , chmod , touchz and chgrp .
The FS commands are executed synchronously from within the FS action, the workflow job will wait until the specified file commands are completed before continuing to the next action.  Path names specified in the fs action can be parameterized (templatized) using EL expressions.  Each file path must specify the file system URI, for move operations, the target must not specified the system URI.

IMPORTANT: The commands within an fs action do not happen atomically; if an fs action fails halfway through, the commands already executed are not rolled back. The fs action, before executing any command, must check that source paths exist and target paths don't exist (the constraint regarding targets is relaxed for the move action; see the documentation for details), thus failing before executing any command. The validity of all paths specified in one fs action is therefore evaluated before any of the file operations are executed, so there is less chance of an error occurring while the fs action executes."
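
A sketch of an fs action combining some of these commands (the paths and node names are placeholders; note the move target omits the file system URI):

<action name="fsHousekeeping">
    <fs>
        <delete path="${nameNode}/user/airawat/oozieProject/workflowOutput"/>
        <mkdir path="${nameNode}/user/airawat/oozieProject/archive"/>
        <!-- move target is given without the file system URI -->
        <move source="${nameNode}/user/airawat/oozieProject/staging/part-00000"
              target="/user/airawat/oozieProject/archive/part-00000"/>
    </fs>
    <ok to="sendEmailAction"/>
    <error to="killJobAction"/>
</action>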

3.0.2. About the email action


The email action allows sending emails from a workflow application. An email action must provide 'to' addresses, optional 'cc' addresses, a subject, and a body.  Multiple recipients can be specified as comma-separated addresses.  The email action is executed synchronously; the workflow job will wait until the specified emails are sent before continuing to the next action.
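
A sketch of an email action (the addresses and node names are placeholders):

<action name="sendEmailSuccess">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com,oncall@example.com</to>
        <cc>team@example.com</cc>
        <subject>Oozie workflow ${wf:id()} succeeded</subject>
        <body>The workflow ${wf:id()} (${wf:name()}) completed successfully.</body>
    </email>
    <ok to="end"/>
    <error to="end"/>
</action>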

Apache documentation:
http://oozie.apache.org/docs/3.3.0/DG_EmailActionExtension.html



4.0. Building and executing an Oozie workflow with HDFS action and Email action

Pictorial overview





Sample program specifics


Oozie web console
Screenshot of the sample application entry:


This concludes this blog.  Happy hadooping!