Tuesday, July 16, 2013

Apache Oozie - Part 8: Subworkflow


1.0. What's covered in the blog?

1) Apache documentation on sub-workflows
2) A sample program that includes components of an Oozie workflow application with a java main action and a sub-workflow containing a sqoop action.  Scripts/code, a sample dataset and commands are included.  Oozie actions covered: java main action, sqoop action (mysql database);

Versions:
Oozie 3.3.0, Sqoop (1.4.2) with Mysql (5.1.69)

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action + passing output from one action to another


2.0. Apache documentation on sub-workflows


The sub-workflow action runs a child workflow job; the child workflow job can be in the same Oozie system or in another Oozie system.  The parent workflow job will wait until the child workflow job has completed.

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sub-workflow>
            <app-path>[WF-APPLICATION-PATH]</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
        </sub-workflow>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The child workflow job runs in the same Oozie system instance where the parent workflow job is running.
The app-path element specifies the path to the workflow application of the child workflow job.
The propagate-configuration flag, if present, indicates that the workflow job configuration should be propagated to the child workflow.

The configuration section can be used to specify the job properties that are required to run the child workflow job.  The configuration of the sub-workflow action can be parameterized (templatized) using EL expressions.
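
For instance, a parameterized sub-workflow action could look like the following sketch (the property name and paths are illustrative, not from the spec):

<sub-workflow>
    <app-path>${nameNode}/user/${user.name}/oozieProject/subWorkflowApp</app-path>
    <propagate-configuration/>
    <configuration>
        <property>
            <name>subWorkflowInputDir</name>
            <value>${subWorkflowInputDir}</value>
        </property>
    </configuration>
</sub-workflow>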

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.6_Sub-workflow_Action

Note:
For a typical on-demand workflow, you have two core components - job.properties and workflow.xml.  For a sub-workflow, you need yet another workflow.xml that clearly defines the activities to occur in the sub-workflow; in the parent workflow, the sub-workflow is merely referenced.  To keep things neat, it is best to have a sub-directory to hold the sub-workflow's core components.  A single job.properties is sufficient, as in the layout sketch below.
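
E.g.
workflowAppPath
    workflow.xml
    job.properties
    Any other lib/archives/files etc.

    subWorkflowAppPath
        workflow.xml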

3.0. Sample workflow application

Highlights:
The workflow has two actions - one is a java main action and the other is a sub-workflow action.

The java main action parses log files on hdfs and generates a report.
The sub-workflow action executes upon success of the java main action, and pipes the report on hdfs to a mysql database via a sqoop action.
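
A minimal sketch of what the parent workflow.xml could look like (node names, the main class and paths below are illustrative assumptions, not the exact ones from the sample application):

<workflow-app name="WorkflowJavaMainAndSubWF" xmlns="uri:oozie:workflow:0.1">
    <start to="javaMainAction"/>
    <!-- Java main action: parses the logs and writes a report to hdfs -->
    <action name="javaMainAction">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.LogParser</main-class>
            <arg>${inputDir}</arg>
            <arg>${outputDir}</arg>
        </java>
        <ok to="sqoopSubWorkflow"/>
        <error to="killJob"/>
    </action>
    <!-- Sub-workflow action: runs the child workflow that holds the sqoop action -->
    <action name="sqoopSubWorkflow">
        <sub-workflow>
            <app-path>${appPath}/subWorkflowApp</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="killJob"/>
    </action>
    <kill name="killJob">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>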


Pictorial overview:
[Image: pictorial overview of the workflow]

Components of such a workflow application:
[Code listing: components of the workflow application]

Application details:
[Code listing: application details]

Oozie web console - screenshots:
[Screenshots from the execution of the sample program]

Thursday, July 11, 2013

Apache Oozie - Part 10: Bundle jobs



1.0. What's covered in the blog?

1) Apache documentation on bundle jobs
2) A sample bundle application with two coordinator apps - one that is time triggered, the other dataset-availability triggered.  Oozie actions covered: hdfs action, email action, java main action, sqoop action (mysql database).  Includes oozie job property files, workflow xml files, sample data (syslog generated files), a java program (jar) for log parsing, and commands.

Version:
Oozie 3.3.0;

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered, with sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows


2.0. About Oozie bundle jobs

Excerpt from Apache documentation-
Bundle is a higher-level oozie abstraction that batches a set of coordinator applications. The user can start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.
More specifically, the oozie Bundle system allows the user to define and execute a bunch of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependency of coordinator applications to create an implicit data application pipeline.

Apache documentation:
http://oozie.apache.org/docs/3.3.0/BundleFunctionalSpec.html#a1._Bundle_Overview


A bundle job can have one to many coordinator jobs.
A coordinator job can have one to many workflows.
A workflow can have one to many actions.
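
A minimal sketch of a bundle definition file batching two coordinator apps (names and paths are illustrative, not the exact ones from the sample application):

<bundle-app name="SampleBundle" xmlns="uri:oozie:bundle:0.1">
    <controls>
        <!-- When the bundle should materialize its coordinators -->
        <kick-off-time>${kickOffTime}</kick-off-time>
    </controls>
    <coordinator name="coordJobOne">
        <app-path>${nameNode}/user/${user.name}/oozieProject/coordJobOneApp</app-path>
    </coordinator>
    <coordinator name="coordJobTwo">
        <app-path>${nameNode}/user/${user.name}/oozieProject/coordJobTwoApp</app-path>
    </coordinator>
</bundle-app>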


3.0. Sample program

Highlights:
The sample bundle application is time triggered.  The start time is defined in the bundle job.properties file.  The bundle application starts two coordinator applications, as defined in the bundle definition file - bundleConfiguration.xml.

The first coordinator job is time triggered.  Its start time is defined in the bundle job.properties file.  It runs a workflow that includes a java main action.  The java program parses some log files and generates a report.  The output of the java action is a dataset (the report), which is the trigger for the next coordinator job.

The second coordinator job gets triggered upon availability of the file _SUCCESS in the output directory of the workflow of the first coordinator application.  It executes a workflow that has a sqoop action; the sqoop action pipes the output of the first coordinator job to a mysql database.
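
The _SUCCESS dependency of the second coordinator can be modeled as a dataset with a done-flag - a sketch, with an illustrative URI and names:

<dataset name="logReport" frequency="${coord:days(1)}"
         initial-instance="${startTime}" timezone="UTC">
    <!-- Directory the first coordinator's workflow writes the report to -->
    <uri-template>${nameNode}/user/${user.name}/oozieProject/report/${YEAR}${MONTH}${DAY}</uri-template>
    <!-- The dataset instance is considered available only when this file exists -->
    <done-flag>_SUCCESS</done-flag>
</dataset>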


Pictorial overview of the job:
[Image: pictorial overview of the bundle job]

Components of the bundle application:
[Code listing: components of the bundle application]

Bundle application details:
[Code listing: bundle application details]

Oozie web console - screenshots:
[Screenshots from the execution of the bundle job]

Wednesday, July 10, 2013

Apache Oozie - Part 9c: Coordinator job - dataset availability triggered


1.0. What's covered in the blog?

1) Apache documentation on coordinator jobs that execute workflows upon availability of datasets
2) A sample program that includes components of an Oozie dataset-availability-initiated coordinator job - scripts/code, sample dataset and commands.  Oozie actions covered: hdfs action, email action, sqoop action (mysql database);

Version:
Oozie 3.3.0;

Related blogs:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action + passing output from one action to another
Blog 13: Oozie workflow - SSH action


2.0. Apache documentation on dataset availability triggered coordinator jobs

http://oozie.apache.org/docs/3.3.0/CoordinatorFunctionalSpec.html

3.0. Sample coordinator application


Highlights:
The coordinator application has a start time; when the start time condition is met, it transitions to a waiting state where it looks for the availability of a dataset.  Once the dataset is available, it runs the specified workflow.
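
A sketch of such a coordinator definition (URIs, names and properties below are illustrative, not the exact ones from the sample application):

<coordinator-app name="coordDataAvailTrigger" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="inputDS" frequency="${coord:days(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <uri-template>${nameNode}/user/${user.name}/oozieProject/data/${YEAR}${MONTH}${DAY}</uri-template>
            <!-- Wait for this file before considering the dataset available -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="inputDS">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <property>
                    <!-- Hand the resolved dataset directory to the workflow -->
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>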


Sample application - pictorial overview:
[Image: pictorial overview of the coordinator application]

Coordinator application components:
[Code listing: coordinator application components]

Coordinator application details:
[Code listing: coordinator application details]

Oozie web console output:
Screenshots from the execution of the sample program:
[Screenshots]

Upon availability of the dataset...
[Screenshots]

Tuesday, July 9, 2013

Apache Oozie - Part 9b: Coordinator jobs - (trigger) file triggered


1.0. What's covered in the blog?

A sample application that includes components of an Oozie (trigger) file triggered coordinator job - scripts/code, sample data (Syslog generated log files) and commands.  Oozie actions covered: hdfs action, email action, java main action, hive action.  Oozie controls covered: decision, fork-join.  The workflow includes a sub-workflow that runs two hive actions concurrently.  The hive table is partitioned; parsing uses hive-regex and Java-regex.  Also, the java mapper gets the input directory path and includes part of it in the key.

Version:
Oozie 3.3.0;

Related blogs:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action + passing output from one action to another
Blog 13: Oozie workflow - SSH action

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

2.0. Sample coordinator application

Highlights:
The coordinator application starts executing upon availability of the defined trigger file, and initiates the two workflows.  Both workflows generate reports off of data in hdfs.
The java main action parses log files and generates a report.
The hive actions in the hive sub-workflow run reports off of hive tables against the same log files in hdfs.
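
One way to express a trigger file in a coordinator is as a dataset whose done-flag names the trigger file - a sketch with illustrative names (this may differ from the exact mechanism in the sample application):

<dataset name="triggerDS" frequency="${coord:days(1)}"
         initial-instance="${startTime}" timezone="UTC">
    <uri-template>${nameNode}/user/${user.name}/oozieProject/triggerDir/${YEAR}${MONTH}${DAY}</uri-template>
    <!-- The coordinator waits for this trigger file instead of _SUCCESS -->
    <done-flag>trigger.txt</done-flag>
</dataset>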

Pictorial overview of coordinator application:
[Image: pictorial overview of the coordinator application]

Components:
[Code listing: coordinator application components]

Coordinator application details:
[Code listing: coordinator application details]

Oozie web console - screenshots:
[Screenshots from the execution of the sample program]

Thursday, July 4, 2013

Apache Oozie - Part 9a: Coordinator jobs - time triggered; fork-join and decision controls

1.0. What's covered in the blog?

1. Oozie documentation on coordinator jobs, sub-workflows, fork-join, and decision controls
2. A sample application that includes components of an Oozie time-triggered coordinator job - scripts/code, sample data and commands.  Oozie actions covered: hdfs action, email action, java main action, hive action.  Oozie controls covered: decision, fork-join.  The workflow includes a sub-workflow that runs two hive actions concurrently.  The hive table is partitioned; parsing uses hive-regex and Java-regex.  Also, the java mapper gets the input directory path and includes part of it in the key.

Version:
Oozie 3.3.0;

Related blogs:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action + passing output from one action to another
Blog 13: Oozie workflow - SSH action


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

2.0. Oozie sub-workflow

The sub-workflow action runs a child workflow job; the child workflow job can be in the same Oozie system or in another Oozie system.  The parent workflow job will wait until the child workflow job has completed.

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sub-workflow>
            <app-path>[WF-APPLICATION-PATH]</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
        </sub-workflow>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

The child workflow job runs in the same Oozie system instance where the parent workflow job is running.
The app-path element specifies the path to the workflow application of the child workflow job.
The propagate-configuration flag, if present, indicates that the workflow job configuration should be propagated to the child workflow.

The configuration section can be used to specify the job properties that are required to run the child workflow job.  The configuration of the sub-workflow action can be parameterized (templatized) using EL expressions.

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.6_Sub-workflow_Action

Note:
For a typical on-demand workflow, you have two core components - job.properties and workflow.xml.  For a sub-workflow, you need yet another workflow.xml that clearly defines the activities to occur in the sub-workflow; in the parent workflow, the sub-workflow is merely referenced.  To keep things neat, it is best to have a sub-directory to hold the sub-workflow's core components.  A single job.properties is sufficient, as in the layout below.

E.g.
workflowAppPath
    workflow.xml
    job.properties
    Any other lib/archives/files etc

    subWorkflowAppPath
        workflow.xml

       

3.0. Coordinator job

Users typically run map-reduce, hadoop-streaming, hdfs and/or Pig jobs on the grid. Multiple of these jobs can be combined to form a workflow job. Oozie, a Hadoop workflow system, defines a workflow system that runs such jobs.

Commonly, workflow jobs are run based on regular time intervals and/or data availability. And, in some cases, they can be triggered by an external event.  Expressing the condition(s) that trigger a workflow job can be modeled as a predicate that has to be satisfied. 

The workflow job is started after the predicate is satisfied. A predicate can reference data, time and/or external events. In the future, the model can be extended to support additional event types.
It is also necessary to connect workflow jobs that run regularly, but at different time intervals. The outputs of multiple subsequent runs of a workflow become the input to the next workflow. For example, the outputs of the last 4 runs of a workflow that runs every 15 minutes become the input of another workflow that runs every 60 minutes. Such a chain of workflows is referred to as a data application pipeline.

The Oozie Coordinator system allows the user to define and execute recurrent and interdependent workflow jobs (data application pipelines).  Real world data application pipelines have to account for reprocessing, late processing, catchup, partial processing, monitoring, notification and SLAs.

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/CoordinatorFunctionalSpec.html
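
A minimal time-triggered coordinator definition, as a sketch (the name and properties below are illustrative):

<coordinator-app name="coordTimeTrigger" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <!-- Materializes one workflow run per day between start and end -->
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>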


4.0. Decision control

A decision node enables a workflow to make a selection on the execution path to follow.  The behavior of a decision node can be seen as a switch-case statement.

A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken. If none of the predicates evaluates to true, the default transition is taken.

Predicates are JSP Expression Language (EL) expressions (refer to section 4.2 of the spec) that resolve to a boolean value, true or false.  For example:
${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}

Syntax:
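
As given in the Apache spec (linked below):

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <decision name="[NODE-NAME]">
        <switch>
            <case to="[NODE_NAME]">[PREDICATE]</case>
            ...
            <case to="[NODE_NAME]">[PREDICATE]</case>
            <default to="[NODE_NAME]"/>
        </switch>
    </decision>
    ...
</workflow-app>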
The name attribute in the decision node is the name of the decision node.
Each case element contains a predicate and a transition name. The predicate ELs are evaluated in order until one returns true and the corresponding transition is taken.

The default element indicates the transition to take if none of the predicates evaluates to true.
All decision nodes must have a default element to avoid bringing the workflow into an error state if none of the predicates evaluates to true.

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.1.4_Decision_Control_Node


5.0. Fork-Join controls

A fork node splits one path of execution into multiple concurrent paths of execution.
A join node waits until every concurrent execution path of a previous fork node arrives at it.
The fork and join nodes must be used in pairs.
The join node assumes concurrent execution paths are children of the same fork node.

Syntax:

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <fork name="[FORK-NODE-NAME]">
        <path start="[NODE-NAME]"/>
        ...
        <path start="[NODE-NAME]"/>
    </fork>
    ...
    <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]"/>
    ...
</workflow-app>

The name attribute in the fork node is the name of the workflow fork node. The start attribute in the path elements of the fork node indicates the name of the workflow node that will be part of the concurrent execution paths.

The name attribute in the join node is the name of the workflow join node. The to attribute in the join node indicates the name of the workflow node that will be executed after all concurrent execution paths of the corresponding fork arrive at the join node.

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes

6.0. Helpful sites

7.0. Sample coordinator application
Highlights:
The sample application includes components of an Oozie (time initiated) coordinator application - scripts/code, sample data and commands.  Oozie actions covered: hdfs action, email action, java main action, hive action.  Oozie controls covered: decision, fork-join.  The workflow includes a sub-workflow that runs two hive actions concurrently, as sketched below.  The hive table is partitioned; parsing uses hive-regex and Java-regex.  Also, the java mapper gets the input directory path and includes part of it in the key.
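
A sketch of how the sub-workflow might fork the two hive actions and join them (node names and script files are illustrative, not the exact ones from the sample application):

<workflow-app name="hiveSubWorkflow" xmlns="uri:oozie:workflow:0.1">
    <start to="forkHiveReports"/>
    <!-- Run both hive reports concurrently -->
    <fork name="forkHiveReports">
        <path start="hiveReportOne"/>
        <path start="hiveReportTwo"/>
    </fork>
    <action name="hiveReportOne">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>reportOne.hql</script>
        </hive>
        <ok to="joinHiveReports"/>
        <error to="killJob"/>
    </action>
    <action name="hiveReportTwo">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>reportTwo.hql</script>
        </hive>
        <ok to="joinHiveReports"/>
        <error to="killJob"/>
    </action>
    <!-- Proceed only after both hive actions complete -->
    <join name="joinHiveReports" to="end"/>
    <kill name="killJob">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>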

Pictorial overview of application:
[Image: pictorial overview of the application]

Components of application:
[Code listing: application components]

Application details:
[Code listing: application details]

Oozie web console:
Screenshots from the execution of the sample program:
[Screenshots]