Sunday, June 9, 2013

Apache Oozie - Part 1: Workflow with HDFS and email actions

What's covered in this blog?

Apache Oozie (version 3.3.0) documentation on workflows, the HDFS action, and the email action, plus a sample application that moves files in HDFS (move and delete operations) and sends emails notifying the status of workflow execution. Sample data, commands, and output are also detailed.

My other blogs on Oozie:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

1.0. About Apache Oozie

1.0.1. What is Apache Oozie?  

It is an extensible, scalable, and data-aware service to orchestrate Hadoop jobs, manage job dependencies, and execute jobs based on event triggers such as time and data availability.  

There are three types of jobs in Oozie:
1.  Oozie workflow jobs
DAGs of actions, where the actions are jobs such as shell scripts, MapReduce, Sqoop, Streaming, Pipes, Pig, or Hive jobs.
2.  Oozie coordinator jobs
Invoke Oozie workflow jobs based on specified event triggers: date/time or data availability.
3.  Oozie bundle jobs
Related Oozie coordinator jobs managed as a single job.

- An Oozie bundle job can have one to many coordinator jobs
- An Oozie coordinator job can have one to many workflow jobs
- An Oozie workflow can have one to many actions
- An Oozie workflow can have zero to many sub-workflows
   

1.0.2. Glossary of Oozie terminology

(From the Apache Oozie documentation)
Action
An execution/computation task (a MapReduce job, a Pig job, a shell command). It can also be referred to as a task or an 'action node'.
Workflow
A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition
A programmatic description of a workflow that can be executed.
Workflow Definition Language
The language used to define a Workflow Definition.
Workflow Job
An executable instance of a workflow definition.
Workflow Engine
A system that executes workflow jobs. It can also be referred to as a DAG engine.


1.0.3. Oozie Architecture

Oozie is a Java Web-Application that runs in a Java servlet-container (Tomcat) and uses a database to store:
1.  Definitions of Oozie jobs - workflow/coordinator/bundle
2.  Currently running workflow instances, including instance states and variables

Oozie works with HSQL, Derby, MySQL, Oracle or PostgreSQL databases.  By default, Oozie is configured to use Embedded Derby.  Oozie bundles the JDBC drivers for HSQL, Embedded Derby and PostgreSQL.
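
To switch from the embedded Derby database to, say, MySQL, the JPA service properties in oozie-site.xml are typically what need to change. A minimal sketch, assuming a local MySQL instance with an "oozie" database and user (all values below are placeholders); note that the MySQL JDBC driver is not bundled with Oozie and must be added separately:

  <!-- Sketch of oozie-site.xml entries for an external MySQL metastore;
       host, port, database name, and credentials are placeholders. -->
  <property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://localhost:3306/oozie</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>oozie</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>oozie</value>
  </property>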

For information about the different kinds of configuration, such as user authentication, logging, etc., refer to:
http://oozie.apache.org/docs/3.3.0/AG_Install.html#Oozie_Configuration  


(Oozie architecture diagram from a Yahoo! deck on Oozie; not reproduced here.)


2.0. Oozie Workflow

2.0.1 What is an Oozie workflow?

An Oozie workflow is a DAG of Hadoop computation/processing tasks (referred to as "actions") and flow "controls" that coordinate the tasks and manage dependencies between actions and their results.

2.0.1.1. Actions:
Oozie workflow actions start jobs on remote nodes; upon completion, the processes executing the jobs call back Oozie to notify it of completion, in response to which Oozie starts the next action. Actions can be Hadoop FS, SSH, MapReduce, Hive, Pig, Sqoop, DistCp, HTTP, or email commands, or custom actions.

2.0.1.2. Controls:
Controls manage the execution path of actions and include start, fork, join, decision and end.

2.0.1.3. Parameterizing actions and decisions:
Actions and decisions can be parameterized with job properties, action output (e.g., Hadoop counters), and file information (file exists, file size, etc.). Formal parameters are expressed in the workflow definition as ${VAR} variables.
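
For instance, a minimal sketch of a parameterized fs step, assuming a nameNode property defined in the job configuration (job.properties); wf:user() is a built-in Oozie EL function that resolves to the user running the workflow job:

  <fs>
    <!-- ${nameNode} is resolved from job properties;
         ${wf:user()} is resolved by Oozie's EL at run time -->
    <delete path="${nameNode}/user/${wf:user()}/outputDir"/>
  </fs>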

2.0.1.4. Workflow application:
A workflow application is the deployable package for a workflow: essentially a ZIP file containing everything needed to execute the actions within the workflow, namely the workflow definition (an XML file), JARs for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for the distributed cache, and other resource files.

2.0.1.5. Workflow definition:
A workflow definition is a DAG with control flow nodes and action nodes expressed in the XML based workflow definition language called hPDL (Hadoop Process Definition Language).

2.0.1.6. Workflow nodes:
In hPDL, nodes encompassing actions are called action nodes, nodes encompassing controls are called control flow nodes, and together they are referred to as workflow nodes.


2.0.2. Oozie control flow functionality

[Straight from Apache Oozie documentation]

2.0.2.1. Start control node
The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to.
When a workflow is started, it automatically transitions to the node specified in the start node.
A workflow definition must have one start node.

2.0.2.2. End control node
The end node is the end for a workflow job; it indicates that the workflow job has completed successfully.  When a workflow job reaches the end node, it finishes successfully (SUCCEEDED).  If one or more actions started by the workflow job are still executing when the end node is reached, those actions are killed. In this scenario the workflow job is still considered to have run successfully.  A workflow definition must have one end node.

2.0.2.3. Kill control node
The kill node allows a workflow job to kill itself.  When a workflow job reaches a kill node, it finishes in error (KILLED).  If one or more actions started by the workflow job are still executing when the kill node is reached, those actions are killed.  A workflow definition may have zero or more kill nodes.
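
Putting the three together, a minimal sketch of a workflow skeleton (the action shown is an illustrative fs touchz; wf:errorMessage and wf:lastErrorNode are built-in EL functions):

  <workflow-app name="skeleton-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="touch-marker"/>
    <action name="touch-marker">
      <fs>
        <!-- create an empty marker file; path is illustrative -->
        <touchz path="${nameNode}/user/${wf:user()}/ready.marker"/>
      </fs>
      <ok to="end"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
  </workflow-app>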

2.0.2.4. Decision node
A decision node enables a workflow to make a selection on the execution path to follow.  The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken. If none of the predicates evaluates to true, the default transition is taken.  Predicates are JSP Expression Language (EL) expressions that resolve to a boolean value, true or false.  The default element in the decision node indicates the transition to take if none of the predicates evaluates to true.  All decision nodes must have a default element, to avoid bringing the workflow into an error state if no predicate evaluates to true.
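
A sketch adapted from the decision-node example in the Apache documentation (node names and the secondjobOutputDir variable are illustrative); fs:fileSize is a built-in EL function and GB a predefined EL constant:

  <decision name="mydecision">
    <switch>
      <!-- taken if the predicate evaluates to true -->
      <case to="reconsolidatejob">
        ${fs:fileSize(secondjobOutputDir) gt 10 * GB}
      </case>
      <!-- mandatory fallback when no predicate evaluates to true -->
      <default to="end"/>
    </switch>
  </decision>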

2.0.2.5. Fork/join control nodes
A fork node splits one path of execution into multiple concurrent paths of execution.  A join node waits until every concurrent execution path of a previous fork node arrives at it.  The fork and join nodes must be used in pairs. The join node assumes all concurrent execution paths are children of the same fork node.
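
A sketch of a fork/join pair (node names are illustrative); both forked actions must transition to the same join node:

  <fork name="forking">
    <path start="firstparalleljob"/>
    <path start="secondparalleljob"/>
  </fork>
  <!-- ...both firstparalleljob and secondparalleljob transition
       to "joining" in their <ok> elements... -->
  <join name="joining" to="nextaction"/>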


3.0. Oozie actions

Only two action types are covered in this blog; more will follow in subsequent blogs on Oozie.

3.0.1. About the FS (hdfs) action

"The fs action allows to manipulate files and directories in HDFS from a workflow application. The supported commands are move , delete , mkdir , chmod , touchz and chgrp .
The FS commands are executed synchronously from within the FS action, the workflow job will wait until the specified file commands are completed before continuing to the next action.  Path names specified in the fs action can be parameterized (templatized) using EL expressions.  Each file path must specify the file system URI, for move operations, the target must not specified the system URI.

IMPORTANT: All the commands within fs action do not happen atomically, if a fs action fails half way in the commands being executed, successfully executed commands are not rolled back. The fs action, before executing any command must check that source paths exist and target paths don't exist (constraint regarding target relaxed for the move action. See below for details), thus failing before executing any command. Therefore the validity of all paths specified in one fs action are evaluated before any of the file operation are executed. Thus there is less chance of an error occurring while the fs action executes."
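
A sketch of an fs action along these lines; the paths and transition targets are illustrative, and ${nameNode} is assumed to come from job.properties:

  <action name="fs-housekeeping">
    <fs>
      <!-- delete and mkdir paths must carry the file system URI -->
      <delete path="${nameNode}/user/oozie/workflowDir/outputDir"/>
      <mkdir path="${nameNode}/user/oozie/workflowDir/archiveDir"/>
      <!-- for move, the source carries the URI but the target must not -->
      <move source="${nameNode}/user/oozie/workflowDir/inputDir"
            target="/user/oozie/workflowDir/archiveDir/inputDir"/>
      <chmod path="${nameNode}/user/oozie/workflowDir/archiveDir"
             permissions="755" dir-files="true"/>
    </fs>
    <ok to="sendEmailSuccess"/>
    <error to="fail"/>
  </action>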

3.0.2. About the email action


The email action allows sending emails from an Oozie workflow application. An email action must provide to addresses, optional cc addresses, a subject, and a body. Multiple recipients can be provided as comma-separated addresses.  The email action is executed synchronously, and the workflow job will wait until the specified emails are sent before continuing to the next action.

Apache documentation:
http://oozie.apache.org/docs/3.3.0/DG_EmailActionExtension.html
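
A sketch of an email action (addresses are placeholders; wf:id and wf:name are built-in EL functions). Note that the SMTP server is configured separately in oozie-site.xml via properties such as oozie.email.smtp.host and oozie.email.smtp.port:

  <action name="sendEmailSuccess">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>you@example.com,teammate@example.com</to>
      <cc>manager@example.com</cc>
      <subject>Workflow ${wf:id()} succeeded</subject>
      <body>Workflow ${wf:name()} with id ${wf:id()} completed successfully.</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
  </action>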



4.0. Building and executing an Oozie workflow with HDFS and email actions

Pictorial overview

(Diagram of the sample workflow not reproduced here.)

Sample program specifics
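
The embedded gist with the sample code is not reproduced here. A minimal sketch of what the sample application's workflow.xml might look like, combining an fs action (move and delete) with email notifications for success and failure; all paths, node names, and the nameNode and emailToAddress properties are illustrative and would be supplied via job.properties:

  <workflow-app name="WorkflowWithFsAndEmailActions" xmlns="uri:oozie:workflow:0.4">
    <start to="fs-node"/>
    <action name="fs-node">
      <fs>
        <!-- clean up an old directory, then archive the input -->
        <delete path="${nameNode}/user/${wf:user()}/oozieProject/deleteMe"/>
        <move source="${nameNode}/user/${wf:user()}/oozieProject/inputDir"
              target="/user/${wf:user()}/oozieProject/archiveDir"/>
      </fs>
      <ok to="sendEmailSuccess"/>
      <error to="sendEmailFailure"/>
    </action>
    <action name="sendEmailSuccess">
      <email xmlns="uri:oozie:email-action:0.1">
        <to>${emailToAddress}</to>
        <subject>Workflow ${wf:id()} succeeded</subject>
        <body>The fs action completed successfully.</body>
      </email>
      <ok to="end"/>
      <error to="fail"/>
    </action>
    <action name="sendEmailFailure">
      <email xmlns="uri:oozie:email-action:0.1">
        <to>${emailToAddress}</to>
        <subject>Workflow ${wf:id()} failed</subject>
        <body>The fs action failed: ${wf:errorMessage(wf:lastErrorNode())}</body>
      </email>
      <ok to="fail"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
  </workflow-app>

The application directory (workflow.xml plus any resources) would be copied to HDFS, and the job run with something like oozie job -oozie http://localhost:11000/oozie -config job.properties -run, where job.properties sets oozie.wf.application.path, nameNode, and emailToAddress.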


Oozie web console

(Screenshot of the sample application's entry in the Oozie web console not reproduced here.)

This concludes this blog.  Happy hadooping!  
