Sunday, June 9, 2013

Apache Oozie - Part 1: Workflow with hdfs and email actions

What's covered in this blog?

Apache Oozie documentation (version 3.3.0) on workflows, the HDFS action, and the email action, plus a sample application that moves and deletes files in HDFS and sends emails reporting the status of the workflow execution.  Sample data, commands, and output are also detailed.

My other blogs on Oozie:

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

1.0. About Apache Oozie

1.0.1. What is Apache Oozie?  

It is an extensible, scalable, and data-aware service to orchestrate Hadoop jobs, manage job dependencies, and execute jobs based on event triggers such as time and data availability.  

There are three types of jobs in Oozie:
1.  Oozie workflow jobs
    DAGs of actions, where the actions are jobs such as shell scripts, MapReduce, Sqoop, Streaming, Pipes, Pig, Hive, etc.
2.  Oozie coordinator jobs
    Invoke Oozie workflow jobs based on specified event triggers - date/time, data availability.
3.  Oozie bundle jobs
    Related Oozie coordinator jobs managed as a single job

- An Oozie bundle job can contain one or more coordinator jobs
- An Oozie coordinator job can invoke one or more workflow jobs
- An Oozie workflow can have one or more actions
- An Oozie workflow can have zero or more sub-workflows
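
Coordinators and bundles are the subject of later posts in this series; as a quick, hedged taste of how the layers nest, a minimal coordinator sketch that triggers a workflow once a day might look like the following (the dates, frequency, and paths are illustrative placeholders):

<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2013-06-01T00:00Z" end="2013-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <!-- HDFS path of the workflow application this coordinator invokes -->
            <app-path>${nameNode}/user/${user.name}/oozieProject/workflowApp</app-path>
        </workflow>
    </action>
</coordinator-app>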
   

1.0.2. Glossary of Oozie terminology

(From Apache Oozie documentation)
Action
An execution/computation task (a MapReduce job, a Pig job, a shell command). It can also be referred to as a task or an 'action node'.
Workflow 
A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition 
A programmatic description of a workflow that can be executed.
Workflow Definition Language
The language used to define a Workflow Definition.
Workflow Job 
An executable instance of a workflow definition.
Workflow Engine
A system that executes workflow jobs. It can also be referred to as a DAG engine.


1.0.3. Oozie Architecture

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
1.  Definitions of Oozie jobs - workflow/coordinator/bundle
2.  Currently running workflow instances, including instance states and variables

Oozie works with HSQL, Derby, MySQL, Oracle or PostgreSQL databases.  By default, Oozie is configured to use Embedded Derby.  Oozie bundles the JDBC drivers for HSQL, Embedded Derby and PostgreSQL.

For information about the different kinds of configuration, such as user authentication, logging, etc., refer to:
http://oozie.apache.org/docs/3.3.0/AG_Install.html#Oozie_Configuration  
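
For example, switching the Oozie metastore from the embedded Derby to MySQL comes down to the JPAService properties in oozie-site.xml. A minimal sketch, assuming MySQL on localhost (the database name and credentials are placeholders; note that the MySQL JDBC driver is not bundled with Oozie and must be added to the server's classpath):

<!-- oozie-site.xml: point Oozie's JPA service at MySQL -->
<property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://localhost:3306/oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>oozie</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>oozie</value>
</property>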


[Oozie architecture diagram from a Yahoo! deck on Oozie - not reproduced here.]


2.0. Oozie Workflow

2.0.1 What is an Oozie workflow?

An Oozie workflow is a DAG of Hadoop computation/processing tasks (referred to as "actions") and flow "controls" that coordinate the tasks and manage the dependencies among actions and their results.

2.0.1.1. Actions:
Oozie workflow actions start jobs on remote nodes; upon completion, the processes executing those jobs call back to Oozie to notify it of completion, in response to which Oozie starts the next action.  Actions can be hadoop fs, ssh, MapReduce, Hive, Pig, Sqoop, DistCp, HTTP, or email commands, or custom actions.

2.0.1.2. Controls:
Controls manage the execution path of actions and include start, fork, join, decision and end.

2.0.1.3. Parameterizing actions and decisions:
Actions and decisions can be parameterized with job properties, actions output (i.e. Hadoop counters) and file information (file exists, file size, etc). Formal parameters are expressed in the workflow definition as ${VAR} variables.
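
For example, a path used in an action can be written against formal parameters that are resolved at submission time from the job properties file (the names below are illustrative):

In workflow.xml:

    <delete path="${nameNode}/user/${wf:user()}/${outputDir}"/>

In job.properties:

    nameNode=hdfs://localhost:8020
    outputDir=oozieProject/output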

2.0.1.4. Workflow application:
A workflow application is the deployable packaging of a workflow - essentially a ZIP file (or an HDFS directory) containing everything needed to execute the actions within the workflow: the workflow definition (an XML file), JARs for MapReduce jobs, shells for streaming MapReduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for the distributed cache, and other resource files.
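
Such an application is typically deployed as a directory in HDFS; a sketch of a minimal layout (the names are illustrative):

workflowHdfsAndEmail/
|-- workflow.xml        (the workflow definition)
|-- lib/                (JARs and native libraries, if any)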

2.0.1.5. Workflow definition:
A workflow definition is a DAG with control flow nodes and action nodes expressed in the XML based workflow definition language called hPDL (Hadoop Process Definition Language).

2.0.1.6. Workflow nodes:
In hPDL, nodes encompassing actions are called action nodes, nodes encompassing controls are called control flow nodes, and together they are referred to as workflow nodes.


2.0.2. Oozie control flow functionality

[Straight from Apache Oozie documentation]

2.0.2.1. Start control node
The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to.
When a workflow is started, it automatically transitions to the node specified in the start node.
A workflow definition must have one start node.

2.0.2.2. End control node
The end node is the end for a workflow job; it indicates that the workflow job has completed successfully.  When a workflow job reaches the end node, it finishes successfully (SUCCEEDED).  If one or more actions started by the workflow job are still executing when the end node is reached, those actions will be killed; in this scenario the workflow job is still considered to have run successfully.  A workflow definition must have one end node.

2.0.2.3. Kill control node
The kill node allows a workflow job to kill itself.  When a workflow job reaches the kill node, it finishes in error (KILLED).  If one or more actions started by the workflow job are still executing when the kill node is reached, those actions will be killed.  A workflow definition may have zero or more kill nodes.  A minimal skeleton combining the start, end, and kill nodes appears below.
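
A minimal skeleton showing the start, end, and kill nodes together (the workflow and node names are illustrative; wf:errorMessage and wf:lastErrorNode are standard workflow EL functions):

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="firstAction"/>

    <!-- action nodes go here; each transitions to "end" on success
         and to "fail" on error -->

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>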

2.0.2.4. Decision node
A decision node enables a workflow to make a selection on the execution path to follow.  The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken; if none of the predicates evaluates to true, the default transition is taken.  Predicates are JSP Expression Language (EL) expressions that resolve to a boolean value, true or false.  The default element in the decision node indicates the transition to take if none of the predicates evaluates to true.  All decision nodes must have a default element, to avoid bringing the workflow into an error state if none of the predicates evaluates to true.
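
As a hedged sketch, a decision node that routes on the size of an input file via an EL predicate (the paths and node names are illustrative; fs:fileSize and the GB constant are standard Oozie EL):

<decision name="size-check">
    <switch>
        <case to="processLargeInput">${fs:fileSize(inputPath) gt 1 * GB}</case>
        <default to="processSmallInput"/>
    </switch>
</decision>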

2.0.2.5. Fork/join control nodes
A fork node splits one path of execution into multiple concurrent paths of execution.  A join node waits until every concurrent execution path of a previous fork node arrives at it.  The fork and join nodes must be used in pairs; the join node assumes the concurrent execution paths are children of the same fork node.
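
A sketch of a fork that starts two actions concurrently, paired with the join both must reach before the workflow proceeds (node names are illustrative):

<fork name="forking">
    <path start="firstParallelAction"/>
    <path start="secondParallelAction"/>
</fork>

<!-- both parallel actions transition to the join on success -->
<join name="joining" to="nextAction"/>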


3.0. Oozie actions

Only two action types are covered in this blog; more in subsequent blogs on Oozie.

3.0.1. About the FS (hdfs) action

"The fs action allows to manipulate files and directories in HDFS from a workflow application. The supported commands are move , delete , mkdir , chmod , touchz and chgrp .
The FS commands are executed synchronously from within the FS action, the workflow job will wait until the specified file commands are completed before continuing to the next action.  Path names specified in the fs action can be parameterized (templatized) using EL expressions.  Each file path must specify the file system URI, for move operations, the target must not specified the system URI.

IMPORTANT: All the commands within fs action do not happen atomically, if a fs action fails half way in the commands being executed, successfully executed commands are not rolled back. The fs action, before executing any command must check that source paths exist and target paths don't exist (constraint regarding target relaxed for the move action. See below for details), thus failing before executing any command. Therefore the validity of all paths specified in one fs action are evaluated before any of the file operation are executed. Thus there is less chance of an error occurring while the fs action executes."
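
A hedged sketch of an fs action performing a delete and a move (paths and node names are illustrative; note that, per the rule above, the move target omits the file system URI while the other paths carry it via ${nameNode}):

<action name="hdfsCommands">
    <fs>
        <delete path="${nameNode}/user/${wf:user()}/oozieProject/data/staging"/>
        <move source="${nameNode}/user/${wf:user()}/oozieProject/data/incoming/report.txt"
              target="/user/${wf:user()}/oozieProject/data/processed/report.txt"/>
    </fs>
    <ok to="sendEmailSuccess"/>
    <error to="sendEmailFailure"/>
</action>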

3.0.2. About the email action


The email action allows emails to be sent from a workflow application. An email action must provide "to" addresses, a subject, and a body; "cc" addresses are optional. Multiple recipients can be provided as comma-separated addresses.  The email action is executed synchronously, and the workflow job will wait until the specified emails are sent before continuing to the next action.

Apache documentation:
http://oozie.apache.org/docs/3.3.0/DG_EmailActionExtension.html
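
A hedged sketch of an email action (the addresses are placeholders; note the action's own XML namespace):

<action name="sendEmailSuccess">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>${emailToAddress}</to>
        <subject>Status of workflow ${wf:id()}</subject>
        <body>The workflow ${wf:id()} completed successfully.</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>

For the action to actually send mail, the Oozie server needs SMTP settings in oozie-site.xml - oozie.email.smtp.host, oozie.email.smtp.port, and oozie.email.from.address.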



4.0. Building and executing an Oozie workflow with HDFS action and Email action

Pictorial overview

[Flow diagram of the sample workflow - not reproduced here.]

Sample program specifics
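
The original post embedded the sample application as gists; the essentials are reconstructed below as a minimal sketch. Directory names, hostnames, ports, and addresses are all placeholders, and the fs and email action bodies are the ones sketched in sections 3.0.1 and 3.0.2.

job.properties:

    nameNode=hdfs://localhost:8020
    jobTracker=localhost:8021
    queueName=default
    emailToAddress=admin@example.com
    oozie.wf.application.path=${nameNode}/user/${user.name}/oozieProject/workflowHdfsAndEmail

workflow.xml:

<workflow-app name="WorkflowWithHdfsAndEmailActions" xmlns="uri:oozie:workflow:0.1">
    <start to="hdfsCommands"/>

    <!-- fs action as sketched in section 3.0.1:
         ok -> sendEmailSuccess, error -> sendEmailFailure -->
    <action name="hdfsCommands"> ... </action>

    <!-- email actions as sketched in section 3.0.2; the failure variant
         reports ${wf:lastErrorNode()} in the body and routes to "fail" -->
    <action name="sendEmailSuccess"> ... </action>
    <action name="sendEmailFailure"> ... </action>

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Deploying and running (the Oozie server URL and paths are illustrative):

    # copy the workflow application directory to HDFS
    hadoop fs -put workflowHdfsAndEmail oozieProject/

    # submit and start the workflow job
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run

    # check on its status; the -run command prints the job id
    oozie job -oozie http://localhost:11000/oozie -info <job-id>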


Oozie web console

[Screenshot of the sample application's entry in the Oozie web console - not reproduced here.]

This concludes this blog.  Happy hadooping!  
