What's covered in this blog?
Apache Oozie documentation (version 3.3.0) on workflows, the hdfs action, and the email action, plus a sample application that moves files in hdfs (move and delete operations) and sends emails notifying the status of workflow execution. Sample data, commands, and output are also detailed.
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 11b: Oozie Web Service API for interfacing with oozie workflows
Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.
1.0. About Apache Oozie
1.0.1. What is Apache Oozie?
It is an extensible, scalable, and data-aware service to orchestrate Hadoop jobs, manage job dependencies, and execute jobs based on event triggers such as time and data availability.
There are three types of jobs in Oozie:
1. Oozie workflow jobs
DAGs of actions, which are jobs such as shell scripts, MapReduce, Sqoop, streaming, pipes, Pig, Hive, etc.
2. Oozie coordinator jobs
Invoke Oozie workflow jobs based on specified event triggers - date/time, data availability.
3. Oozie bundle jobs
Related Oozie coordinator jobs managed as a single job
- An Oozie bundle job can have one to many coordinator jobs
- An Oozie coordinator job can have one to many workflow jobs
- An Oozie workflow can have one to many actions
- An Oozie workflow can have zero to many sub-workflows
1.0.2. Glossary of Oozie terminology
(From the Apache Oozie documentation)
Action
An execution/computation task (a Map-Reduce job, a Pig job, a shell command). It can also be referred to as a task or an 'action node'.
Workflow
A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition
A programmatic description of a workflow that can be executed.
Workflow Definition Language
The language used to define a Workflow Definition.
Workflow Job
An executable instance of a workflow definition.
Workflow Engine
A system that executes workflow jobs. It can also be referred to as a DAG engine.
1.0.3. Oozie Architecture
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
1. Definitions of Oozie jobs - workflow/coordinator/bundle
2. Currently running workflow instances, including instance states and variables
Oozie works with HSQL, Derby, MySQL, Oracle, or PostgreSQL databases. By default, Oozie is configured to use embedded Derby. Oozie bundles the JDBC drivers for HSQL, embedded Derby, and PostgreSQL.
For information about the other kinds of configuration, such as user authentication and logging, refer to:
http://oozie.apache.org/docs/3.3.0/AG_Install.html#Oozie_Configuration
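As an illustration of what the database configuration looks like, the following oozie-site.xml fragment is a minimal sketch for pointing Oozie at a MySQL database instead of the embedded Derby default; the host name, database name, and credentials are placeholders, not values from this post.

<!-- Sketch: oozie-site.xml fragment switching Oozie from embedded Derby to MySQL.
     The host, database name, and credentials below are placeholders. -->
<property>
  <name>oozie.service.JPAService.jdbc.driver</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.url</name>
  <value>jdbc:mysql://db-host:3306/oozie</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.username</name>
  <value>oozie</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.password</name>
  <value>oozie-password</value>
</property>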
[Architecture diagram omitted here; it is from a Yahoo presentation deck on Oozie.]
2.0. Oozie Workflow
2.0.1 What is an Oozie workflow?
An Oozie workflow is a DAG of Hadoop computation/processing tasks (referred to as "actions") and flow "controls" that coordinate the tasks and manage dependencies between actions and their results.
2.0.1.1. Actions:
Oozie workflow actions start jobs on remote nodes; upon completion, the processes executing the jobs call back Oozie to notify it of completion, at which point Oozie starts the next action. Actions can be hadoop fs, ssh, map-reduce, hive, pig, sqoop, distcp, http, or email commands, or custom actions.
2.0.1.2. Controls:
Controls manage the execution path of actions and include start, fork, join, decision and end.
2.0.1.3. Parameterizing actions and decisions:
Actions and decisions can be parameterized with job properties, action output (e.g., Hadoop counters), and file information (file exists, file size, etc.). Formal parameters are expressed in the workflow definition as ${VAR} variables.
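As a small illustrative fragment (the property name dataInputDir is a made-up example), a path inside an action can mix job properties and Oozie EL functions:

<!-- Sketch: ${nameNode} and ${dataInputDir} are job properties supplied at submission
     time; wf:user() is an Oozie EL function that resolves to the submitting user. -->
<delete path="${nameNode}/user/${wf:user()}/${dataInputDir}"/>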
2.0.1.4. Workflow application:
A workflow application packages a workflow for execution, and is essentially a zip file containing everything needed to execute the actions within the workflow - the workflow definition (an XML file), JARs for Map/Reduce jobs, shell scripts for streaming Map/Reduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for the distributed cache, and other resource files.
2.0.1.5. Workflow definition:
A workflow definition is a DAG with control flow nodes and action nodes expressed in the XML based workflow definition language called hPDL (Hadoop Process Definition Language).
2.0.1.6. Workflow nodes:
In hPDL, nodes encompassing actions are called action nodes, nodes encompassing controls are called control flow nodes, and together they are referred to as workflow nodes.
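To make the hPDL terminology concrete, here is a minimal sketch of a workflow definition with one action node and the surrounding control flow nodes; the workflow name, node names, and path are illustrative assumptions only.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="createWorkDir"/>
  <!-- Action node: an fs action that creates a working directory (illustrative) -->
  <action name="createWorkDir">
    <fs>
      <mkdir path="${nameNode}/user/${wf:user()}/sample-wf/workdir"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <!-- Control flow nodes -->
  <kill name="fail">
    <message>Sample workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>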
2.0.2.1. Start control node
The start node is the entry point for a workflow job, it indicates the first workflow node the workflow job must transition to.
When a workflow is started, it automatically transitions to the node specified in the start .
A workflow definition must have one start node.
2.0.2.2. End control node
The end node is the end for a workflow job, it indicates that the workflow job has completed successfully. When a workflow job reaches the end it finishes successfully (SUCCEEDED). If one or more actions started by the workflow job are executing when the end node is reached, the actions will be killed. In this scenario the workflow job is still considered as successfully run. A workflow definition must have one end node.
2.0.2.3. Kill control node
The kill node allows a workflow job to kill itself. When a workflow job reaches the kill it finishes in error (KILLED). If one or more actions started by the workflow job are executing when the kill node is reached, the actions will be killed. A workflow definition may have zero or more kill nodes.
2.0.2.4. Decision node
A decision node enables a workflow to make a selection on the execution path to follow. The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicates-transition pairs plus a default transition. Predicates are evaluated in order or appearance until one of them evaluates to true and the corresponding transition is taken. If none of the predicates evaluates to true the default transition is taken. Predicates are JSP Expression Language (EL) expressions that resolve into a boolean value, true or false. The default element in the decision node indicates the transition to take if none of the predicates evaluates to true. All decision nodes must have a default element to avoid bringing the workflow into an error state if none of the predicates evaluates to true.
2.0.2.5. Fork/join control nodes
A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives to it. The fork and join nodes must be used in pairs. The join node assumes concurrent execution paths are children of the same fork node.
The FS commands are executed synchronously from within the FS action, the workflow job will wait until the specified file commands are completed before continuing to the next action. Path names specified in the fs action can be parameterized (templatized) using EL expressions. Each file path must specify the file system URI, for move operations, the target must not specified the system URI.
IMPORTANT: All the commands within fs action do not happen atomically, if a fs action fails half way in the commands being executed, successfully executed commands are not rolled back. The fs action, before executing any command must check that source paths exist and target paths don't exist (constraint regarding target relaxed for the move action. See below for details), thus failing before executing any command. Therefore the validity of all paths specified in one fs action are evaluated before any of the file operation are executed. Thus there is less chance of an error occurring while the fs action executes."
The email action allows sending emails in Oozie from a workflow application. An email action must provide to addresses, cc addresses (optional), a subject and a body . Multiple recipients of an email can be provided as comma separated addresses. The email action is executed synchronously, and the workflow job will wait until the specified emails are sent before continuing to the next action.
Apache documentation:
http://oozie.apache.org/docs/3.3.0/DG_EmailActionExtension.html
2.0.1.2. Controls:
Controls manage the execution path of actions and include start, fork, join, decision and end.
2.0.1.3. Parameterizing actions and decisions:
Actions and decisions can be parameterized with job properties, actions output (i.e. Hadoop counters) and file information (file exists, file size, etc). Formal parameters are expressed in the workflow definition as ${VAR} variables.
2.0.1.4. Workflow application:
A workflow application is an instance of a workflow, and is essentially a zip file containing everything needed to execute the actions within the workflows - the workflow definition (an XML file), JARs for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Hive/Pig/Sqoop scripts, files for distributed cache and other resource files.
2.0.1.5. Workflow definition:
A workflow definition is a DAG with control flow nodes and action nodes expressed in the XML based workflow definition language called hPDL (Hadoop Process Definition Language).
2.0.1.6. Workflow nodes:
Nodes encompassing actions in hPDL, are called action nodes, and nodes encompassing controls are called control flow nodes and together are referred to as workflow nodes.
2.0.2. Oozie control flow functionality
[Straight from the Apache Oozie documentation]
2.0.2.1. Start control node
The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to.
When a workflow is started, it automatically transitions to the node specified in the start node.
A workflow definition must have one start node.
2.0.2.2. End control node
The end node is the end of a workflow job; it indicates that the workflow job has completed successfully. When a workflow job reaches the end node, it finishes successfully (SUCCEEDED). If one or more actions started by the workflow job are still executing when the end node is reached, those actions will be killed. In this scenario the workflow job is still considered to have run successfully. A workflow definition must have one end node.
2.0.2.3. Kill control node
The kill node allows a workflow job to kill itself. When a workflow job reaches the kill node, it finishes in error (KILLED). If one or more actions started by the workflow job are still executing when the kill node is reached, those actions will be killed. A workflow definition may have zero or more kill nodes.
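A kill node is typically given a message that captures the failing node's error through Oozie EL functions; a minimal sketch:

<kill name="killJobAction">
  <message>Workflow failed, error message: ${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>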
2.0.2.4. Decision node
A decision node enables a workflow to make a selection on the execution path to follow. The behavior of a decision node can be seen as a switch-case statement.
A decision node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated in order of appearance until one of them evaluates to true, and the corresponding transition is taken. If none of the predicates evaluates to true, the default transition is taken. Predicates are JSP Expression Language (EL) expressions that resolve to a boolean value, true or false. The default element in the decision node indicates the transition to take if none of the predicates evaluates to true. All decision nodes must have a default element to avoid bringing the workflow into an error state if none of the predicates evaluates to true.
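A sketch of a decision node follows; the target node names and the path passed to fs:exists() are assumptions for illustration.

<decision name="outputExistsCheck">
  <switch>
    <!-- If the output directory already exists, skip straight to cleanup -->
    <case to="cleanupAction">${fs:exists('/user/hduser/sampleApp/output')}</case>
    <!-- Otherwise take the default transition and run the processing action -->
    <default to="processAction"/>
  </switch>
</decision>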
2.0.2.5. Fork/join control nodes
A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it. The fork and join nodes must be used in pairs. The join node assumes concurrent execution paths are children of the same fork node.
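A sketch of a fork that launches two hypothetical actions in parallel, and the join both must reach before the workflow continues:

<fork name="forkActions">
  <path start="actionA"/>
  <path start="actionB"/>
</fork>
<!-- actionA and actionB (not shown) both transition to joinActions on success -->
<join name="joinActions" to="nextAction"/>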
3.0. Oozie actions
Only two action types are covered in this blog. More in subsequent blogs on Oozie.
3.0.1. About the FS (hdfs) action
"The fs action allows manipulation of files and directories in HDFS from a workflow application. The supported commands are move, delete, mkdir, chmod, touchz, and chgrp. The FS commands are executed synchronously from within the FS action; the workflow job will wait until the specified file commands are completed before continuing to the next action. Path names specified in the fs action can be parameterized (templatized) using EL expressions. Each file path must specify the file system URI; for move operations, the target must not specify the file system URI.
IMPORTANT: The commands within an fs action do not execute atomically; if an fs action fails halfway through its commands, the commands that already executed successfully are not rolled back. Before executing any command, the fs action checks that source paths exist and target paths do not exist (the constraint on the target is relaxed for the move action; see the Apache documentation for details), failing up front if the check does not pass. The validity of all paths specified in one fs action is therefore evaluated before any of the file operations are executed, so there is less chance of an error occurring while the fs action executes."
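A sketch of an fs action exercising delete, mkdir, and move; the paths and transition targets are illustrative. Note that, per the documentation quoted above, the source paths carry the file system URI (${nameNode}) while the move target does not.

<action name="hdfsCommands">
  <fs>
    <delete path="${nameNode}/user/${wf:user()}/sampleApp/archive"/>
    <mkdir path="${nameNode}/user/${wf:user()}/sampleApp/archive"/>
    <move source="${nameNode}/user/${wf:user()}/sampleApp/staging"
          target="/user/${wf:user()}/sampleApp/archive/staging"/>
  </fs>
  <ok to="sendEmailSuccess"/>
  <error to="sendEmailFailure"/>
</action>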
3.0.2. About the email action
The email action allows sending emails from a workflow application in Oozie. An email action must provide "to" addresses, optional "cc" addresses, a subject, and a body. Multiple recipients can be provided as comma-separated addresses. The email action is executed synchronously, and the workflow job will wait until the specified emails are sent before continuing to the next action.
Apache documentation:
http://oozie.apache.org/docs/3.3.0/DG_EmailActionExtension.html
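A sketch of an email action using the extension schema documented above; the recipient addresses and transition targets are placeholders, and the SMTP server itself is configured in oozie-site.xml rather than in the workflow.

<action name="sendEmailSuccess">
  <email xmlns="uri:oozie:email-action:0.1">
    <to>admin@example.com,oncall@example.com</to>
    <cc>team@example.com</cc>
    <subject>Oozie workflow ${wf:id()} succeeded</subject>
    <body>The workflow ${wf:id()} completed successfully.</body>
  </email>
  <ok to="end"/>
  <error to="fail"/>
</action>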
4.0. Building and executing an Oozie workflow with HDFS action and Email action
Pictorial overview
Sample program specifics
Oozie web console
Screenshot of the entry for the sample application in the Oozie web console:
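As a rough illustration of how the two action types fit together in a single workflow definition for this kind of sample application, here is a minimal sketch; the workflow name, HDFS paths, and the ${emailToAddress} property are assumptions and do not come from the actual sample.

<workflow-app name="WorkflowWithHdfsAndEmailActions" xmlns="uri:oozie:workflow:0.1">
  <start to="hdfsCommands"/>
  <!-- Move data out of the staging area, removing any prior output (illustrative paths) -->
  <action name="hdfsCommands">
    <fs>
      <delete path="${nameNode}/user/${wf:user()}/sampleApp/output"/>
      <move source="${nameNode}/user/${wf:user()}/sampleApp/staging"
            target="/user/${wf:user()}/sampleApp/output"/>
    </fs>
    <ok to="sendEmailSuccess"/>
    <error to="sendEmailFailure"/>
  </action>
  <!-- Notify success -->
  <action name="sendEmailSuccess">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>${emailToAddress}</to>
      <subject>Status of workflow ${wf:id()}</subject>
      <body>The workflow ${wf:id()} completed successfully.</body>
    </email>
    <ok to="end"/>
    <error to="end"/>
  </action>
  <!-- Notify failure, then kill the workflow -->
  <action name="sendEmailFailure">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>${emailToAddress}</to>
      <subject>Status of workflow ${wf:id()}</subject>
      <body>The workflow ${wf:id()} failed at ${wf:lastErrorNode()}.</body>
    </email>
    <ok to="fail"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed, error message: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>

Such a workflow would typically be submitted with the standard oozie job command line, pointing at a job.properties file that supplies values for properties like nameNode and emailToAddress.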
This concludes this blog. Happy hadooping!