Monday, June 17, 2013

Apache Oozie - Part 4: Oozie workflow with java map-reduce action


What's covered in the blog?

1. Documentation on the Oozie map reduce action
2. A sample workflow that includes an Oozie map-reduce action to process syslog-generated log files. Instructions on loading sample data and running the workflow are provided, along with some notes based on lessons learned.

Versions covered:
Oozie 3.3.0; Map reduce new API

Related blogs:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another
Blog 13: Oozie workflow - SSH action

Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

About the Oozie MapReduce action
Excerpt from Apache Oozie documentation...

The map-reduce action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or streaming jobs.

A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry without cleanup of the job output directory would fail).

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.

The counters of the Hadoop job and the job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used from within decision nodes and other actions' configurations.
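For example, a downstream decision node could branch on those counters via Oozie's hadoop:counters() EL function. The sketch below is illustrative only - the decision node and its transitions are hypothetical and not part of this workflow; the counter group/name follow the example in the Oozie spec:

```xml
<decision name="checkOutput">
    <switch>
        <!-- proceed only if the map-reduce node read any bytes from the file system -->
        <case to="end">
            ${hadoop:counters("mapReduceAction")["FileSystemCounters"]["FILE_BYTES_READ"] gt 0}
        </case>
        <default to="killJob"/>
    </switch>
</decision>
```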

The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.

Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.

The configuration properties are loaded in the following order: streaming, job-xml and configuration, and later values override earlier values.
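As a hypothetical illustration of that precedence (the property name is a real Hadoop key; the file name and values are made up): if a bundled job-xml file sets the queue to default while the inline configuration sets it to research, the later-loaded inline value is the one the job runs with.

```xml
<!-- bundled my-job.xml (loaded earlier) -->
<property>
    <name>mapred.job.queue.name</name>
    <value>default</value>
</property>

<!-- inline <configuration> in the map-reduce action (loaded later - wins) -->
<property>
    <name>mapred.job.queue.name</name>
    <value>research</value>
</property>
```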

Streaming and inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.


Apache Oozie documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.2_Map-Reduce_Action


Components of a workflow with java map reduce action:



Sample workflow

Highlights

The sample workflow application runs a java map-reduce program that parses syslog-generated log files in HDFS and generates a report from them.

The following is a pictorial representation of the workflow.


Workflow application details

This gist includes the components of an Oozie workflow - scripts/code, sample data
and commands. Oozie actions covered: java map-reduce action; Oozie controls
covered: start, kill, end. The java program uses regex to parse the logs, and
also extracts the mapper's input directory path and includes it in the
emitted key.
Note: The reducer can be specified as a combiner as well.
Usecase
-------
Parse Syslog generated log files to generate reports;
Pictorial overview of job:
--------------------------
http://hadooped.blogspot.com/2013/06/apache-oozie-part-4-oozie-workflow-with.html
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Java MR - Mapper code: 04A-MapperJavaCode
Java MR - Reducer code: 04B-ReducerJavaCode
Java MR - Driver code: 04C-DriverJavaCode
Command to test Java MR program: 04D-CommandTestJavaMRProg
Oozie job properties file: 05-OozieJobProperties
Oozie workflow file: 06-OozieWorkflowXML
Oozie commands: 07-OozieJobExecutionCommands
Output - Report1: 08-OutputOfJavaProgram
Oozie web console - screenshots: 09-OozieWebConsoleScreenshots
01a. Sample data
-----------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
01b. Structure
---------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
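The six-group regex used later in the mapper can be checked against this structure in isolation. A minimal sketch - the class and method names here are illustrative, not from the original code; only the pattern itself is taken from the mapper:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyslogParseDemo {
    // Same six-group pattern the mapper uses: month, day, time, node, process, message
    static final Pattern LOG_PATTERN = Pattern.compile(
        "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)");

    // Returns the six captured fields, or null if the line does not match
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < m.groupCount(); i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal";
        String[] f = parse(line);
        System.out.println("Month=" + f[0] + " Day=" + f[1] + " Time=" + f[2]
                + " Node=" + f[3] + " Process=" + f[4]);
    }
}
```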
02a. Data download
-------------------
gitHub:
https://github.com/airawat/OozieSamples
Directory structure
-------------------
oozieProject
    data
        airawat-syslog
            <<Node-Name>>
                <<Year>>
                    <<Month>>
                        messages
    workflowJavaMapReduceAction
        workflow.xml
        job.properties
        lib
            LogEventCount.jar
03-Hdfs load commands
----------------------
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
// Source code for Mapper
//-----------------------------------------------------------
// LogEventCountMapper.java
//-----------------------------------------------------------
// Java program that parses logs using regex
// The program counts the number of processes logged by year.
// E.g. Key=2013-ntpd; Value=1;
package airawat.oozie.samples;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LogEventCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    String strLogEntryPattern = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)";
    static final int NUM_FIELDS = 6;
    Text strEvent = new Text("");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String strLogEntryLine = value.toString();
        Pattern objPtrn = Pattern.compile(strLogEntryPattern);
        Matcher objPatternMatcher = objPtrn.matcher(strLogEntryLine);
        if (!objPatternMatcher.matches() || NUM_FIELDS != objPatternMatcher.groupCount()) {
            System.err.println("Bad log entry (or problem with RE?):");
            System.err.println(strLogEntryLine);
            return;
        }
        /*
        System.out.println("Month_Name: " + objPatternMatcher.group(1));
        System.out.println("Day: " + objPatternMatcher.group(2));
        System.out.println("Time: " + objPatternMatcher.group(3));
        System.out.println("Node: " + objPatternMatcher.group(4));
        System.out.println("Process: " + objPatternMatcher.group(5));
        System.out.println("LogMessage: " + objPatternMatcher.group(6));
        */
        // TODO: Move the derivation of the filename (below) into an instance variable,
        // computed from the file split in the mapper's setup() method - so it is a
        // one-time task - instead of recomputing it for every record.
        String strFilePath = ((FileSplit) context.getInputSplit()).getPath().toString();
        // The path ends in .../<Year>/<Month>/messages, so the year is the four
        // characters starting 16 from the end.
        String strYear = strFilePath.substring(strFilePath.length() - 16, strFilePath.length() - 12);
        String strProcess = objPatternMatcher.group(5);
        strEvent.set(strYear + "-"
                + (strProcess.indexOf("[") == -1
                        ? strProcess.substring(0, strProcess.length() - 1)    // drop trailing ':'
                        : strProcess.substring(0, strProcess.indexOf("[")))); // drop "[pid]:"
        context.write(strEvent, new IntWritable(1));
    }
}
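The key-building expression in the mapper is dense; the same logic, extracted into a plain helper so it can be read and tested in isolation. This is a sketch - the class and method names are hypothetical, not in the original code - but it mirrors the mapper's substring arithmetic exactly:

```java
public class EventKeyHelper {
    /**
     * Builds "<year>-<processName>" from the input file path and the parsed process field.
     * The path is assumed to end in ".../<Year>/<Month>/messages", so the year occupies
     * the four characters starting 16 from the end (mirroring the mapper's substring math).
     */
    public static String deriveEventKey(String filePath, String process) {
        String year = filePath.substring(filePath.length() - 16, filePath.length() - 12);
        int bracket = process.indexOf('[');
        String name = (bracket == -1)
                ? process.substring(0, process.length() - 1)   // drop the trailing ':'
                : process.substring(0, bracket);               // drop "[pid]:"
        return year + "-" + name;
    }

    public static void main(String[] args) {
        System.out.println(deriveEventKey(
                "oozieProject/data/airawat-syslog/cdh-dn03/2013/05/messages",
                "ntpd_initres[1705]:"));  // 2013-ntpd_initres
    }
}
```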
// Source code for reducer
//--------------------------
// LogEventCountReducer.java
//--------------------------
package airawat.oozie.samples;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogEventCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int intEventCount = 0;
        for (IntWritable value : values) {
            intEventCount += value.get();
        }
        context.write(key, new IntWritable(intEventCount));
    }
}
// Source code for driver
//--------------------------
// LogEventCount.java
//--------------------------
package airawat.oozie.samples;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class LogEventCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf(
                "Usage: Airawat.Oozie.Samples.LogEventCount <input dir> <output dir>\n");
            System.exit(-1);
        }
        // Instantiate a Job object for your job's configuration.
        Job job = new Job();

        // Job jar file
        job.setJarByClass(LogEventCount.class);

        // Job name
        job.setJobName("Syslog Event Rollup");

        // Paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper and reducer classes
        job.setMapperClass(LogEventCountMapper.class);
        job.setReducerClass(LogEventCountReducer.class);

        // Job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Number of reduce tasks
        job.setNumReduceTasks(3);

        // Start the MapReduce job, wait for it to finish.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
Commands to test the java program in isolation
-----------------------------------------------
a) Command to run the program
$ hadoop jar oozieProject/workflowJavaMapReduceAction/lib/LogEventCount.jar airawat.oozie.samples.LogEventCount "oozieProject/data/*/*/*/*/*" "oozieProject/workflowJavaMapReduceAction/myCLIOutput"
b) Command to view results
$ hadoop fs -cat oozieProject/workflowJavaMapReduceAction/myCLIOutput/part* | sort
c) Results
2013-console-kit-daemon 7
2013-gnome-session 11
2013-init 166
2013-kernel 810
2013-login 2
2013-NetworkManager 7
2013-nm-dispatcher.action 4
2013-ntpd_initres 4133
2013-polkit-agent-helper-1 8
2013-pulseaudio 18
2013-spice-vdagent 15
2013-sshd 6
2013-sudo 8
2013-udevd 6
#*****************************
# job.properties
#*****************************
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}/oozieProject
appPath=${oozieProjectRoot}/workflowJavaMapReduceAction
oozie.wf.application.path=${appPath}
inputDir=${oozieProjectRoot}/data/*/*/*/*/*
outputDir=${appPath}/output
<!--*************************************************-->
<!--*******06-workflow.xml***************************-->
<!--*************************************************-->
<workflow-app name="WorkFlowJavaMapReduceAction" xmlns="uri:oozie:workflow:0.1">
    <start to="mapReduceAction"/>
    <action name="mapReduceAction">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.map.class</name>
                    <value>airawat.oozie.samples.LogEventCountMapper</value>
                </property>
                <property>
                    <name>mapreduce.reduce.class</name>
                    <value>airawat.oozie.samples.LogEventCountReducer</value>
                </property>
                <property>
                    <name>mapred.mapoutput.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.mapoutput.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
                <property>
                    <name>mapreduce.job.acl-view-job</name>
                    <value>*</value>
                </property>
                <property>
                    <name>oozie.launcher.mapreduce.job.acl-view-job</name>
                    <value>*</value>
                </property>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>false</value>
                </property>
                <property>
                    <name>oozie.libpath</name>
                    <value>${appPath}/lib</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="killJob"/>
    </action>
    <kill name="killJob">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>
07. Oozie commands
-------------------
Note: Replace the Oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowJavaMapReduceAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000014-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000014-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000014-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000014-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowJavaMapReduceAction/job.properties -rerun 0000014-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000014-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000014-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
08-Workflow application output
-------------------------------
$ hadoop fs -ls -R oozieProject/workflowJavaMapReduceAction/output/part* | awk '{print $8}'
oozieProject/workflowJavaMapReduceAction/output/part-r-00000
oozieProject/workflowJavaMapReduceAction/output/part-r-00001
$ hadoop fs -cat oozieProject/workflowJavaMapReduceAction/output/part*
2013-init 166
2013-polkit-agent-helper-1 8
2013-spice-vdagent 15
2013-sshd 6
2013-udevd 6
2013-NetworkManager 7
2013-console-kit-daemon 7
2013-gnome-session 11
2013-kernel 810
2013-login 2
2013-nm-dispatcher.action 4
2013-ntpd_initres 4133
2013-pulseaudio 18
2013-sudo 8
09 - Oozie web console - screenshots
-------------------------------------
Available at-
http://hadooped.blogspot.com/2013/06/apache-oozie-part-4-oozie-workflow-with.html

Oozie web GUI - screenshots

http://YourOozieServer:TypicallyPort11000/oozie/






Do share, if you have any additional insights that can be added to the blog.

References


Map reduce cookbook
https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

How to use a sharelib in Oozie
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/

Everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie

Oozie workflow use cases

https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases 





