Wednesday, July 3, 2013

Running native mapreduce jobs inside Pig

There may be situations where you need to reuse Java MapReduce programs within a Pig program. This post includes a sample Pig script, with the associated jar and sample data. The input is Syslog-generated log files, and the output is a count of the occurrences of each logged process, inception to date.

Apache Pig documentation:
http://pig.apache.org/docs/r0.10.0/basic.html#mapreduce

My first blog post on log parsing in Hadoop (link) covers the Java code. This post uses the jar from that post in a Pig script.

Details on running a native MapReduce job in Pig scripts:
This gist includes a Pig Latin script that parses Syslog-generated log files through a
Java MapReduce program that uses regex.
Use case: Count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the Java code - https://gist.github.com/airawat/5915374
Pig version: 0.10.0
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Output: 06-Output
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
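
The field breakdown above can be sketched as a single regular expression in plain Java. This is only an illustration: the pattern, the class name SyslogLineParser and the parse method are assumptions made for this example, not the regex used by the actual MapReduce job (that code is in the related gist).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyslogLineParser {
    // Hypothetical pattern, not taken from the original job.
    // Groups: 1=month, 2=day, 3=time, 4=node, 5=process (with trailing colon), 6=log message
    private static final Pattern SYSLOG = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s+(\\S+:)\\s+(.*)$");

    /** Returns {month, day, time, node, process, message}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[6];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        // First line of the sample data shown earlier
        String[] f = parse("May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal");
        System.out.println(String.join(" | ", f));
    }
}
```

Returning null for non-matching lines keeps the sketch short; a real mapper would more likely skip or count such lines.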
Data download
-------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/DMQVIwBUQOo
Directory structure
-------------------
LogParserSamplePigMR
    Data
        airawat-syslog
            2013
                04
                    messages
                05
                    messages
    lib
        LogEventCount.jar
    SysLog-PigMR-Report.pig
Commands to load to HDFS [03-HdfsLoadCommands]
----------------------------------------------
$ hadoop fs -put LogParserSamplePigMR
$ hadoop fs -ls -R LogParserSamplePigMR | awk '{print $8}'
LogParserSamplePigMR/Data
LogParserSamplePigMR/Data/airawat-syslog
LogParserSamplePigMR/Data/airawat-syslog/2013
LogParserSamplePigMR/Data/airawat-syslog/2013/04
LogParserSamplePigMR/Data/airawat-syslog/2013/04/messages
LogParserSamplePigMR/Data/airawat-syslog/2013/05
LogParserSamplePigMR/Data/airawat-syslog/2013/05/messages
LogParserSamplePigMR/SysLog-PigMR-Report.pig
LogParserSamplePigMR/lib
LogParserSamplePigMR/lib/LogEventCount.jar
ParserSamplePigMR/reportDir/_logs/history/job_201306261042_0054_1372873417824_akhanolk_PigLatin%3ASysLog-PigMR-Report.pig
LogParserSamplePigMR/reportDir/part-m-00000
/*----------------------------------------*/
/*PigLatinScript - SysLog-PigMR-Report.pig*/
/*----------------------------------------*/
rmf LogParserSamplePigMR/outputDir
rmf LogParserSamplePigMR/inputDir
rmf LogParserSamplePigMR/reportDir
raw_log_DS =
    LOAD 'LogParserSamplePigMR/Data/airawat-syslog/*/*/*' AS line;
report_DS =
    MAPREDUCE 'lib/LogEventCount.jar'
    STORE raw_log_DS INTO 'LogParserSamplePigMR/inputDir'
    LOAD 'LogParserSamplePigMR/outputDir' AS (process:chararray, count:int)
    `Airawat.Oozie.Samples.LogEventCount LogParserSamplePigMR/inputDir LogParserSamplePigMR/outputDir`;
store report_DS INTO 'LogParserSamplePigMR/reportDir';
Command to run the pig script
------------------------------
These should be run after the data, scripts and jars are loaded to HDFS - covered in section 03-HdfsLoadCommands
$ cd LogParserSamplePigMR
$ pig SysLog-PigMR-Report.pig
Command to view output
-----------------------
$ hadoop fs -cat LogParserSamplePigMR/reportDir/part*
Output
-------
init: 23
kernel: 58
ntpd_initres[1705]: 792
sudo: 2
udevd[361]: 1
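
To sanity-check output of this shape on a small sample without a cluster, the same kind of tally can be sketched in plain Java. This is a rough local stand-in, not the LogEventCount job itself: the class name LocalProcessCount, the count method and the regex are assumptions made for this example. It mirrors the usual map step (emit the process token with a count of 1) and reduce step (sum per key) in a single loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalProcessCount {
    // Hypothetical pattern: skips month, day, time and node, then captures the process token.
    private static final Pattern PROCESS = Pattern.compile(
        "^\\w{3}\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2}\\s+\\S+\\s+(\\S+:)\\s");

    /** Tallies lines per process token; lines that do not match the pattern are skipped. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String line : lines) {
            Matcher m = PROCESS.matcher(line);
            if (m.find()) {
                // "map" emits (process, 1); merge plays the role of the "reduce" sum
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three lines taken from the sample data section
        List<String> sample = Arrays.asList(
            "May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
            "May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
            "May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns");
        count(sample).forEach((process, n) -> System.out.println(process + "\t" + n));
    }
}
```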