Tuesday, July 2, 2013

Log parsing in Hadoop - Part 3: Pig Latin

This post includes a sample script, data, and commands to parse a log file in Pig Latin using regular expressions.


Related blogs:

Log parsing in Hadoop - Part 1: Java - using regex
Log parsing in Hadoop - Part 2: Hive - using regex
Log parsing in Hadoop - Part 3: Pig - using regex
Log parsing in Hadoop - Part 4: Python - using regex



This gist includes a Pig Latin script to parse syslog-generated log files using regex.
Use case: count the number of occurrences of logged processes, by month and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Output: 06-Output
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
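The regex used in the Pig script further down can be checked against this sample line in a quick standalone sketch; here it is in Python (the same pattern, with Pig's doubled backslashes written as a raw string):

```python
import re

# The pattern the Pig script passes to REGEX_EXTRACT_ALL,
# as a Python raw string (no doubled backslashes needed).
SYSLOG_PATTERN = re.compile(
    r'(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*$)'
)

line = ("May 3 11:52:54 cdh-dn03 init: "
        "tty (/dev/tty6) main process (1208) killed by TERM signal")

month, day, time, host, process, log = SYSLOG_PATTERN.match(line).groups()
print(month, day, time, host, process, sep=' | ')
# → May | 3 | 11:52:54 | cdh-dn03 | init:
```

Note the lazy `(.*?:)` group: it stops at the first colon after the host, so the process name is captured even when the log message itself contains colons.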
Data and Script download
------------------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/Wix8ZznQGJU
Directory structure
-------------------
LogParserSamplePig
    Data
        airawat-syslog
            2013
                04
                    messages
                05
                    messages
    SysLog-Pig-Report.pig
HDFS load commands
-------------------
$ hadoop fs -put LogParserSamplePig/ LogParserSamplePig
Validate load
-------------
$ hadoop fs -ls -R LogParserSamplePig | awk '{print $8}'
Expected directory structure
-----------------------------
LogParserSamplePig/Data
LogParserSamplePig/Data/airawat-syslog
LogParserSamplePig/Data/airawat-syslog/2013
LogParserSamplePig/Data/airawat-syslog/2013/04
LogParserSamplePig/Data/airawat-syslog/2013/04/messages
LogParserSamplePig/Data/airawat-syslog/2013/05
LogParserSamplePig/Data/airawat-syslog/2013/05/messages
LogParserSamplePig/SysLog-Pig-Report.pig
-- Pig Latin script - SysLog-Pig-Report.pig

-- Remove output of any previous run
rmf LogParserSamplePig/output

-- Load the logs as a relation of single-field tuples
raw_log_DS =
    LOAD 'LogParserSamplePig/Data/airawat-syslog/*/*/*' AS (line: chararray);

-- Parse each line into a tuple with named, typed fields
parsed_log_DS =
    FOREACH raw_log_DS
    GENERATE
        FLATTEN(
            REGEX_EXTRACT_ALL(
                line,
                '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
            )
        )
        AS (
            month_name: chararray,
            day:        chararray,
            time:       chararray,
            host:       chararray,
            process:    chararray,
            log:        chararray
        );

-- Project just the columns needed for the report
report_draft_DS =
    FOREACH parsed_log_DS GENERATE month_name, process;

-- Group by month and process
grouped_report_DS =
    GROUP report_draft_DS BY (month_name, process);

-- Compute the count per group
aggregate_report_DS =
    FOREACH grouped_report_DS
    GENERATE group.month_name, group.process,
             COUNT(report_draft_DS) AS frequency;

-- Sort and store the results
sorted_DS =
    ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'LogParserSamplePig/output/SortedResults';
Execute the pig script on the cluster
--------------------------------------
$ pig SysLog-Pig-Report.pig
View output
-----------
$ hadoop fs -cat LogParserSamplePig/output/SortedResults/part*
Output
-------
Apr sudo: 1
May init: 23
May kernel: 58
May ntpd_initres[1705]: 792
May sudo: 1
May udevd[361]: 1
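As a sanity check, the month/process rollup the script performs can be mimicked in a few lines of plain Python over some of the sample records; this is only an illustration of the same grouping logic, not part of the Pig workflow:

```python
import re
from collections import Counter

# Same pattern as the Pig script, as a Python raw string
PATTERN = re.compile(
    r'(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*$)')

sample_lines = [
    "May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
    "May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray",
    "May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org",
]

# Count (month, process) pairs, like GROUP ... BY + COUNT in the script
freq = Counter()
for raw in sample_lines:
    m = PATTERN.match(raw)
    if m:  # skip lines the pattern does not match
        month, _day, _time, _host, process, _log = m.groups()
        freq[(month, process)] += 1

for (month, process), count in sorted(freq.items()):
    print(month, process, count)
```

Run over the full dataset, the same tallies produce the report above.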
