Tuesday, December 31, 2013

Log parsing in Hadoop - Part 5: Cascading

1.0. What's in this post?

This post is part of a series on log parsing with Java MapReduce, Pig, Hive, Python... This one covers a simple log parser in Cascading, and includes a sample program, data, and commands.

Documentation on Cascading:
http://www.cascading.org/documentation/

Other related blogs:
Log parsing in Hadoop - Part 1: Java
Log parsing in Hadoop - Part 2: Hive
Log parsing in Hadoop - Part 3: Pig
Log parsing in Hadoop - Part 4: Python
Log parsing in Hadoop - Part 5: Cascading
Log parsing in Hadoop - Part 6: Morphlines


2.0. Sample program


2.0.1. What the program does
a) It reads syslog-generated logs stored in HDFS
b) Parses them with a regex
c) Writes successfully parsed records to files in HDFS
d) Writes records that don't match the pattern to HDFS
e) Writes a report to HDFS that contains the count of distinct processes logged
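
The steps above can be sketched in plain Java, without Cascading or HDFS, to make the record flow concrete. This is only an illustration under stated assumptions: the class name SyslogParseSketch, the regex, and the sample lines in main are all hypothetical and are not taken from the program in this post. The sketch also reads step (e) as "number of records per process", which is one possible interpretation of the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the parse / reject / count flow described above.
// In-memory lists take the place of the HDFS sinks.
public class SyslogParseSketch {

    // Assumed line shape: "Mon DD HH:MM:SS host process[pid]: message".
    // This pattern is an assumption for illustration, not the post's regex.
    static final Pattern SYSLOG = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s+"
        + "([^\\s\\[:]+)(?:\\[(\\d+)\\])?:\\s+(.*)$");

    // Step (c): successfully parsed records (month, day, time, host, process, message).
    final List<String[]> parsed = new ArrayList<>();
    // Step (d): lines that did not match the pattern.
    final List<String> rejected = new ArrayList<>();
    // Step (e): number of parsed records per process name, sorted by name.
    final Map<String, Integer> processCounts = new TreeMap<>();

    // Step (b): try the regex against one line and route the result.
    void accept(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            rejected.add(line);
            return;
        }
        String process = m.group(5);
        parsed.add(new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4), process, m.group(7)
        });
        processCounts.merge(process, 1, Integer::sum);
    }

    public static void main(String[] args) {
        // Invented sample lines, used only to exercise the sketch.
        String[] lines = {
            "Apr 12 09:01:02 host1 sshd[2211]: Accepted password",
            "Apr 12 09:01:05 host1 sshd[2212]: Connection closed",
            "May  3 11:52:54 host2 init: tty main process killed by TERM signal",
            "not a syslog line"
        };
        SyslogParseSketch sketch = new SyslogParseSketch();
        for (String line : lines) {
            sketch.accept(line);
        }
        System.out.println("parsed=" + sketch.parsed.size()
            + " rejected=" + sketch.rejected.size()
            + " counts=" + sketch.processCounts);
    }
}
```

In a Cascading flow the same three outputs would typically be separate sink taps, with the non-matching records routed to their own destination rather than dropped.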

2.0.2. Sample log data


2.0.3. Directory structure of log files


2.0.4. Log parser in Cascading


2.0.5. build.gradle file
Gradle documentation is available at http://www.gradle.org
Here is the build.gradle...

2.0.6. Data and code download 




2.0.7. Commands (load data, execute program)


2.0.8. Results