1.0. What's in this post?
This post is part of a series on log parsing in Java MapReduce, Pig, Hive, Python... This one covers a simple log parser in Cascading, and includes a sample program, data and commands.
Documentation on Cascading:
Log parsing in Hadoop - Part 1: Java
Log parsing in Hadoop - Part 2: Hive
Log parsing in Hadoop - Part 3: Pig
Log parsing in Hadoop - Part 4: Python
Log parsing in Hadoop - Part 5: Cascading
Log parsing in Hadoop - Part 6: Morphlines
2.0. Sample program
2.0.1. What the program does
a) Reads syslog-generated logs stored in HDFS
b) Parses each line with a regular expression
c) Writes successfully parsed records to files in HDFS
d) Writes records that don't match the pattern to a separate location in HDFS
e) Writes a report to HDFS that contains the count of distinct processes logged
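Before getting to the Cascading version, it may help to see the same logic in plain Java with no Hadoop or Cascading involved. The sketch below is only an illustration of steps (b) through (e) on an in-memory list of lines. The class name, the regular expression and the sample lines in main are my own assumptions for this sketch, not taken from the actual program. The regex assumes the common "Mon D HH:MM:SS host process[pid]: message" syslog layout, and the report is interpreted here as a count of records per process name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of Cascading).
class SyslogParseSketch {

    // Assumed syslog layout: "Mon D HH:MM:SS host process[pid]: message".
    // Groups: 1=month, 2=day, 3=time, 4=host, 5=process, 6=pid (optional), 7=message.
    static final Pattern SYSLOG = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s+"
        + "([^\\s:\\[]+)(?:\\[(\\d+)\\])?:\\s+(.*)$");

    // Holds the three outputs described in steps (c), (d) and (e).
    static final class Result {
        final List<String[]> parsed = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
        final Map<String, Integer> processCounts = new TreeMap<>();
    }

    static Result parse(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            Matcher m = SYSLOG.matcher(line);
            if (m.matches()) {
                String process = m.group(5);
                // Step (c): keep the parsed fields (pid is dropped here).
                result.parsed.add(new String[] {
                    m.group(1), m.group(2), m.group(3), m.group(4), process, m.group(7)});
                // Step (e): count records per process name.
                result.processCounts.merge(process, 1, Integer::sum);
            } else {
                // Step (d): lines that don't match go to the reject list.
                result.rejected.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the post's actual data.
        List<String> lines = Arrays.asList(
            "May  3 11:52:54 host01 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
            "May  3 11:53:31 host01 sshd[1234]: Accepted password for user",
            "this line does not look like syslog");
        Result r = parse(lines);
        System.out.println("parsed=" + r.parsed.size()
            + " rejected=" + r.rejected.size()
            + " report=" + r.processCounts);
    }
}
```

In the Cascading program, each of these three collections corresponds to a separate output in HDFS rather than an in-memory list or map.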
2.0.2. Sample log data
2.0.3. Directory structure of log files
2.0.4. Log parser in Cascading
2.0.5. build.gradle file
Gradle documentation is available at http://www.gradle.org
Here is the build.gradle...
2.0.6. Data and code download
2.0.7. Commands (load data, execute program)