Tuesday, December 31, 2013

Log parsing in Hadoop - Part 5: Cascading

1.0. What's in this post?

This post is part of a series focused on log parsing in Java MapReduce, Pig, Hive, Python, and more. This one covers a simple log parser in Cascading, and includes a sample program, data, and commands.

Documentation on Cascading:
http://www.cascading.org/documentation/

Other related blogs:
Log parsing in Hadoop - Part 1: Java
Log parsing in Hadoop - Part 2: Hive
Log parsing in Hadoop - Part 3: Pig
Log parsing in Hadoop - Part 4: Python
Log parsing in Hadoop - Part 5: Cascading
Log parsing in Hadoop - Part 6: Morphlines


2.0. Sample program


2.0.1. What the program does
a) It reads syslog-generated logs stored in HDFS
b) Parses them with a regex
c) Writes successfully parsed records to files in HDFS
d) Writes records that don't match the pattern to HDFS
e) Writes a report to HDFS that contains the count of distinct processes logged.
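
As a rough sketch of how those steps map onto Cascading primitives (illustrative only, not the exact program from section 2.0.4 - the regex, field names, and paths are placeholders):

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SyslogParserSketch {
  public static void main(String[] args) {
    String inputPath = args[0];   // syslog files in HDFS
    String parsedPath = args[1];  // successfully parsed records
    String trapPath = args[2];    // records that do not match the pattern
    String reportPath = args[3];  // count of records per process

    // Placeholder regex: month, day, time, host, process, message
    Fields logFields = new Fields("month", "day", "time", "host", "process", "message");
    String regex = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\S+)\\s+(\\S+?)(?:\\[\\d+\\])?:\\s+(.*)";

    Hfs sourceTap = new Hfs(new TextLine(new Fields("line")), inputPath);
    Hfs parsedTap = new Hfs(new TextDelimited(logFields, "\t"), parsedPath, SinkMode.REPLACE);
    Hfs trapTap   = new Hfs(new TextLine(), trapPath, SinkMode.REPLACE);
    Hfs reportTap = new Hfs(new TextDelimited(new Fields("process", "count"), "\t"), reportPath, SinkMode.REPLACE);

    // Parse each line; tuples that fail the regex are routed to the trap tap
    Pipe parsePipe = new Each("parse", new Fields("line"), new RegexParser(logFields, regex));

    // One branch sinks the parsed records as-is
    Pipe parsedBranch = new Pipe("parsedRecords", parsePipe);

    // A second branch groups by process and counts occurrences for the report
    Pipe reportPipe = new GroupBy("report", parsePipe, new Fields("process"));
    reportPipe = new Every(reportPipe, new Count(new Fields("count")));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("syslog-parser")
        .addSource(parsePipe, sourceTap)
        .addTailSink(parsedBranch, parsedTap)
        .addTailSink(reportPipe, reportTap)
        .addTrap(parsePipe, trapTap);

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, SyslogParserSketch.class);
    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}

The trap tap is what gives us requirement (d) for free - any tuple the RegexParser cannot handle is diverted to the trap location instead of failing the flow.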

2.0.2. Sample log data


2.0.3. Directory structure of log files


2.0.4. Log parser in Cascading


2.0.5. build.gradle file
Gradle documentation is available at: http://www.gradle.org
Here is the build.gradle...

2.0.6. Data and code download 




2.0.7. Commands (load data, execute program)


2.0.8. Results 






Wednesday, November 13, 2013

UDFs Part 2: Custom GenericUDF in Hive (NVL2)


1.0. What's in this blog?


In my previous blog on creating custom UDFs in Hive, I covered a sample basic UDF.  This blog covers generic UDF creation, to mimic the same NVL2 functionality covered in the previous blog.  It includes sample data, Java code for creating the UDF, expected results, commands to execute, and the output.
[Hive 0.10]

About UDFs:
UDF stands for User Defined Function.  In Hive, there are (a) reusable functions available as part of core Hive (out of the box) that can be used in Hive queries; these are called UDFs even though they are not user-defined.  And then there are (b) functions that one can create in Java, also called UDFs, and use in Hive queries.  The focus of this blog is custom UDFs (b), specifically generic UDFs.

About generic UDF:
UDFs in Hive are extensions of either the UDF or the GenericUDF class.  GenericUDFs perform better because they use short-circuit and lazy evaluation, compared to UDFs, which rely on reflection.  GenericUDFs also support non-primitive Hive types such as arrays, structs, and maps in addition to primitive types, unlike UDFs, which support only primitive types.

About NVL2:
NVL2 takes three parameters, which we will refer to as expr1, expr2, and expr3.
NVL2 lets you determine the value returned by a query based on whether a specified expression is null or not null. If expr1 is not null, then NVL2 returns expr2. If expr1 is null, then NVL2 returns expr3.
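
For a feel of the GenericUDF API, here is a minimal NVL2 sketch (illustrative only - the class name and details may differ from the code in section 2.0); it leans on Hive's ReturnObjectInspectorResolver so that expr2 and expr3 can be of any compatible type:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class GenericUDFNvl2 extends GenericUDF {

  private transient ObjectInspector[] argumentOIs;
  private transient GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 3) {
      throw new UDFArgumentLengthException("nvl2() takes exactly three arguments");
    }
    argumentOIs = arguments;
    // Resolve a common return type for expr2 and expr3
    returnOIResolver = new GenericUDFUtils.ReturnObjectInspectorResolver(true);
    if (!(returnOIResolver.update(arguments[1]) && returnOIResolver.update(arguments[2]))) {
      throw new UDFArgumentTypeException(2, "expr2 and expr3 must be of compatible types");
    }
    return returnOIResolver.get();
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // NVL2 semantics: expr1 not null -> expr2, else expr3
    if (arguments[0].get() != null) {
      return returnOIResolver.convertIfNecessary(arguments[1].get(), argumentOIs[1]);
    }
    return returnOIResolver.convertIfNecessary(arguments[2].get(), argumentOIs[2]);
  }

  @Override
  public String getDisplayString(String[] children) {
    return "nvl2(" + children[0] + ", " + children[1] + ", " + children[2] + ")";
  }
}

Once compiled into a jar, it is wired up the same way as any other custom UDF - add the jar in the Hive shell and create a temporary function pointing at the class.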


2.0. NVL2 generic UDF in Hive


1: Create the test data file for a Hive external table

2: Create the Hive table

3: Create the UDF in Java


4: Expected results

5: Try out the UDF



3.0. Making the UDF permanently available when you launch the hive shell

There are several ways to make a custom UDF available when you launch the Hive shell, bypassing the need to execute the "add jar..." statement before using a custom UDF.  I have listed a couple of them.

Option 1:
From "Programming Hive"

Your function may also be added permanently to Hive, however this requires a small modification to a Hive Java file and then rebuilding Hive.
Inside the Hive source code, a one-line change is required to the FunctionRegistry class found at ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java. Then you rebuild Hive following the instructions that come with the source distribution.
While it is recommended that you redeploy the entire new build, only the hive-exec-*.jar, where * is the version number, needs to be replaced.
Option 2:
Add it to the .hiverc file on each node from where hive queries will be run.


4.0. References

Apache documentation:
http://hive.apache.org/docs/r0.10.0/api/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.html
https://cwiki.apache.org/confluence/display/Hive/OperatorsAndFunctions

A good article on creating a UDF that involves non-primitive types - link

Programming Hive - from O'Reilly

That's it for this blog.  Do share any additional insights with me.
Cheers!


Monday, November 4, 2013

UDFs Part 1: Custom simple eval UDFs in Pig and Hive (NVL2)


1.0. What's in this blog?


A demonstration of creating a custom simple eval UDF, in Pig and Hive, to mimic NVL2 functionality from the DBMS world.  It includes sample data, Java code for creating the UDF, expected results, commands to execute, and the output.

About NVL2:
NVL2 takes three parameters, which we will refer to as expr1, expr2, and expr3.
NVL2 lets you determine the value returned by a query based on whether a specified expression is null or not null. If expr1 is not null, then NVL2 returns expr2. If expr1 is null, then NVL2 returns expr3.
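
As a rough sketch of the Hive side, a simple NVL2 eval UDF is just a class extending UDF with an evaluate() method resolved by reflection (the class and parameter names below are illustrative, and it is restricted to string arguments):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Nvl2 extends UDF {
  // NVL2 semantics: expr1 not null -> expr2, else expr3
  public Text evaluate(Text expr1, Text expr2, Text expr3) {
    return (expr1 != null) ? expr2 : expr3;
  }
}

After packaging it into a jar, add the jar in the Hive shell and create a temporary function pointing at the class before using it in a query.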

2.0. NVL2 UDF in Hive


1: Create the test data file for a Hive external table

2: Create the Hive table

3: Create the UDF in Java

4: Expected results

5: Test the UDF

3.0. NVL2 UDF in Pig


We will reuse data from section 2.
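Steps 1-4 below walk through the details; as a rough reference, an NVL2 eval UDF in Pig is a small Java class extending EvalFunc (illustrative sketch, string-typed):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Nvl2 extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Expects three fields: expr1, expr2, expr3
    if (input == null || input.size() != 3) {
      return null;
    }
    // NVL2 semantics: expr1 not null -> expr2, else expr3
    return (input.get(0) != null) ? (String) input.get(1) : (String) input.get(2);
  }
}

Register the jar in the Pig script and invoke the function by its fully qualified class name, or define an alias for it.
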
1: Create the UDF in Java

2: Create the pig script

3: Test the UDF
[Modify the path of the data file in the Pig script between local and HDFS locations as needed - better yet, make it a parameter]

4: Results



Do share any additional insights/comments.
Cheers!


Wednesday, October 30, 2013

Apache Oozie - Part 13: Oozie SSH Action


1.0. What's covered in the blog?

1. Documentation on the Oozie SSH action
2. A sample Oozie workflow application that demonstrates the SSH action - it SSHes to a specific node as a specified user, and executes a local shell script that loads a local file to HDFS.

It was tricky getting this action working - and the solution is not something covered in the Apache documentation.  Issues and resolution are documented below.  

Version:
Oozie 3.3.0

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another
Blog 13: Oozie workflow - SSH action


2.0. Documentation on the Oozie SSH Action


Apache documentation is available at - http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.5_Ssh_Action

Note: The SSH action was slated for eventual removal, but it was later decided that it would remain.
So, disregard any mention of deprecation.


3.0. Sample workflow application


3.0.1. Highlights:
Oozie server is running on node cdh-dev01 in my environment.
With the sample workflow application, I am going to submit an Oozie job while logged in as myself (akhanolk), on this machine (Oozie server - cdh-dev01) from the CLI.
The workflow executes a shell script on cdh-dn01 as user akhanolk.  The shell script loads a local file to HDFS.  If the file load completes successfully, the workflow sends an email to me.

3.0.2. Pictorial overview:

3.0.3. SSH setup:
1. Passphrase-less SSH for akhanolk from cdh-dev01 (Oozie server) to cdh-dn01 (remote node) and vice versa
2.  Passphrase-less SSH for the Oozie user ID (oozie in my case) on cdh-dev01 to cdh-dn01, as akhanolk
[Running ps -ef | grep oozie on the Oozie server will give you the configured Oozie user ID]

3.0.4. Workflow application components:
workflow definition (workflow.xml - in HDFS)
job properties file (job.properties from node submitting job)
Shell script (uploadFile.sh) on remote node (cdh-dn01; At /home/akhanolk/scripts)
Data file (employees_data) on remote node (cdh-dn01; At /home/akhanolk/data)

3.0.5. Desired result:
Upon execution of the workflow, the employees_data file on cdh-dn01 should get moved to a specified directory in HDFS.

3.0.6. Subsequent sections cover-

  1. Data and script download
  2. Oozie job properties file        
  3. Oozie workflow  file
  4. Shell script - uploadFile.sh
  5. Data load commands                
  6. Oozie SMTP configuration
  7. SSH setup          
  8. Oozie commands                    
  9. Output in HDFS
  10. Output email                      
  11. Oozie web console - screenshots
  12. Issues encountered and resolution



3.0.7. Data and script download:



3.0.8. Oozie job.properties file:



3.0.9. Oozie workflow.xml:



3.0.10. Shell script (uploadFile.sh):



3.0.11. HDFS load commands:



3.0.12. Oozie SMTP configuration:



3.0.13. Oozie SSH setup:



3.0.14. Oozie commands:



3.0.15. Output in HDFS:



3.0.16. Output email:



3.0.17. Issues encountered:



3.0.18. Oozie web console - screenshots:

Any additional insights are greatly appreciated.
Cheers!

New Impala e-Book from O’Reilly Media - Free

Folks,
Check this out...
http://blog.cloudera.com/blog/2013/10/download-the-new-impala-e-book-from-oreilly-media/

Download location:
http://www.cloudera.com/content/cloudera/en/resources/library/aboutcloudera/cloudera-impala-ebook.html

Thanks to Manish Verma, for emailing me the link.

Cheers!


Tuesday, September 24, 2013

Reduce-side joins in Java map-reduce

1.0. About reduce side joins

Joins of datasets done in the reduce phase are called reduce-side joins.  Reduce-side joins are easier to implement as they are less stringent than map-side joins, which require the data to be sorted and partitioned the same way.  They are less efficient than map-side joins because the datasets have to go through the sort and shuffle phase.

What's involved..
1.  The key of the map output, for the datasets being joined, has to be the join key - so that they reach the same reducer.
2.  Each dataset has to be tagged with its identity in the mapper - to help differentiate between the datasets in the reducer, so they can be processed accordingly.
3.  In each reducer, the data values from both datasets, for the keys assigned to the reducer, are available to be processed as required.
4.  A secondary sort needs to be done to ensure the ordering of the values sent to the reducer.
5.  If the input files are of different formats, we would need separate mappers, and we would need to use the MultipleInputs class in the driver to add the inputs and associate each with its specific mapper.
[MultipleInputs.addInputPath( job, (input path n), (inputformat class), (mapper class n));]

Note:  The join between the datasets (employee, current salary - cardinality of 1..1) in the sample program below has been demonstrated in my blog on map-side joins of large datasets as well.  I have used the same datasets here, as the purpose of this blog is to demonstrate the concept.  Whenever possible, reduce-side joins should be avoided.

[Update - 10/15/2013]
I have added a pig equivalent in the final section.

2.0. Sample datasets used in this gist

The datasets used are employees and salaries.  For salary data, there are two files - one with current salary (1..1), and one with historical salary data (1..many).  Then there is the department data, a small reference dataset, that we will add to the distributed cache and look up in the reducer.


3.0. Implementing a reduce-side join

The sample code is common for a 1..1 as well as 1..many join for the sample datasets.
The mapper is common for both datasets, as the format is the same.

3.0.1. Components/steps/tasks:

1.  Map output key
The key will be the empNo as it is the join key for the datasets employee and salary
[Implementation: in the mapper]

2.  Tagging the data with the dataset identity
Add an attribute called srcIndex to tag the identity of the data (1=employee, 2=salary, 3=salary history)
[Implementation: in the mapper]

3.  Discarding unwanted attributes
[Implementation: in the mapper]

4. Composite key
Make the map output key a composite of empNo and srcIndex
[Implementation: create custom writable]

5.  Partitioner
Partition the data on the natural key, empNo
[Implementation: create custom partitioner class]

6.  Sorting
Sort the data on empNo first, and then on source index
[Implementation: create custom sorting comparator class]

7.  Grouping
Group the data on the natural key
[Implementation: create custom grouping comparator class]

8. Joining
Iterate through the values for a key and complete the join of employee and salary data, performing a department lookup to include the department name in the output
[Implementation: in the reducer]

3.0.2a. Data pipeline for cardinality of 1..1 between employee and salary data:

3.0.2b. Data pipeline for cardinality of 1..many between employee and salary data:

3.0.3. The Composite key

The composite key is a combination of the join key empNo and the sourceIndex (1=employee file, 2=salary file, and so on).
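
A sketch of what such a composite key can look like (field and class names are illustrative, not necessarily those in the gist):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class CompositeKeyWritable implements WritableComparable<CompositeKeyWritable> {

  private String joinKey;   // empNo - the natural join key
  private int sourceIndex;  // 1=employee, 2=salary, 3=salary history

  public CompositeKeyWritable() {
  }

  public CompositeKeyWritable(String joinKey, int sourceIndex) {
    this.joinKey = joinKey;
    this.sourceIndex = sourceIndex;
  }

  public String getJoinKey() {
    return joinKey;
  }

  public int getSourceIndex() {
    return sourceIndex;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(joinKey);
    out.writeInt(sourceIndex);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    joinKey = in.readUTF();
    sourceIndex = in.readInt();
  }

  @Override
  public int compareTo(CompositeKeyWritable other) {
    // Natural key first, then source index
    int result = joinKey.compareTo(other.joinKey);
    if (result != 0) {
      return result;
    }
    return (sourceIndex < other.sourceIndex) ? -1 : ((sourceIndex == other.sourceIndex) ? 0 : 1);
  }
}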


3.0.4. The mapper

In the setup method of the mapper-
1. Get the filename from the input split, and cross-reference it against the configuration (set in the driver) to derive the source index.  [Driver code: add configuration entries [key=filename of employee dataset, value=1], [key=filename of current salary dataset, value=2], [key=filename of historical salary dataset, value=3]]
2. Build a list of attributes we want to emit as map output for each data entity.

The setup method is called only once, at the beginning of a map task, so it is the logical place to identify the source index.

In the map method of the mapper:
3. Build the map output based on the attributes required, as specified in the list from #2.

Note:  For salary data, we are including the "effective till" date, even though it is not required in the final output, because this is common code for a 1..1 as well as a 1..many join to salary data.  If the salary data is historical, we want the current salary only, that is, records with effective-till date = 9999-01-01.
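
A condensed sketch of such a mapper (new API, tab-delimited input assumed; for brevity the attribute selection is simplified to "everything after empNo", unlike the per-dataset attribute lists described above):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class JoinMapper extends Mapper<LongWritable, Text, CompositeKeyWritable, Text> {

  private int sourceIndex;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Derive the source index from the file name of the split, using the
    // (fileName -> sourceIndex) entries set in the driver configuration
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    sourceIndex = context.getConfiguration().getInt(fileName, -1);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumes tab-delimited records with empNo as the first attribute
    String[] attributes = value.toString().split("\\t");
    String empNo = attributes[0];

    // Emit the remaining attributes, tagged with the source index via the composite key
    StringBuilder emitted = new StringBuilder();
    for (int i = 1; i < attributes.length; i++) {
      if (i > 1) {
        emitted.append("\t");
      }
      emitted.append(attributes[i]);
    }
    context.write(new CompositeKeyWritable(empNo, sourceIndex), new Text(emitted.toString()));
  }
}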


3.0.5. The partitioner

Even though the map output key is a composite, we want to partition by the natural join key, empNo; therefore, a custom partitioner is in order.
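
A minimal version of such a partitioner, keyed off the composite key sketched earlier:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class NaturalKeyPartitioner extends Partitioner<CompositeKeyWritable, Text> {
  @Override
  public int getPartition(CompositeKeyWritable key, Text value, int numPartitions) {
    // Partition on the natural key (empNo) only, ignoring the source index,
    // so that all records for an employee land on the same reducer
    return (key.getJoinKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}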


3.0.6. The sort comparator

To ensure that the input to the reducer is sorted on empNo first and then on sourceIndex, we need a sort comparator.  This guarantees that, within the values list for a key, the employee data comes first, followed by the salary data.
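
A sketch of the sort comparator (it compares deserialized keys for brevity; a raw-bytes comparator would be faster):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeySortComparator extends WritableComparator {

  protected CompositeKeySortComparator() {
    super(CompositeKeyWritable.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    CompositeKeyWritable key1 = (CompositeKeyWritable) a;
    CompositeKeyWritable key2 = (CompositeKeyWritable) b;

    // Sort on empNo first, then on source index (employee=1 before salary=2,3)
    int result = key1.getJoinKey().compareTo(key2.getJoinKey());
    if (result == 0) {
      result = key1.getSourceIndex() - key2.getSourceIndex();
    }
    return result;
  }
}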


3.0.7. The grouping comparator

This class is needed to indicate the group-by attribute - the natural join key, empNo.
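
A sketch of the grouping comparator:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NaturalKeyGroupingComparator extends WritableComparator {

  protected NaturalKeyGroupingComparator() {
    super(CompositeKeyWritable.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Group values by the natural key only, so that all tagged records
    // for an empNo arrive in a single reduce() call
    return ((CompositeKeyWritable) a).getJoinKey()
        .compareTo(((CompositeKeyWritable) b).getJoinKey());
  }
}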


3.0.8. The reducer

In the setup method of the reducer (called only once for the task)-
We check whether the side data - a map file with department data - is in the distributed cache and, if found, initialize the map file reader.

In the reduce method-
While iterating through the value list -
1. If the data is employee data (sourceIndex=1), we look up the department name in the map file using the deptNo, which is the last attribute in the employee data, and append the department name to the employee data.
2. If the data is historical salary data, we only emit salary records where the last attribute is '9999-01-01'.

Key point-
We have set the sort comparator to sort on empNo and sourceIndex.
The sourceIndex of the employee data is lower than that of the salary data - as set in the driver.
Therefore, we are assured that the employee data always arrives first, followed by the salary data.
So for each distinct empNo, we iterate through the values, append them to the joined record, and emit the result.
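
A condensed sketch of such a reducer (it assumes tab-delimited values, a salary record laid out as salary, from_date, to_date, and a Text-keyed department MapFile located via the distributed cache; details differ from the actual gist):

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<CompositeKeyWritable, Text, NullWritable, Text> {

  private MapFile.Reader deptReader;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Locate the department MapFile (added to the distributed cache by the driver)
    // and open a reader against it
    URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
    if (cacheFiles != null && cacheFiles.length > 0) {
      deptReader = new MapFile.Reader(FileSystem.get(context.getConfiguration()),
          cacheFiles[0].getPath(), context.getConfiguration());
    }
  }

  @Override
  protected void reduce(CompositeKeyWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    StringBuilder joined = new StringBuilder(key.getJoinKey());

    // The sort comparator guarantees the employee record (sourceIndex=1) arrives before
    // the salary records; the key instance is refreshed for every value in the iteration
    for (Text value : values) {
      String[] attributes = value.toString().split("\\t");

      if (key.getSourceIndex() == 1) {
        // Employee record: append it, then look up the department name by deptNo (last attribute)
        joined.append("\t").append(value.toString());
        if (deptReader != null) {
          Text deptName = new Text();
          if (deptReader.get(new Text(attributes[attributes.length - 1]), deptName) != null) {
            joined.append("\t").append(deptName.toString());
          }
        }
      } else if ("9999-01-01".equals(attributes[attributes.length - 1])) {
        // Salary record: keep only the current one (effective till 9999-01-01)
        joined.append("\t").append(attributes[0]);
      }
    }
    context.write(NullWritable.get(), new Text(joined.toString()));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (deptReader != null) {
      deptReader.close();
    }
  }
}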



3.0.9. The driver

Besides the usual driver code, we are-
1. Adding the side data (department lookup data, in map file format, in HDFS) to the distributed cache
2. Adding key-value pairs to the configuration, each key-value pair being (filename, source index).
This is used by the mapper to tag data with its sourceIndex.
3. And lastly, associating all the various classes we created with the job.
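
A sketch of the driver wiring (paths, the job name, and the filename-to-sourceIndex keys are illustrative):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReduceSideJoinDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // args[0]=input dir, args[1]=output dir, args[2]=HDFS path of the department MapFile
    Configuration conf = getConf();

    // Map each input file name to a source index; the mapper reads these in setup().
    // File names here are illustrative.
    conf.setInt("employees_active.txt", 1);
    conf.setInt("salaries_active.txt", 2);
    conf.setInt("salaries_history.txt", 3);

    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceSideJoinDriver.class);

    job.setMapperClass(JoinMapper.class);
    job.setReducerClass(JoinReducer.class);

    job.setMapOutputKeyClass(CompositeKeyWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    job.setPartitionerClass(NaturalKeyPartitioner.class);
    job.setSortComparatorClass(CompositeKeySortComparator.class);
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);

    // Side data: department lookup MapFile in HDFS, added to the distributed cache
    DistributedCache.addCacheFile(new URI(args[2]), job.getConfiguration());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ReduceSideJoinDriver(), args));
  }
}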



4.0. The pig equivalent



Pig script-version 1:



Pig script-version 2 - eliminating the reduce-side join:
In this script, we are filtering on the most recent salary, and then using Pig's merge join optimization (map-side), which can be leveraged when the inputs to the join are sorted.


Output:

Sunday, September 22, 2013

Map-side join of large datasets using CompositeInputFormat

This post covers map-side joins of large datasets using CompositeInputFormat; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Hive and Pig rock and rule at joining datasets, but it helps to know how to perform joins in Java.

Update [10/15/2013]
I have added the pig equivalent at the very bottom of the gist.

Feel free to share any insights or constructive criticism. Cheers!!

Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
3. Map-side join sample in Java of two large datasets, leveraging CompositeInputFormat

Sample program:
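
For orientation, the general shape of a CompositeInputFormat-based join (older mapred API) looks like the sketch below - both inputs must be sorted by the join key and identically partitioned, and the paths and class names are placeholders rather than the exact sample program:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class CompositeJoinSketch {

  // The mapper receives the join key and a TupleWritable holding one record per source
  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value, OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {
      output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompositeJoinSketch.class);
    conf.setJobName("map-side join with CompositeInputFormat");

    conf.setInputFormat(CompositeInputFormat.class);
    // Inner join of two sorted, identically partitioned datasets
    conf.set("mapred.join.expr", CompositeInputFormat.compose("inner",
        KeyValueTextInputFormat.class, new Path(args[0]), new Path(args[1])));

    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0);            // map-only join
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}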

Friday, September 20, 2013

Handling small files using CombineFileInputFormat in Java MapReduce

This post covers CombineFileInputFormat; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

NLineInputFormat in Java MapReduce - use case, code sample

This post covers NLineInputFormat; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

Thursday, September 19, 2013

MultipleOutputs in Java MapReduce

This post covers MultipleOutputs; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

Wednesday, September 18, 2013

Secondary sort in Java MapReduce

This post covers secondary sort in Java MapReduce; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

Tuesday, September 17, 2013

Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2

This post covers map-side joins in Java MapReduce; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!


What's in this blog?

A sample map-reduce program in Java that joins two datasets on the map side - an employee dataset and a department dataset, with the department number as the join key.  The department dataset is a very small dataset in MapFile format; it is in HDFS and is added to the distributed cache.  The MapFile is referenced in the map method of the mapper to look up the department name, and the employee records are emitted with the department name included.

Apache documentation on DistributedCache:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2

Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html

Pig and Hive for joins:
Pig and Hive have join capabilities built in, and are optimized for them.  Programs with joins written in Java are more performant, but time-consuming to code, test, and support - and in some companies considered an anti-pattern for joins.

Sample program

Monday, September 16, 2013

Map-side join sample in Java using reference data (text file) from distributed cache - Part 1

This post covers map-side joins in Java MapReduce; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?

A sample map-reduce program in Java that joins two datasets on the map side - an employee dataset and a department dataset, with the department number as the join key.  The department dataset is a very small reference dataset; it is in HDFS and is added to the distributed cache.  The mapper retrieves the department data available through the distributed cache and loads it into a HashMap in the setup() method; the HashMap is then referenced in the map method to get the department name, and the employee records are emitted with the department name included.

Section 2 demonstrates a solution where a file in HDFS is added to the distributed cache in the driver code, and accessed in the mapper setup method through the DistributedCache.getCacheFiles method.

Section 3 demonstrates a solution where a local file is added to the distributed cache at the command line, and accessed in the mapper setup method.

Apache documentation on DistributedCache:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

Related blogs:
1. Map-side join sample using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2

Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html

Pig and Hive for joins:
Pig and Hive have join capabilities built in, and are optimized for them.  Programs with joins written in Java are more performant, but time-consuming to code, test, and support - and in some companies considered an anti-pattern for joins.

2.0. Sample program

In this program, the side data exists in HDFS, is added to the distributed cache in the driver code, and is referenced in the mapper using the DistributedCache.getCacheFiles method.
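
A condensed sketch of the mapper side of this approach (the department file layout, field positions, and class names are assumptions for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

  private Map<String, String> deptMap = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // In the driver: DistributedCache.addCacheFile(new URI("<HDFS path to departments file>"), conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
    if (cacheFiles != null && cacheFiles.length > 0) {
      FileSystem fs = FileSystem.get(context.getConfiguration());
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(new Path(cacheFiles[0].getPath()))));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          // departments file assumed as: deptNo <tab> deptName
          String[] parts = line.split("\\t");
          deptMap.put(parts[0], parts[1]);
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Employee record assumed tab-delimited, with deptNo as the last attribute
    String[] attributes = value.toString().split("\\t");
    String deptName = deptMap.get(attributes[attributes.length - 1]);
    context.write(NullWritable.get(), new Text(value.toString() + "\t" + deptName));
  }
}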



3.0. Variation 

As a variation on the code in section 2.0, this section demonstrates how to add side data that is not in HDFS to the distributed cache through the command line, leveraging GenericOptionsParser.



   

Friday, September 13, 2013

Sequence File - construct, usage, code samples

This post covers the sequence file format; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?


1.  Introduction to sequence file format
2.  Sample code to create a sequence file (compressed and uncompressed) from a text file in a MapReduce program, and to read a sequence file.

2.0. What's a Sequence File?


2.0.1. About sequence files:
A sequence file is a persistent data structure for binary key-value pairs.

2.0.2. Construct:
Sequence files have sync points included after every few records, which align with record boundaries and help a reader re-synchronize.  The sync points support splitting of files for MapReduce operations.  Sequence files support record-level and block-level compression.

Apache documentation: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html

Excerpts from Hadoop: The Definitive Guide...
"A sequence file consists of a header followed by one or more records. The first three bytes of a sequence file are the bytes SEQ, which acts as a magic number, followed by a single byte representing the version number. The header contains other fields, including the names of the key and value classes, compression details, user-defined metadata, and the sync marker. 
Structure of sequence file with and without record compression-

The format for record compression is almost identical to no compression, except the value bytes are compressed using the codec defined in the header. Note that keys are not compressed.
Structure of sequence file with and without block compression-

Block compression compresses multiple records at once; it is therefore more compact than and should generally be preferred over record compression because it has the opportunity to take advantage of similarities between records. A sync marker is written before the start of every block. The format of a block is a field indicating the number of records in the block, followed by four compressed fields: the key lengths, the keys, the value lengths, and the values."

The uncompressed, record-compressed, and block-compressed sequence files share the same header.  Details on sequence files are below, from the Apache documentation.

SequenceFile Header
  • version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName -key class
  • valueClassName - value class
  • compression - A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata - SequenceFile.Metadata for this file.
  • sync - A sync marker to denote end of the header.
Uncompressed SequenceFile Format
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 bytes or so.
Record-Compressed SequenceFile Format
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync-marker every few 100 bytes or so.
Block-Compressed SequenceFile Format
  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker every block.
2.0.3. Datatypes: 
The keys and values need not be instances of Writable; they just need to support serialization.

2.0.4. Creating sequence files: 
Uncompressed: Create an instance of SequenceFile.Writer and call append() to add key-value pairs.  For record- and block-compressed files, refer to the Apache documentation.  When creating compressed files, the actual compression algorithm used to compress keys and/or values can be specified by using the appropriate CompressionCodec.

2.0.5. Reading data in sequence files: 
Create an instance of SequenceFile.Reader and iterate through the entries using reader.next(key, value).
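
A small standalone sketch of both writing and reading (the path and record contents are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");   // hypothetical path

    // Write an uncompressed sequence file.
    // For compression, use a createWriter overload that takes a
    // SequenceFile.CompressionType and a CompressionCodec.
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
      for (int i = 0; i < 5; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      IOUtils.closeStream(writer);
    }

    // Read it back
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}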

2.0.6. Usage
- Data storage for key-value type data
- Container for other files
- Efficient from a storage perspective (binary), and efficient from a MapReduce processing perspective (supports compression and splitting)

3.0. Creating a sequence file


4.0. Reading a sequence file

Covered already in the gist under section 3.

5.0. Any thoughts/comments

Any constructive criticism and/or additions/insights are much appreciated.

Cheers!!





Thursday, September 12, 2013

Map File - construct, usage, code samples

This post covers the map file format; it has links to Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?

1.  Introduction to map file
2.  Sample code to convert a text file to a map file
3.  Sample code to read a map file

2.0. What's a Map File?

2.0.1. Definition:
From Hadoop: The Definitive Guide..
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Apache documentation:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html 

2.0.2. Datatypes: 
The keys must be instances of WritableComparable, and the values, Writable.

2.0.3. Creating map files: 
Create an instance of MapFile.Writer and call append() to add key-value pairs, in sorted key order.

2.0.4. Looking up data in map files: 
Create an instance of MapFile.Reader and call get(key, value).
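
A small standalone sketch of both operations (the path and entries are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String mapFileDir = "departments_map";   // hypothetical path; a MapFile is a directory

    // Write: keys must be appended in sorted order
    MapFile.Writer writer = new MapFile.Writer(conf, fs, mapFileDir, Text.class, Text.class);
    try {
      writer.append(new Text("d001"), new Text("Marketing"));
      writer.append(new Text("d002"), new Text("Finance"));
    } finally {
      writer.close();
    }

    // Look up a key
    MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
    try {
      Text value = new Text();
      if (reader.get(new Text("d002"), value) != null) {
        System.out.println("d002 -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}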

2.0.5. Construct
The map file is actually a directory.  Within it, there is an "index" file and a "data" file.
The data file is a sequence file and has the keys and associated values.
The index file is smaller; it has key-value pairs with the key being the actual key of the data and the value being the byte offset.  The index file contains only a fraction of the keys, as determined by MapFile.Writer.getIndexInterval().

2.0.5.1. Directory structure:
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index

2.0.5.2. Content of the file 'data':
$ hadoop fs -text formatProject/data/departments_map/data
d001  Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service

2.0.5.3. Content of the file 'index':
$ hadoop fs -text formatProject/data/departments_map/index
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380

2.0.6. Behind the scenes of a look up
The index file is read into memory and binary-searched for the largest key less than or equal to the one being looked up.  The reader then seeks to that offset in the data file and reads forward to the key being looked up, extracting and returning the associated value.  A null is returned if the key is not found.

If the index is too large to load into memory, there are configurations that can be set to skip keys when loading the index.

2.0.7. Usage
Fast lookups - in joins, among others.
Can also be used as a container for small files, with the filename as the key.

3.0. Creating a map file


4.0. Looking up a key in a map file

Covered already in the gist under section 3.
The plan is to use the map file in a map-side join in a subsequent blog.

5.0. Any thoughts/comments

Any constructive criticism and/or additions/insights are much appreciated.

Cheers!!

Wednesday, September 11, 2013

Apache Oozie - Part 12: Oozie Shell Action + Passing output from one Oozie action to another

I had read about the Oozie capability to pass output from one action to another and had forgotten about it; sure enough, it came up in an interview.  Here's some sample code...


1.0. What's covered in the blog?

1. Documentation on the Oozie shell action
2. A sample Oozie workflow that includes a shell script action that echoes a count of the number of lines in a file glob, and an email action that captures the output of the shell action and emails it.

Version:
Oozie 3.3.0; Pig 0.10.0

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie shell action + passing output from one action to another


2.0. Documentation on the Oozie Shell Action


Apache documentation is available at - http://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html


3.0. Sample program



4.0. Oozie web console screenshots

Wednesday, August 28, 2013

Apache Hive: The .hiverc file

What is the .hiverc file?

It is a file that is executed when you launch the Hive shell, making it an ideal place for any Hive configuration/customization you want set at the start of every session. This could be:
- Setting column headers to be visible in query results
- Making the current database name part of the hive prompt
- Adding any jars or files
- Registering UDFs

.hiverc file location

The file is loaded from the Hive conf directory.
I have the CDH4.2 distribution and the location is: /etc/hive/conf.cloudera.hive1
If the file does not exist, you can create it.
It needs to be deployed on every node from where you might launch the Hive shell.
[Note: I had to create the file; the distribution did not come with it.]

Sample .hiverc

add jar /home/airawat/hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=30000000;








Tuesday, July 16, 2013

Apache Oozie - Part 8: Subworkflow


1.0. What's covered in the blog?

1) Apache documentation on sub-workflows
2) A sample program that includes the components of an Oozie workflow application with a Java main action and a sub-workflow containing a Sqoop action.  Scripts/code, a sample dataset, and commands are included.  Oozie actions covered: Java action, Sqoop action (MySQL database).

Versions:
Oozie 3.3.0, Sqoop (1.4.2) with Mysql (5.1.69)

Related blogs:
Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11a: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action +passing output from one action to another


2.0. Apache documentation on sub-workflows


The sub-workflow action runs a child workflow job; the child workflow job can be in the same Oozie system or in another Oozie system.  The parent workflow job will wait until the child workflow job has completed.

Syntax:

The child workflow job runs in the same Oozie system instance where the parent workflow job is running.
The app-path element specifies the path to the workflow application of the child workflow job.
The propagate-configuration flag, if present, indicates that the workflow job configuration should be propagated to the child workflow.

The configuration section can be used to specify the job properties that are required to run the child workflow job.  The configuration of the sub-workflow action can be parameterized (templatized) using EL expressions.

Link to Apache documentation:
http://oozie.apache.org/docs/3.3.0/WorkflowFunctionalSpec.html#a3.2.6_Sub-workflow_Action

Note:
For a typical on-demand workflow, you have the core components - job.properties and workflow.xml.  For a sub-workflow, you need yet another workflow.xml that clearly defines the activities to occur in the sub-workflow.  In the parent workflow, the sub-workflow is referenced.  To keep it neat, it is best to have a sub-directory to hold the sub-workflow's core components.  Also, a single job.properties is sufficient.

3.0. Sample workflow application

Highlights:
The workflow has two actions - one is a java main action and the other is a sub-workflow action.

The Java main action parses log files on HDFS and generates a report.
The sub-workflow action executes after success of the Java main action, and pipes the report in HDFS to a MySQL database.


Pictorial overview:





Components of such a workflow application:

Application details:




Oozie web console - screenshots: