This post covers, map-side join of large datasets using CompositeInputFormat, has links to Apache documentation, my notes on the topic and my sample program demonstrating the functionality. Hive and Pig rock and rule at joining datasets, but it helps to know how to perform joins in java.
Update [10/15/2013]
I have added the pig equivalent at the very bottom of the gist.
Feel free to share any insights or constructive criticism. Cheers!!
Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
3. Map-side join sample in Java of two large datasets, leveraging CompositeInputFormat
Sample program:
Showing posts with label Map-side join. Show all posts
Showing posts with label Map-side join. Show all posts
Sunday, September 22, 2013
Tuesday, September 17, 2013
Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
This post covers, map-side join in Java map-reduce, has links to Apache documentation, my notes on the topic and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!
What's in this blog?
A sample map-reduce program in Java that joins two datasets, on the map-side - an employee dataset and a department dataset, with the department number as join key. The department dataset is a very small dataset in MapFile format, is in HDFS, and is added to the distributed cache. The MapFile is referenced in the map method of the mapper to look up the department name, and emit the employee dataset with department name included.
Apache documentation on DistributedCache:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html
Pig and Hive for joins:
Pig and Hive have join capabilities built-in, and are optimized for the same. Programs with joins written in java are more performant, but time-consuming to code, test and support - and in some companies considered an anti-pattern for joins.
Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html
Pig and Hive for joins:
Pig and Hive have join capabilities built-in, and are optimized for the same. Programs with joins written in java are more performant, but time-consuming to code, test and support - and in some companies considered an anti-pattern for joins.
Sample program
Monday, September 16, 2013
Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
This post covers, map-side join in Java map-reduce, has links to Apache documentation, my notes on the topic and my sample program demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!
1.0. What's in this blog?
A sample map-reduce program in Java that joins two datasets, on the map-side - an employee dataset and a department dataset, with the department number as join key. The department dataset is a very small dataset, is reference data, is in HDFS, and is added to the distributed cache. The mapper program retrieves the department data available through distributed cache and and loads the same into a HashMap in the setUp() method of the mapper, and the HashMap is referenced in the map method to get the department name, and emit the employee dataset with department name included.
Section 2 demonstrates a solution where a file in HDFS is added to the distributed cache in the driver code, and accessed in the mapper setup method through the distributedcache.getCacheFiles method.
Section 3 demonstrates a solution where a local file is added to the distributed cache at the command line, and accessed in the mapper setup method.
Section 2 demonstrates a solution where a file in HDFS is added to the distributed cache in the driver code, and accessed in the mapper setup method through the distributedcache.getCacheFiles method.
Section 3 demonstrates a solution where a local file is added to the distributed cache at the command line, and accessed in the mapper setup method.
Apache documentation on DistributedCache:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Related blogs:
1. Map-side join sample using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html
Pig and Hive for joins:
Pig and Hive have join capabilities built-in, and are optimized for the same. Programs with joins written in java are more performant, but time-consuming to code, test and support - and in some companies considered an anti-pattern for joins.
Related blogs:
1. Map-side join sample using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
Data used in this blog:
http://dev.mysql.com/doc/employee/en.index.html
Pig and Hive for joins:
Pig and Hive have join capabilities built-in, and are optimized for the same. Programs with joins written in java are more performant, but time-consuming to code, test and support - and in some companies considered an anti-pattern for joins.
2.0. Sample program
In this program, the side data, exists in HDFS, and is added to the distributedcache in the driver code, and referenced in the mapper using DistributedCache.getfiles method.
3.0. Variation
As a variation to the code in section 2.0, this section demonstrates how to add side data that is not in HDFS to distributed cache, through command line, leveraging GenericOptionsParser
Subscribe to:
Posts (Atom)