Hooked on Hadoop: Map-side join of large datasets using CompositeInputFormat

Sunday, September 22, 2013

Map-side join of large datasets using CompositeInputFormat

This post covers map-side joins of large datasets using CompositeInputFormat. It includes links to the Apache documentation, my notes on the topic, and my sample program demonstrating the functionality. Hive and Pig rock and rule at joining datasets, but it helps to know how to perform joins in Java.
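To make the idea concrete, here is a minimal sketch in plain Java of the kind of merge a map-side join performs over inputs that are already sorted on the join key. It is not Hadoop code and not the sample program from this post: the class `MergeJoinSketch`, the method `innerJoin`, and the small in-memory lists are hypothetical stand-ins for two pre-sorted input files. It assumes each key appears at most once per input and that keys compare correctly as strings (true for equal-length employee numbers).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeJoinSketch {

    // Inner-joins two lists of comma-separated records that are both sorted
    // ascending on their first field (the join key).
    // Assumption: keys are unique within each list.
    public static List<String> innerJoin(List<String> left, List<String> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i).split(",");
            String[] r = right.get(j).split(",");
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {
                // Keys match: emit key, left fields 2 and 3, right field 1,
                // tab separated (EmpNo, FName, LName, Salary in the demo data).
                joined.add(String.join("\t", l[0], l[2], l[3], r[1]));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // left key has no partner on the right; skip it
            } else {
                j++; // right key has no partner on the left; skip it
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Employee records: EmpNo,DOB,FName,LName,Gender,HireDate,DeptNo
        List<String> employees = Arrays.asList(
            "10001,1953-09-02,Georgi,Facello,M,1986-06-26,d005",
            "10002,1964-06-02,Bezalel,Simmel,F,1985-11-21,d007",
            "10003,1959-12-03,Parto,Bamford,M,1986-08-28,d004");
        // Salary records: EmpNo,Salary,FromDate,ToDate
        List<String> salaries = Arrays.asList(
            "10001,88958,2002-06-22,9999-01-01",
            "10003,74057,2001-11-27,9999-01-01");
        for (String row : innerJoin(employees, salaries)) {
            System.out.println(row);
        }
    }
}
```

Because both inputs are walked once, in key order, neither side has to be held in memory or shuffled to a reducer. The merge only works if both inputs really are sorted on the join key.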

Update [10/15/2013]
I have added the Pig equivalent at the very bottom of the gist.

Feel free to share any insights or constructive criticism. Cheers!!

Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
3. Map-side join sample in Java of two large datasets, leveraging CompositeInputFormat

Sample program:

8 comments:

Unknown, January 9, 2014 at 4:44 AM
Thank you, very helpful.

vishnu, May 1, 2014 at 1:00 AM
Really good blog :) thanks for sharing.

I have one question on the CompositeInputFormat. How does it work internally? It looks like it will read all the paths and do an inner join on the key, and the result will be passed on to the mapper. If this is true, is there any operation other than "inner" that can be used as the op?

Unknown, October 1, 2014 at 8:20 PM
Hi,

I would like to know how to deal with the case where you have unsorted data. For example, suppose the input data was as follows:

Employee data [joinProject/data/employees_sorted/part-e]
--------------------------------------------------------
[EmpNo,DOB,FName,LName,Gender,HireDate,DeptNo]
10001,1953-09-02,Georgi,Facello,M,1986-06-26,d005
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21,d007
10003,1959-12-03,Parto,Bamford,M,1986-08-28,d004
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01,d004
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12,d003
10006,1953-04-20,Anneke,Preusig,F,1989-06-02,d005
.....
Salary data [joinProject/data/salaries_sorted/part-s]
------------------------------------------------------
[EmpNo,Salary,FromDate,ToDate]
10001,88958,2002-06-22,9999-01-01
10002,72527,2001-08-02,9999-01-01
10004,43311,2001-12-01,9999-01-01
10003,74057,2001-11-27,9999-01-01
10005,94692,2001-09-09,9999-01-01
..........

and the desired output was :

************************************
Expected Results - tab separated
************************************
[EmpNo FName LName Salary]
10001 Georgi Facello 88958
10002 Bezalel Simmel 72527
10003 Parto Bamford 43311
10004 Chirstian Koblick 74057
10005 Kyoichi Maliniak 94692
10006 Anneke Preusig 59755
10009 Sumant Peac 94409
10010 Duangkaew Piveteau 80324

Unknown, August 5, 2015 at 2:22 PM
Very useful. Thanks

swapna, September 3, 2015 at 10:48 AM
Very useful. Do we need to have the same number of records in both tables?
I tried this with both inner and outer joins. The inner join gives me the expected results, but the outer join does not. Can you try an outer join and let me know? My code is here:
https://github.com/swapnaraja/hadoopCode/tree/master/mapReduce/joins/compositeinput

Unknown, July 8, 2018 at 5:13 AM
Thank you, it is a very nice blog for beginners.
https://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/

Unknown, August 21, 2018 at 10:35 PM
Good post! Thank you so much for sharing it. It was good to read and useful for updating my knowledge. Keep blogging.

Big Data Hadoop training in electronic city

Anonymous, November 20, 2018 at 1:54 AM
Nice blog! I really loved reading through this article. Thanks for sharing such an amazing post with us, and keep blogging.

pmp training Chennai | pmp training centers in Chennai | pmp training institutes in Chennai | pmp training and certification in Chennai | pmp training in velachery