Sunday, September 22, 2013

Map-side join of large datasets using CompositeInputFormat

This post covers, map-side join of large datasets using CompositeInputFormat, has links to Apache documentation, my notes on the topic and my sample program demonstrating the functionality. Hive and Pig rock and rule at joining datasets, but it helps to know how to perform joins in java.

Update [10/15/2013]
I have added the pig equivalent at the very bottom of the gist.

Feel free to share any insights or constructive criticism. Cheers!!

Related blogs:
1. Map-side join sample in Java using reference data (text file) from distributed cache - Part 1
2. Map-side join sample in Java using reference data (MapFile) from distributed cache - Part 2
3. Map-side join sample in Java of two large datasets, leveraging CompositeInputFormat

Sample program:

8 comments:

  1. Really good blog :) thanks for sharing.

    I have one question on the CompositeInputFormat. How does it work internally? It looks like it will read the all the paths and do a inner join on the key and the result will be passed on to the mapper. If this is true, other than "inner" as op is there any other operation that can be done?

    ReplyDelete
  2. Hi,

    I would like to know how to deal with the case where you have unsorted data. For example the input data was as follows:

    Employee data [joinProject/data/employees_sorted/part-e]
    --------------------------------------------------------
    [EmpNo,DOB,FName,LName,Gender,HireDate,DeptNo]
    10001,1953-09-02,Georgi,Facello,M,1986-06-26,d005
    10002,1964-06-02,Bezalel,Simmel,F,1985-11-21,d007
    10003,1959-12-03,Parto,Bamford,M,1986-08-28,d004
    10004,1954-05-01,Chirstian,Koblick,M,1986-12-01,d004
    10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12,d003
    10006,1953-04-20,Anneke,Preusig,F,1989-06-02,d005
    .....
    Salary data [joinProject/data/salaries_sorted/part-s]
    ------------------------------------------------------
    [EmpNo,Salary,FromDate,ToDate]
    10001,88958,2002-06-22,9999-01-01
    10002,72527,2001-08-02,9999-01-01
    10004,43311,2001-12-01,9999-01-01
    10003,74057,2001-11-27,9999-01-01
    10005,94692,2001-09-09,9999-01-01
    ..........

    and the desired output was :

    ************************************
    Expected Results - tab separated
    ************************************
    [EmpNo FName LName Salary]
    10001 Georgi Facello 88958
    10002 Bezalel Simmel 72527
    10003 Parto Bamford 43311
    10004 Chirstian Koblick 74057
    10005 Kyoichi Maliniak 94692
    10006 Anneke Preusig 59755
    10009 Sumant Peac 94409
    10010 Duangkaew Piveteau 80324

    ReplyDelete
  3. very useful. Do we need to have same number of records in both the tables?
    I tried this with both inner and outer. Inner join gives me the expected results but outer join is not. Can you try outer join and let me know. My code is here
    https://github.com/swapnaraja/hadoopCode/tree/master/mapReduce/joins/compositeinput

    ReplyDelete
  4. CIITN is located in Prime location in Noida having best connectivity via all modes of public transport. CIITN offer both weekend and weekdays courses to facilitate Hadoop aspirants. Among all Hadoop Training Institute in Noida , CIITN's Big Data and Hadoop Certification course is designed to prepare you to match all required knowledge for real time job assignment in the Big Data world with top level companies. CIITN puts more focus in project based training and facilitated with Hadoop 2.7 with Cloud Lab—a cloud-based Hadoop environment lab setup for hands-on experience.

    CIITNOIDA is the good choice for Big Data Hadoop Training in NOIDA in the final year. I have also completed my summer training from here. It provides high quality Hadoop training with Live projects. The best thing about CIITNOIDA is its experienced trainers and updated course content. They even provide you placement guidance and have their own development cell. You can attend their free demo class and then decide.

    Hadoop Training in Noida
    Big Data Hadoop Training in Noida

    ReplyDelete
  5. thakyou it vry nice blog for beginners
    https://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/

    ReplyDelete