Tuesday, September 24, 2013

Reduce-side joins in Java map-reduce

1.0. About reduce side joins

Joins performed in the reduce phase are called reduce-side joins.  They are easier to implement than map-side joins, which require the datasets to be sorted and partitioned the same way.  They are less efficient than map-side joins because the datasets have to go through the sort and shuffle phase.

What's involved:
1.  The map output key for each dataset being joined has to be the join key, so that matching records reach the same reducer.
2.  Each record has to be tagged with the identity of its dataset in the mapper, so the reducer can tell the datasets apart and process them accordingly.
3.  In each reducer, the values from both datasets, for the keys assigned to that reducer, are available to be processed as required.
4.  A secondary sort is needed to guarantee the order of the values sent to the reducer.
5.  If the input files are in different formats, we need separate mappers, and we need to use the MultipleInputs class in the driver to add each input and associate it with its mapper:
[MultipleInputs.addInputPath(job, (input path n), (inputformat class), (mapper class n));]

Note:  The join between the datasets (employee and current salary, cardinality of 1..1) in the sample program below is also demonstrated in my blog post on map-side joins of large datasets.  I have used the same datasets here, as the purpose of this post is to demonstrate the concept.  Whenever possible, reduce-side joins should be avoided.

[Update - 10/15/2013]
I have added a Pig equivalent in the final section.

2.0. Sample datasets used in this gist

The datasets used are employees and salaries.  For salary data, there are two files: one with current salary (1..1), and one with historical salary data (1..many).  Then there is the department data, a small reference dataset, that we will add to the distributed cache and look up in the reducer.


3.0. Implementing a reduce-side join

The sample code is common for a 1..1 as well as 1..many join for the sample datasets.
The mapper is common for both datasets, as the format is the same.

3.0.1. Components/steps/tasks:

1.  Map output key
The key will be the empNo as it is the join key for the datasets employee and salary
[Implementation: in the mapper]

2.  Tagging the data with the dataset identity
Add an attribute called srcIndex to tag the identity of the data (1=employee, 2=salary, 3=salary history)
[Implementation: in the mapper]

3.  Discarding unwanted attributes
[Implementation: in the mapper]

4. Composite key
Make the map output key a composite of empNo and srcIndex
[Implementation: create custom writable]

5.  Partitioner
Partition the data on natural key of empNo
[Implementation: create custom partitioner class]

6.  Sorting
Sort the data on empNo first, and then source index
[Implementation: create custom sorting comparator class]

7.  Grouping
Group the data based on natural key
[Implementation: create custom grouping comparator class]

8.  Joining
Iterate through the values for a key and complete the join for employee and salary data, perform lookup of department to include department name in the output
[Implementation: in the reducer]

3.0.2a. Data pipeline for cardinality of 1..1 between employee and salary data:

3.0.2b. Data pipeline for cardinality of 1..many between employee and salary data:

3.0.3. The Composite key

The composite key is a combination of the join key, empNo, and the source index (1=employee file, 2=current salary file, 3=salary history file).
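As a rough illustration of such a key, here is a minimal sketch in plain Java, with no Hadoop dependency. The class name CompositeKeySketch and the roundTrip helper are hypothetical. A real Hadoop key would implement WritableComparable, which declares the same write and readFields methods; here they are ordinary methods so the sketch runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical stand-in for the composite key (not the original gist code).
final class CompositeKeySketch implements Comparable<CompositeKeySketch> {
    private String joinKey;   // natural join key: empNo
    private int sourceIndex;  // 1=employee, 2=current salary, 3=salary history

    CompositeKeySketch() { }  // no-arg constructor, used when deserializing

    CompositeKeySketch(String joinKey, int sourceIndex) {
        this.joinKey = joinKey;
        this.sourceIndex = sourceIndex;
    }

    String getJoinKey() { return joinKey; }
    int getSourceIndex() { return sourceIndex; }

    // Serialize both fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(joinKey);
        out.writeInt(sourceIndex);
    }

    // Deserialize in the same order used by write().
    void readFields(DataInput in) throws IOException {
        joinKey = in.readUTF();
        sourceIndex = in.readInt();
    }

    // Order by empNo first, then by source index.
    @Override
    public int compareTo(CompositeKeySketch other) {
        int byKey = joinKey.compareTo(other.joinKey);
        return byKey != 0 ? byKey : Integer.compare(sourceIndex, other.sourceIndex);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CompositeKeySketch)) return false;
        CompositeKeySketch k = (CompositeKeySketch) o;
        return joinKey.equals(k.joinKey) && sourceIndex == k.sourceIndex;
    }

    @Override
    public int hashCode() { return Objects.hash(joinKey, sourceIndex); }

    // Usage helper: serialize a key to bytes and read it back.
    static CompositeKeySketch roundTrip(CompositeKeySketch key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            CompositeKeySketch copy = new CompositeKeySketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because empNo is compared as a string in this sketch, the ordering is lexicographic; that is enough for the join, which only needs equal keys to sit next to each other.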


3.0.4. The mapper

In the setup method of the mapper:
1. Get the filename from the input split and cross-reference it against the configuration (set in the driver) to derive the source index.  [Driver code: add configuration entries [key=filename of employee, value=1], [key=filename of current salary dataset, value=2], [key=filename of historical salary dataset, value=3]]
2. Build a list of the attributes we want to emit as map output for each data entity.

The setup method is called only once, at the beginning of a map task, so it is the logical place to identify the source index.

In the map method of the mapper:
3. Build the map output based on attributes required, as specified in the list from #2

Note:  For salary data, we include the "effective till" date even though it is not required in the final output, because this code is common to the 1..1 and 1..many joins to salary data.  If the salary data is historical, we want the current salary only, that is, the record with an "effective till" date of 9999-01-01.
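The mapper logic described in this section can be sketched in plain Java, with no Hadoop dependency. Everything specific here is an assumption made for illustration: the class name MapperTaggingSketch, comma-separated records, the column positions, and a Map<String, String> standing in for the job configuration. The constructor plays the role of setup and the map method the role of the mapper's map.

```java
import java.util.Map;

// Hypothetical sketch of the tagging mapper (not the original gist code).
final class MapperTaggingSketch {
    private final int sourceIndex;
    private final int[] columnsToEmit;

    // "setup": runs once per task; derives the source index from the file name.
    MapperTaggingSketch(Map<String, String> conf, String fileName) {
        String configured = conf.get(fileName);
        if (configured == null) {
            // Fail with a clear message instead of parsing a null value.
            throw new IllegalArgumentException("No source index configured for file: " + fileName);
        }
        sourceIndex = Integer.parseInt(configured);
        // Assumed layouts:
        //   employee: empNo,birthDate,firstName,lastName,gender,hireDate,deptNo
        //   salary:   empNo,salary,effectiveFrom,effectiveTill
        columnsToEmit = (sourceIndex == 1) ? new int[] {2, 3, 6} : new int[] {1, 3};
    }

    // "map": returns {empNo, sourceIndex, projected attributes}.
    String[] map(String line) {
        String[] fields = line.split(",");
        StringBuilder value = new StringBuilder();
        for (int col : columnsToEmit) {
            if (value.length() > 0) value.append(',');
            value.append(fields[col]);
        }
        return new String[] {fields[0], String.valueOf(sourceIndex), value.toString()};
    }
}
```

With the assumed layouts, deptNo stays the last employee attribute and the "effective till" date stays the last salary attribute, which is what the reducer relies on.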


3.0.5. The partitioner

Even though the map output key is composite, we want to partition by the natural join key of empNo, therefore a custom partitioner is in order.
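A minimal sketch of that idea in plain Java follows. The class name NaturalKeyPartitionerSketch is hypothetical, and the method takes the two parts of the composite key as separate arguments so the example has no Hadoop dependency. The formula is the one used by Hadoop's default HashPartitioner, applied to empNo alone rather than to the whole key.

```java
// Hypothetical sketch of partitioning on the natural key only.
final class NaturalKeyPartitionerSketch {
    // sourceIndex is deliberately ignored, so every record for an empNo
    // lands in the same partition regardless of which dataset it came from.
    static int getPartition(String joinKey, int sourceIndex, int numPartitions) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative.
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In a Hadoop job, the equivalent class would extend Partitioner and be registered with job.setPartitionerClass(...).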


3.0.6. The sort comparator

To ensure that the input to the reducer is sorted on empNo, then on sourceIndex, we need a sort comparator.  This will guarantee that the employee data is the first set in the values list for a key, then the salary data.
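The ordering can be illustrated with a small plain-Java sketch. The names SortComparatorSketch, Key and sortedLabels are hypothetical, and a java.util.Comparator stands in for the Hadoop comparator class, which would typically extend WritableComparator and be registered with job.setSortComparatorClass(...).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the secondary-sort ordering.
final class SortComparatorSketch {
    static final class Key {
        final String empNo;
        final int sourceIndex;
        Key(String empNo, int sourceIndex) {
            this.empNo = empNo;
            this.sourceIndex = sourceIndex;
        }
        @Override public String toString() { return empNo + "/" + sourceIndex; }
    }

    // empNo first, then source index, so employee (1) precedes salary (2, 3).
    static final Comparator<Key> BY_EMP_NO_THEN_SOURCE =
        Comparator.comparing((Key k) -> k.empNo).thenComparingInt(k -> k.sourceIndex);

    // Sorts a copy of the keys and returns them as "empNo/sourceIndex" labels.
    static List<String> sortedLabels(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(BY_EMP_NO_THEN_SOURCE);
        List<String> labels = new ArrayList<>();
        for (Key k : copy) labels.add(k.toString());
        return labels;
    }
}
```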


3.0.7. The grouping comparator

This class is needed to indicate the group by attribute - the natural join key of empNo
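To show what grouping on the natural key achieves, here is a plain-Java sketch. The class name GroupingComparatorSketch and the record shape {empNo, sourceIndex, value} are assumptions for illustration. The group method imitates how already-sorted map output is handed to the reducer: a new group starts whenever the grouping comparison reports a different empNo, even though the full composite keys differ within a group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of grouping on empNo only.
final class GroupingComparatorSketch {
    // Records are {empNo, sourceIndex, value}; the source index is ignored here.
    static int compareForGrouping(String[] a, String[] b) {
        return a[0].compareTo(b[0]);
    }

    // Walks records that are already sorted and collects the values per group.
    static List<List<String>> group(List<String[]> sortedRecords) {
        List<List<String>> groups = new ArrayList<>();
        String[] previous = null;
        for (String[] record : sortedRecords) {
            if (previous == null || compareForGrouping(previous, record) != 0) {
                groups.add(new ArrayList<>());  // a new empNo starts a new reduce call
            }
            groups.get(groups.size() - 1).add(record[2]);
            previous = record;
        }
        return groups;
    }
}
```

In a Hadoop job, the equivalent comparator would be registered with job.setGroupingComparatorClass(...).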


3.0.8. The reducer

In the setup method of the reducer (called only once for the task):
We check whether the side data, a map file with department data, is in the distributed cache and, if found, initialize the map file reader.

In the reduce method, while iterating through the value list:
1. If the data is employee data (sourceIndex=1), we look up the department name in the map file using the deptNo, which is the last attribute in the employee data, and append the department name to the employee data.
2. If the data is historical salary data, we emit the salary only where the last attribute is '9999-01-01'.

Key point:
We have set the sort comparator to sort on empNo and sourceIndex.
The sourceIndex of employee data is lower than that of salary data, as set in the driver.
Therefore, we are assured that the employee data always comes first, followed by the salary data.
So for each distinct empNo, we iterate through the values, append them, and emit the result as output.
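The reduce logic can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the original reducer: the class name ReducerJoinSketch is hypothetical, values arrive as {sourceIndex, comma-separated attributes} pairs already ordered with the employee record first, a Map<String, String> stands in for the department map file, and "UNKNOWN" is a made-up placeholder for a missing department.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the join performed in the reducer.
final class ReducerJoinSketch {
    static final String CURRENT_MARKER = "9999-01-01";

    // One call per empNo; returns the joined output line.
    static String reduce(String empNo, List<String[]> values, Map<String, String> deptNames) {
        StringBuilder out = new StringBuilder(empNo);
        for (String[] v : values) {
            int sourceIndex = Integer.parseInt(v[0]);
            String[] attrs = v[1].split(",");
            String last = attrs[attrs.length - 1];
            if (sourceIndex == 1) {
                // Employee: deptNo is the last attribute; append the department name.
                out.append(',').append(v[1])
                   .append(',').append(deptNames.getOrDefault(last, "UNKNOWN"));
            } else if (sourceIndex == 2 || CURRENT_MARKER.equals(last)) {
                // Current salary file: every record qualifies.
                // Salary history: keep only the record effective till 9999-01-01.
                out.append(',').append(attrs[0]);
            }
        }
        return out.toString();
    }
}
```

Because the values are already ordered by source index, the sketch can append as it iterates without buffering the salary records.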



3.0.9. The driver

Besides the usual driver code, we are:
1. Adding side data (department lookup data in map file format, in HDFS) to the distributed cache.
2. Adding key-value pairs to the configuration, each pair being a filename and its source index.
The mapper uses these to tag data with the sourceIndex.
3. Lastly, associating all the classes we created with the job.



4.0. The Pig equivalent



Pig script-version 1:



Pig script-version 2 - eliminating the reduce-side join:
In this script, we filter on the most recent salary and then use Pig's merge join optimization (map-side), which can be leveraged when the inputs to the join are sorted.


Output:

60 comments:

  1. Hi Anagha!

    I was messing around with this. I was able to get the job to run if I disable the reducer, but I get an appending error if I run it with the reducer. I assume this might be due to not having this file: departments_map.tar.gz

    I just wanted to see how I could get the job to run the emp dataset and the salary history dataset ( 1... many) without having to do the join on departments as well. Is the job built out to require the distrib cache tarball?


    Thanks!!

    Dan D. Tran (ddantran19@gmail.com)

  3. Very very clear explanation, thx a lot

  4. Indeed a very good explanation. Thanks again Anagha

  6. Here you are using a "CompositeKeyWritableRSJ" object as the output key of the mapper. Since "CompositeKeyWritableRSJ" is a class, wouldn't it be enough to override its "hashCode()" method (to return a value that helps with grouping within a reducer) instead of implementing a grouping comparator (GroupingComparatorRSJ)?

    I mean, if we implement a hashCode() method (as below) in the CompositeKeyWritableRSJ class:
    @Override
    public int hashCode() {
        return joinKey.hashCode(); // joinKey is the empNo string in CompositeKeyWritableRSJ
    }

    wouldn't this be good enough instead of implementing a new "GroupingComparatorRSJ" class? Also, I think that if we override the "hashCode()" method, we don't even need to write the partitioner class (i.e. PartitionerRSJ), because I think the default partitioner uses the hashCode of the mapper's output key.

  7. Thanks for the explanation.
    I'm currently doing a project on LIBRA, a lightweight strategy for solving data skew (the imbalance in the amount of data assigned to each reducer), which occurs mainly in reduce-side applications.
    Can I use a reduce-side join in my project to reduce data skew?
    Please reply.
    Thank you.

  9. Cannot say thanks enough. You rock. This may or may not be relevant, but:
    Co-location is not hard to implement in Hadoop or any file system for those who know the internals, and until it is widespread, nothing except ETL pre-processing and unstructured data will touch open source/Hadoop.
    However you do the sorting/redistribution, with 2-3 large datasets it will not work. It is network bound, period.

  15. Hi Anagha.
    I am new to Hadoop. I tried running the program on the dataset and am receiving the following error.
    C:\Users\nsita>hadoop jar c:\java\jar\RSJProgram.jar reducesidejoin.DriverRSJ /playground/data/part-e /playground/data/part-sc /playground/data/RSJReduceOutput
    19/01/16 19:49:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    19/01/16 19:49:03 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    19/01/16 19:49:04 INFO input.FileInputFormat: Total input files to process : 2
    19/01/16 19:49:04 INFO mapreduce.JobSubmitter: number of splits:2
    19/01/16 19:49:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547637839810_0014
    19/01/16 19:49:04 INFO impl.YarnClientImpl: Submitted application application_1547637839810_0014
    19/01/16 19:49:04 INFO mapreduce.Job: The url to track the job: http://DESKTOP-JCF7H50:8088/proxy/application_1547637839810_0014/
    19/01/16 19:49:04 INFO mapreduce.Job: Running job: job_1547637839810_0014
    19/01/16 19:49:12 INFO mapreduce.Job: Job job_1547637839810_0014 running in uber mode : false
    19/01/16 19:49:12 INFO mapreduce.Job: map 0% reduce 0%
    19/01/16 19:49:18 INFO mapreduce.Job: Task Id : attempt_1547637839810_0014_m_000000_0, Status : FAILED
    Error: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at reducesidejoin.MapperRSJ.setup(MapperRSJ.java:35)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)

    I feel the error is emitted from MapperRSJ, at the line below, but I am not able to rectify the problem. Would you be able to help me out here?

    intSrcIndex = Integer.parseInt(context.getConfiguration().get(fsFileSplit.getPath().getName()));
