Thursday, September 12, 2013

Map File - construct, usage, code samples

This post covers the map file format, with links to the Apache documentation, my notes on the topic, and sample programs demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?

1.  Introduction to map file
2.  Sample code to convert a text file to a map file
3.  Sample code to read a map file

2.0. What's a Map File?

2.0.1. Definition:
From Hadoop: The Definitive Guide:
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Apache documentation:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html 

2.0.2. Datatypes: 
The keys must be instances of WritableComparable, and the values, Writable.

2.0.3. Creating map files: 
Create an instance of MapFile.Writer and call append() to add key-value pairs, in sorted key order.

2.0.4. Looking up data in map files: 
Create an instance of MapFile.Reader and call get(key, value).

2.0.5. Construct
The map file is actually a directory containing two files: an "index" file and a "data" file.
The data file is a sequence file holding the keys and their associated values.
The index file is smaller. It holds key-value pairs in which the key is an actual key from the data file and the value is that key's byte offset within the data file. The index contains only a fraction of the keys, as determined by MapFile.Writer.getIndexInterval().
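As a rough illustration of how the index interval controls the index size: the writer records an index entry for the first key and then for every indexInterval-th key appended, so the number of index entries is approximately ceil(records / interval). A minimal sketch in plain Java (no Hadoop dependency; the class and method names here are mine, for illustration only):

```java
// Illustration only: how many entries land in the 'index' file for a
// given record count and index interval.
public class IndexIntervalDemo {

    // One index entry is written for every 'interval'-th appended key,
    // starting with the first, i.e. ceil(records / interval) entries.
    static int indexEntries(int records, int interval) {
        return (records + interval - 1) / interval;
    }

    public static void main(String[] args) {
        // The sample data set: 9 department records
        System.out.println(indexEntries(9, 1));    // interval 1: every key indexed -> 9
        System.out.println(indexEntries(9, 128));  // default interval 128 -> 1
    }
}
```

This is why the sample program below calls setIndexInterval(1): with only 9 records and the default interval of 128, the index would hold a single key.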

2.0.5.1. Directory structure:
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index

2.0.5.2. Content of the file 'data':
$ hadoop fs -text formatProject/data/departments_map/data
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service

2.0.5.3. Content of the file 'index':
$ hadoop fs -text formatProject/data/departments_map/index
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380

2.0.6. Behind the scenes of a look up
The index file is read into memory and binary-searched for the greatest key less than or equal to the one being looked up. The reader then seeks to that key's offset in the data file and scans forward until it reaches the lookup key, extracting and returning the associated value. If the key is not found, null is returned.

If the index itself is too large to load into memory, there are configurations that can be set to load only every nth index key.
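The seek-and-scan step described above can be simulated with java.util.TreeMap. This is an illustration only, not the actual MapFile internals; suppose the index interval were 4, so the index held only d001, d005, and d009:

```java
import java.util.TreeMap;

// Illustration only: simulates the "find the greatest indexed key <= the
// lookup key, then seek" step of a map file lookup using TreeMap.floorKey().
public class IndexLookupSketch {

    // Returns the indexed key the reader would seek to before scanning
    // forward through the data file.
    static String seekKey(TreeMap<String, Long> index, String lookupKey) {
        return index.floorKey(lookupKey);
    }

    public static void main(String[] args) {
        // A sparse index: key -> byte offset in the data file (offsets taken
        // from the 'index' file listed earlier)
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("d001", 121L);
        index.put("d005", 250L);
        index.put("d009", 380L);

        // Looking up d007: the reader seeks to d005 (offset 250) and scans
        // forward through the data file until it reaches d007.
        String start = seekKey(index, "d007");
        System.out.println("Seek to " + start + " at offset " + index.get(start));
        // -> Seek to d005 at offset 250
    }
}
```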

2.0.7. Usage
Fast lookups - in joins, among other uses.
A map file can also be used as a container for small files, with the filename as the key.

3.0. Creating a map file

This gist demonstrates how to create a map file from a text file, and how to look up a key in it.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
7. Java program to lookup data in map file
8. Command to run program to do a lookup
01. Data and script download
-----------------------------
Google:
<<To be added>>
Email me at airawat.blog@gmail.com if you encounter any issues
gitHub:
<<To be added>>
Directory structure
-------------------
formatProject
   data
      departments_sorted
         part-m-00000
   formatConverterTextToMap
      src
         FormatConverterTextToMap.java
         MapFileLookup.java
      jars
         formatConverterTextToMap.jar
**************************************************
Input text file - departments_sorted/part-m-00000
**************************************************
$ more formatProject/data/departments_sorted/part-m-00000
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
**********************************************
hdfs load commands
**********************************************
# Load data
$ hadoop fs -put formatProject/
# Remove unnecessary files
$ hadoop fs -rm -R formatProject/formatConverterTextToMap/
/******************************************
 * FormatConverterTextToMap.java
 * ****************************************/
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class FormatConverterTextToMap {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    try {
      FileSystem fs = FileSystem.get(conf);
      Path inputFile = new Path(args[0]);
      Path outputFile = new Path(args[1]);

      Text txtKey = new Text();
      Text txtValue = new Text();

      MapFile.Writer writer = null;
      FSDataInputStream inputStream = fs.open(inputFile);
      try {
        writer = new MapFile.Writer(conf, fs, outputFile.toString(),
            txtKey.getClass(), txtValue.getClass());
        // Need this as the default is 128, and my data is just 9 records
        writer.setIndexInterval(1);
        while (inputStream.available() > 0) {
          String strLineInInputFile = inputStream.readLine();
          String[] lstKeyValuePair = strLineInInputFile.split("\\t");
          if (lstKeyValuePair.length < 2) {
            continue; // skip blank or malformed lines
          }
          txtKey.set(lstKeyValuePair[0]);
          txtValue.set(lstKeyValuePair[1]);
          writer.append(txtKey, txtValue);
        }
        System.out.println("Map file created successfully!!");
      } finally {
        IOUtils.closeStream(writer);
        IOUtils.closeStream(inputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
******************************************************************
**Command to run the program that creates a map file from a text file
******************************************************************
$ hadoop jar formatProject/formatConverterTextToMap/jars/formatConverterTextToMap.jar FormatConverterTextToMap formatProject/data/departments_sorted/part-m-00000 formatProject/data/departments_map
13/09/12 22:05:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Map file created successfully!!
************************************************
**Results
************************************************
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index
$ hadoop fs -text formatProject/data/departments_map/data
13/09/12 22:44:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
$ hadoop fs -text formatProject/data/departments_map/index
13/09/12 22:44:56 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380
/****************************************
 * MapFileLookup.java
 * **************************************/
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {

  /*
   * This program looks up a key in a map file and prints the associated value.
   *
   * Parameters:
   * param 1: Path to the map file directory
   * param 2: Key for which we want the value from the map file
   *
   * Sample call: hadoop jar MapFileLookup.jar MapFileLookup <map-file-directory> <key>
   */
  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Text txtKey = new Text(args[1]);
    Text txtValue = new Text();
    MapFile.Reader reader = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      reader = new MapFile.Reader(fs, args[0], conf);
      // get() fills txtValue, and returns null if the key is not found
      reader.get(txtKey, txtValue);
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (reader != null)
        reader.close();
    }
    System.out.println("The key is " + txtKey.toString()
        + " and the value is " + txtValue.toString());
  }
}
**************************************************************************
**Command to run the program that looks up a key in a map file
**************************************************************************
$ hadoop jar formatProject/formatConverterTextToMap/jars/MapFileLookup.jar MapFileLookup formatProject/data/departments_map "d009"
13/09/12 22:53:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
The key is d009 and the value is Customer Service

4.0. Looking up a key in a map file

This is covered already in the gist, under section 3.
The plan is to use the map file in a map-side join in a subsequent blog.

5.0. Any thoughts/comments

Any constructive criticism and/or additions/insights are much appreciated.

Cheers!!
