1.0. What's in this blog?
1. Introduction to map file
2. Sample code to convert a text file to a map file
3. Sample code to read a map file
2.0. What's a Map File?
2.0.1. Definition:
From Hadoop: The Definitive Guide:
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Apache documentation:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html
2.0.2. Datatypes:
The keys must be instances of WritableComparable, and the values, Writable.
2.0.3. Creating map files:
Create an instance of MapFile.Writer and call append() to add key-value pairs. Keys must be appended in sorted order; append() throws an IOException if a key is out of order.
2.0.4. Looking up data in map files:
Create an instance of MapFile.Reader and call get(key, value). The call fills in the value object and returns it, or returns null if the key is not present.
2.0.5. Construct
A map file is actually a directory that contains two files: an "index" file and a "data" file.
The data file is a sequence file and has keys and associated values.
The index file is smaller. It holds key-value pairs in which the key is an actual key from the data file and the value is the byte offset of that key's record in the data file. Only a fraction of the keys appear in the index; that fraction is set by the index interval (see MapFile.Writer.getIndexInterval()).
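To make the relationship between the two files concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop's implementation and uses no Hadoop classes: the class SparseIndexSketch, its append() method, and the simplified offset arithmetic (key bytes plus value bytes, with no record headers) are all assumptions made for illustration. It mimics two behaviours described above: keys must arrive in sorted order, and only every Nth key gets an index entry that points at its record's offset in the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy model of a map file: sorted "data" records plus a sparse "index". Not Hadoop code. */
public class SparseIndexSketch {
    public final List<String[]> data = new ArrayList<>();      // (key, value) records in key order
    public final List<String> indexKeys = new ArrayList<>();   // every Nth key
    public final List<Long> indexOffsets = new ArrayList<>();  // offset of that key's record
    private final int indexInterval;
    private long nextOffset = 0;   // simplified: real records also carry headers and lengths
    private String lastKey = null;

    public SparseIndexSketch(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("index interval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    /** Appends one record; rejects keys that sort before the previous key. */
    public void append(String key, String value) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalArgumentException("key out of order: " + key + " after " + lastKey);
        }
        // Index the 1st, (N+1)th, (2N+1)th ... record.
        if (data.size() % indexInterval == 0) {
            indexKeys.add(key);
            indexOffsets.add(nextOffset);
        }
        data.add(new String[] {key, value});
        nextOffset += byteLength(key) + byteLength(value);
        lastKey = key;
    }

    private static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        SparseIndexSketch sketch = new SparseIndexSketch(2);
        sketch.append("d001", "Marketing");
        sketch.append("d002", "Finance");
        sketch.append("d003", "Human Resources");
        sketch.append("d004", "Production");
        System.out.println("index keys:    " + sketch.indexKeys);
        System.out.println("index offsets: " + sketch.indexOffsets);
    }
}
```

With an interval of 2, only every second key lands in the index. With an interval of 1, every key is indexed, which is what the sample index listing in this post looks like.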
2.0.5.1. Directory structure:
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index
2.0.5.2. Content of the file 'data':
$ hadoop fs -text formatProject/data/departments_map/data
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
2.0.5.3. Content of the file 'index':
$ hadoop fs -text formatProject/data/departments_map/index
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380
2.0.6. Behind the scenes of a look up
The index file is read into memory. A binary search over the index finds the largest indexed key that is less than or equal to the key being looked up. The reader seeks to that entry's offset in the data file, reads forward until it reaches the requested key, and returns the associated value. If the key is not found, it returns null.
If the index is too large to load into memory, configuration settings (for example, the io.map.index.skip property) let the reader skip keys in the index so that only a fraction of it is loaded.
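The lookup steps above can be sketched in plain Java. This is an illustrative model, not Hadoop's code: LookupSketch, buildIndex() and get() are hypothetical names, the "data file" is a pair of sorted arrays, and the index stores record positions rather than byte offsets. Building the index with a larger interval stands in for loading only part of the index: lookups still succeed, but the forward scan gets longer.

```java
import java.util.Arrays;

/** Toy model of a map-file lookup: binary search a sparse index, then scan forward. */
public class LookupSketch {
    // Stand-in for the "data" file: records sorted by key.
    public static final String[] KEYS = {
        "d001", "d002", "d003", "d004", "d005", "d006", "d007", "d008", "d009"};
    public static final String[] VALUES = {
        "Marketing", "Finance", "Human Resources", "Production", "Development",
        "Quality Management", "Sales", "Research", "Customer Service"};

    /** Returns the positions of every interval-th record (0, interval, 2*interval, ...). */
    public static int[] buildIndex(int recordCount, int interval) {
        int entries = (recordCount + interval - 1) / interval;
        int[] index = new int[entries];
        for (int i = 0; i < entries; i++) {
            index[i] = i * interval;
        }
        return index;
    }

    /** Returns the value for wanted, or null if the key is absent. */
    public static String get(String[] keys, String[] values, int[] index, String wanted) {
        // Step 1: binary search the index for the last entry whose key is <= wanted.
        int lo = 0;
        int hi = index.length - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[index[mid]].compareTo(wanted) <= 0) {
                start = index[mid];
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // wanted sorts before the first key
        }
        // Step 2: "seek" to that record and scan forward until we reach or pass wanted.
        for (int i = start; i < keys.length; i++) {
            int cmp = keys[i].compareTo(wanted);
            if (cmp == 0) {
                return values[i];
            }
            if (cmp > 0) {
                break; // passed the spot where wanted would be
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] index = buildIndex(KEYS.length, 4);
        System.out.println("index positions: " + Arrays.toString(index));
        System.out.println("d006 -> " + get(KEYS, VALUES, index, "d006"));
        System.out.println("d010 -> " + get(KEYS, VALUES, index, "d010"));
    }
}
```

The trade-off is the usual one for sparse indexes: a smaller in-memory index costs a longer sequential scan per lookup.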
2.0.7. Usage
Fast lookups - in joins, among others.
Can also be used as a container for small files, with the filename as the key.
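For the small-files use case, the main thing to get right is that file names must reach the writer in sorted key order. The sketch below, in plain Java with no Hadoop classes, only gathers the files into a sorted map. SmallFilePacker and collect() are hypothetical names, and the step that would append each entry to a map file is left out. It also assumes ASCII file names, so that Java's String ordering matches a byte-wise key comparison.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Gathers small files into (file name -> contents) entries sorted by name. */
public class SmallFilePacker {

    /** Reads every regular file directly under dir; the TreeMap keeps names sorted. */
    public static TreeMap<String, byte[]> collect(Path dir) throws IOException {
        TreeMap<String, byte[]> entries = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p)) {
                    entries.put(p.getFileName().toString(), Files.readAllBytes(p));
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Demo input: two tiny files created in a temporary directory.
        Path dir = Files.createTempDirectory("small-files");
        Files.write(dir.resolve("b.txt"), "second".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("a.txt"), "first".getBytes(StandardCharsets.UTF_8));

        // Iteration order is sorted by file name, which is the order a writer would need.
        for (Map.Entry<String, byte[]> e : collect(dir).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```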
3.0. Creating a map file
4.0. Looking up a key in a map file
Covered already in the gist under section 3.
The plan is to use the map file in a map-side join in a subsequent blog.
5.0. Any thoughts/comments
Any constructive criticism, additions, or insights are much appreciated.
Cheers!!
Great Post Anagha...thnx for explaining in detail..
The explanations are clear and the overall program looks good. It would be nice to give an example of reading a mapfile.tar.gz with java the problem being that your Reduce-side joins in Java map-reduce uses a map.tar.gz file but cannot find keys in it at least on Cloudera-Training-VM-4.1.0.a-vmware. -Tris
Hey!
On running this code getting this error:-
Any insights on this? Thanks in advance
/$ hadoop jar mapf.jar FormatConverterTextToMap /user/input/inp.txt /user/output
16/04/10 11:01:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/10 11:01:26 INFO compress.CodecPool: Got brand-new compressor [.deflate]
16/04/10 11:01:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Map file created successfully!!
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at FormatConverterTextToMap.main(FormatConverterTextToMap.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Nice)