1.0. What's in this blog?
1. Introduction to map files
2. Sample code to convert a text file to a map file
3. Sample code to read a map file
2.0. What's a Map File?
2.0.1. Definition:
From Hadoop: The Definitive Guide:
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
From the Apache documentation:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html
2.0.2. Datatypes:
The keys must be instances of WritableComparable, and the values, Writable.
2.0.3. Creating map files:
Create an instance of MapFile.Writer and call append() to add key-value pairs, in sorted key order.
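For illustration, here is a minimal sketch of the writer API (not the full program from section 3.0; the class name and the output path "demo_map" are made up for this example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteSketch {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = null;
    try {
      // The map file is created as a directory named "demo_map"
      writer = new MapFile.Writer(conf, fs, "demo_map", Text.class, Text.class);
      writer.setIndexInterval(1); // index every key; the default is 128
      writer.append(new Text("d001"), new Text("Marketing")); // keys must be appended in sorted order
      writer.append(new Text("d002"), new Text("Finance"));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}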
2.0.4. Looking up data in map files:
Create an instance of MapFile.Reader and call get(key, value).
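And a matching sketch for the reader side (again just an illustration; it assumes the hypothetical "demo_map" directory from the writer sketch above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileReadSketch {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = null;
    try {
      reader = new MapFile.Reader(fs, "demo_map", conf);
      Text value = new Text();
      Writable hit = reader.get(new Text("d002"), value); // fills 'value'; returns null if the key is absent
      System.out.println(hit == null ? "Key not found" : value.toString());
    } finally {
      if (reader != null)
        reader.close();
    }
  }
}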
2.0.5. Construct
A map file is actually a directory containing two files: an "index" file and a "data" file.
The data file is a SequenceFile containing the keys and their associated values.
The index file is smaller and also holds key-value pairs, where each key is an actual key from the data file and the value is that key's byte offset within the data file. The index contains only a fraction of the keys, as determined by MapFile.Writer.getIndexInterval().
2.0.5.1. Directory structure:
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index
2.0.5.2. Content of the file 'data':
$ hadoop fs -text formatProject/data/departments_map/data
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
2.0.5.3. Content of the file 'index':
$ hadoop fs -text formatProject/data/departments_map/index
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380
2.0.6. Behind the scenes of a lookup
The index file is read into memory and binary-searched for the largest key less than or equal to the one being looked up. The reader then seeks to that key's offset in the data file and scans forward until it reaches the lookup key, extracting and returning the associated value. A null is returned if the key is not found.
If the index is too large to hold in memory, a configuration property can tell the reader to load only a subset of the index keys, as in the sketch below.
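As a hedged illustration of that setting: the property io.map.index.skip (read by MapFile.Reader) makes the reader keep only a subset of the index entries in memory, at the cost of a slightly longer scan in the data file per lookup. A sketch, assuming the departments map file built later in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileIndexSkipSketch {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep roughly 1 in every 8 index entries in memory (skip 7 entries between kept ones)
    conf.setInt("io.map.index.skip", 7);
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "formatProject/data/departments_map", conf);
    try {
      Text value = new Text();
      System.out.println(reader.get(new Text("d007"), value) == null ? "Key not found" : value.toString());
    } finally {
      reader.close();
    }
  }
}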
2.0.7. Usage
Fast lookups - in joins, among others.
Can also be used as a container for small files, with the filename as the key; see the sketch below.
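A sketch of that small-files idea (the class name and paths are hypothetical, and it assumes the input directory holds only plain files): pack every file in a directory into one map file keyed by file name, with the raw bytes as the value. The file names are sorted first because append() requires keys in ascending order.

import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToMapFile {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]); // directory of small files
    String outputMap = args[1];        // map file directory to create

    FileStatus[] files = fs.listStatus(inputDir);
    // Keys must be appended in sorted order, so sort by file name first
    Arrays.sort(files, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });

    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, outputMap, Text.class, BytesWritable.class);
      for (FileStatus file : files) {
        byte[] contents = new byte[(int) file.getLen()];
        FSDataInputStream in = fs.open(file.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        writer.append(new Text(file.getPath().getName()), new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}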
3.0. Creating a map file
This gist demonstrates how to create a map file, from a text file.

Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
7. Java program to lookup data in map file
8. Command to run program to do a lookup
01. Data and script download
-----------------------------
Google:
<<To be added>>
Email me at airawat.blog@gmail.com if you encounter any issues

gitHub:
<<To be added>>

Directory structure
-------------------
formatProject
  data
    departments_sorted
      part-m-00000
  formatConverterTextToMap
    src
      FormatConverterTextToMap.java
      MapFileLookup.java
    jars
      formatConverterTextToMap.jar
**************************************************
Input text file - departments_sorted/part-m-00000
**************************************************

$ more formatProject/data/departments_sorted/part-m-00000
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
**********************************************
hdfs load commands
**********************************************

# Load data
$ hadoop fs -put formatProject/

# Remove unnecessary files
$ hadoop fs -rm -R formatProject/formatConverterTextToMap/
/******************************************
 * FormatConverterTextToMap.java
 * ****************************************/
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FSDataInputStream;

public class FormatConverterTextToMap {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs;

    try {
      fs = FileSystem.get(conf);

      Path inputFile = new Path(args[0]);
      Path outputFile = new Path(args[1]);

      Text txtKey = new Text();
      Text txtValue = new Text();

      String strLineInInputFile = "";
      String lstKeyValuePair[] = null;
      MapFile.Writer writer = null;

      FSDataInputStream inputStream = fs.open(inputFile);

      try {
        writer = new MapFile.Writer(conf, fs, outputFile.toString(),
            txtKey.getClass(), txtValue.getClass());
        writer.setIndexInterval(1); // Need this as the default is 128, and my data is just 9 records

        // Each input line is expected to be key<tab>value
        while (inputStream.available() > 0) {
          strLineInInputFile = inputStream.readLine();
          lstKeyValuePair = strLineInInputFile.split("\\t");
          txtKey.set(lstKeyValuePair[0]);
          txtValue.set(lstKeyValuePair[1]);
          writer.append(txtKey, txtValue);
        }
      } finally {
        IOUtils.closeStream(writer);
        System.out.println("Map file created successfully!!");
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
******************************************************************
**Command to run program that creates a map file from text file
******************************************************************

$ hadoop jar formatProject/formatConverterTextToMap/jars/formatConverterTextToMap.jar FormatConverterTextToMap formatProject/data/departments_sorted/part-m-00000 formatProject/data/departments_map

13/09/12 22:05:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Map file created successfully!!
************************************************
**Results
************************************************

$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index

$ hadoop fs -text formatProject/data/departments_map/data
13/09/12 22:44:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service

$ hadoop fs -text formatProject/data/departments_map/index
13/09/12 22:44:56 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380
/****************************************
 * MapFileLookup.java
 * **************************************/
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;

public class MapFileLookup {

  /*
  This program looks up a map file for a certain key and returns the associated value

  Parameters:
  param 1: Path to map file
  param 2: Key for which we want to get the value from the map file

  Return: The value for the key
  Return type: Text

  Sample call: hadoop jar MapFileLookup.jar MapFileLookup <map-file-directory> <key>
  */

  @SuppressWarnings("deprecation")
  public static Text main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = null;
    Text txtKey = new Text(args[1]);
    Text txtValue = new Text();
    MapFile.Reader reader = null;

    try {
      fs = FileSystem.get(conf);

      try {
        reader = new MapFile.Reader(fs, args[0], conf);
        reader.get(txtKey, txtValue);
      } catch (IOException e) {
        e.printStackTrace();
      }

    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (reader != null)
        reader.close();
    }

    System.out.println("The key is " + txtKey.toString()
        + " and the value is " + txtValue.toString());

    return txtValue;
  }
}
**************************************************************************
**Commands to run program to look up a key in a map file from text file
**************************************************************************

$ hadoop jar formatProject/formatConverterTextToMap/jars/MapFileLookup.jar MapFileLookup formatProject/data/departments_map "d009"

13/09/12 22:53:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
The key is d009 and the value is Customer Service
4.0. Looking up a key in a map file
Covered already in the gist under section 3.
The plan is to use the map file in a map-side join in a subsequent blog.
5.0. Any thoughts/comments
Any constructive criticism and/or additions/insights are much appreciated.
Cheers!!
6.0. Comments
Great Post Anagha... thanks for explaining in detail.

The explanations are clear and the overall program looks good. It would be nice to give an example of reading a mapfile.tar.gz with Java; the problem being that your reduce-side joins in Java map-reduce use a map.tar.gz file but cannot find keys in it, at least on Cloudera-Training-VM-4.1.0.a-vmware. -Tris

Hey!
On running this code I am getting this error:
Any insights on this? Thanks in advance

$ hadoop jar mapf.jar FormatConverterTextToMap /user/input/inp.txt /user/output
16/04/10 11:01:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/10 11:01:26 INFO compress.CodecPool: Got brand-new compressor [.deflate]
16/04/10 11:01:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Map file created successfully!!
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at FormatConverterTextToMap.main(FormatConverterTextToMap.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)