Thursday, September 12, 2013

Map File - construct, usage, code samples

This post covers the map file format, with links to the Apache documentation, my notes on the topic, and sample programs demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?

1.  Introduction to map file
2.  Sample code to convert a text file to a map file
3.  Sample code to read a map file

2.0. What's a Map File?

2.0.1. Definition:
From Hadoop: The Definitive Guide:
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Apache documentation:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/MapFile.html 

2.0.2. Datatypes: 
The keys must be instances of WritableComparable, and the values, Writable.

2.0.3. Creating map files: 
Create an instance of MapFile.Writer and call append() to add key-value pairs, in sorted key order.

2.0.4. Looking up data in map files: 
Create an instance of MapFile.Reader and call get(key, value).

2.0.5. Construct
The map file is actually a directory containing two files: an "index" file and a "data" file.
The data file is a sequence file holding the keys and their associated values.
The index file is smaller. It holds key-value pairs in which the key is an actual key from the data file and the value is that key's byte offset within the data file. The index contains only a fraction of the keys, as determined by MapFile.Writer.getIndexInterval().
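As a rough illustration of how the index interval controls the index size: the writer records an index entry for the first key and then for every indexInterval-th key appended, so the number of index entries is approximately ceil(records / interval). A minimal sketch in plain Java (no Hadoop dependency; the class and method names here are mine, for illustration only):

```java
// Illustration only: how many entries land in the 'index' file for a
// given record count and index interval.
public class IndexIntervalDemo {

    // One index entry is written for every 'interval'-th appended key,
    // starting with the first, i.e. ceil(records / interval) entries.
    static int indexEntries(int records, int interval) {
        return (records + interval - 1) / interval;
    }

    public static void main(String[] args) {
        // The sample data set: 9 department records
        System.out.println(indexEntries(9, 1));    // interval 1: every key indexed -> 9
        System.out.println(indexEntries(9, 128));  // default interval 128 -> 1
    }
}
```

This is why the sample program below calls setIndexInterval(1): with only 9 records and the default interval of 128, the index would hold a single key.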

2.0.5.1. Directory structure:
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index

2.0.5.2. Content of the file 'data':
$ hadoop fs -text formatProject/data/departments_map/data
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service

2.0.5.3. Content of the file 'index':
$ hadoop fs -text formatProject/data/departments_map/index
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380

2.0.6. Behind the scenes of a look up
The index file is read into memory and binary-searched for the greatest key less than or equal to the one being looked up. The reader then seeks to that key's offset in the data file and scans forward until it reaches the lookup key, extracting and returning the associated value. If the key is not found, null is returned.

If the index itself is too large to load into memory, there are configurations that can be set to load only every nth index key.
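The seek-and-scan step described above can be simulated with java.util.TreeMap. This is an illustration only, not the actual MapFile internals; suppose the index interval were 4, so the index held only d001, d005, and d009:

```java
import java.util.TreeMap;

// Illustration only: simulates the "find the greatest indexed key <= the
// lookup key, then seek" step of a map file lookup using TreeMap.floorKey().
public class IndexLookupSketch {

    // Returns the indexed key the reader would seek to before scanning
    // forward through the data file.
    static String seekKey(TreeMap<String, Long> index, String lookupKey) {
        return index.floorKey(lookupKey);
    }

    public static void main(String[] args) {
        // A sparse index: key -> byte offset in the data file (offsets taken
        // from the 'index' file listed earlier)
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("d001", 121L);
        index.put("d005", 250L);
        index.put("d009", 380L);

        // Looking up d007: the reader seeks to d005 (offset 250) and scans
        // forward through the data file until it reaches d007.
        String start = seekKey(index, "d007");
        System.out.println("Seek to " + start + " at offset " + index.get(start));
        // -> Seek to d005 at offset 250
    }
}
```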

2.0.7. Usage
Fast lookups - in joins, among other uses.
A map file can also be used as a container for small files, with the filename as the key.

3.0. Creating a map file

This gist demonstrates how to create a map file from a text file, and how to look up a key in it.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
7. Java program to lookup data in map file
8. Command to run program to do a lookup
01. Data and script download
-----------------------------
Google:
<<To be added>>
Email me at airawat.blog@gmail.com if you encounter any issues
gitHub:
<<To be added>>
Directory structure
-------------------
formatProject
   data
      departments_sorted
         part-m-00000
   formatConverterTextToMap
      src
         FormatConverterTextToMap.java
         MapFileLookup.java
      jars
         formatConverterTextToMap.jar
**************************************************
Input text file - departments_sorted/part-m-00000
**************************************************
$ more formatProject/data/departments_sorted/part-m-00000
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
**********************************************
hdfs load commands
**********************************************
# Load data
$ hadoop fs -put formatProject/
# Remove unnecessary files
$ hadoop fs -rm -R formatProject/formatConverterTextToMap/
/******************************************
 * FormatConverterTextToMap.java
 * ****************************************/
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class FormatConverterTextToMap {

  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    try {
      FileSystem fs = FileSystem.get(conf);
      Path inputFile = new Path(args[0]);
      Path outputFile = new Path(args[1]);

      Text txtKey = new Text();
      Text txtValue = new Text();

      MapFile.Writer writer = null;
      FSDataInputStream inputStream = fs.open(inputFile);
      try {
        writer = new MapFile.Writer(conf, fs, outputFile.toString(),
            txtKey.getClass(), txtValue.getClass());
        // Need this as the default is 128, and my data is just 9 records
        writer.setIndexInterval(1);
        while (inputStream.available() > 0) {
          String strLineInInputFile = inputStream.readLine();
          String[] lstKeyValuePair = strLineInInputFile.split("\\t");
          if (lstKeyValuePair.length < 2) {
            continue; // skip blank or malformed lines
          }
          txtKey.set(lstKeyValuePair[0]);
          txtValue.set(lstKeyValuePair[1]);
          writer.append(txtKey, txtValue);
        }
        System.out.println("Map file created successfully!!");
      } finally {
        IOUtils.closeStream(writer);
        IOUtils.closeStream(inputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
******************************************************************
**Command to run the program that creates a map file from a text file
******************************************************************
$ hadoop jar formatProject/formatConverterTextToMap/jars/formatConverterTextToMap.jar FormatConverterTextToMap formatProject/data/departments_sorted/part-m-00000 formatProject/data/departments_map
13/09/12 22:05:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/09/12 22:05:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Map file created successfully!!
************************************************
**Results
************************************************
$ hadoop fs -ls formatProject/data/departments_map | awk '{print $8}'
formatProject/data/departments_map/data
formatProject/data/departments_map/index
$ hadoop fs -text formatProject/data/departments_map/data
13/09/12 22:44:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Research
d009 Customer Service
$ hadoop fs -text formatProject/data/departments_map/index
13/09/12 22:44:56 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:44:56 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
d001 121
d002 152
d003 181
d004 218
d005 250
d006 283
d007 323
d008 350
d009 380
/****************************************
 * MapFileLookup.java
 * **************************************/
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {

  /*
   * This program looks up a key in a map file and prints the associated value.
   *
   * Parameters:
   * param 1: Path to the map file directory
   * param 2: Key for which we want the value from the map file
   *
   * Sample call: hadoop jar MapFileLookup.jar MapFileLookup <map-file-directory> <key>
   */
  @SuppressWarnings("deprecation")
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Text txtKey = new Text(args[1]);
    Text txtValue = new Text();
    MapFile.Reader reader = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      reader = new MapFile.Reader(fs, args[0], conf);
      // get() fills txtValue, and returns null if the key is not found
      reader.get(txtKey, txtValue);
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (reader != null)
        reader.close();
    }
    System.out.println("The key is " + txtKey.toString()
        + " and the value is " + txtValue.toString());
  }
}
**************************************************************************
**Command to run the program that looks up a key in a map file
**************************************************************************
$ hadoop jar formatProject/formatConverterTextToMap/jars/MapFileLookup.jar MapFileLookup formatProject/data/departments_map "d009"
13/09/12 22:53:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
13/09/12 22:53:08 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
The key is d009 and the value is Customer Service

4.0. Looking up a key in a map file

This is covered already in the gist, under section 3.
The plan is to use the map file in a map-side join in a subsequent blog.

5.0. Any thoughts/comments

Any constructive criticism and/or additions/insights are much appreciated.

Cheers!!
