Friday, September 13, 2013

Sequence File - construct, usage, code samples

This post covers the sequence file format: links to the Apache documentation, my notes on the topic, and sample programs demonstrating the functionality. Feel free to share any insights or constructive criticism. Cheers!!

1.0. What's in this blog?


1.  Introduction to sequence file format
2.  Sample code to create a sequence file (compressed and uncompressed) from a text file in a MapReduce program, and to read a sequence file.

2.0. What's a Sequence File?


2.0.1. About sequence files:
A sequence file is a persistent data structure for binary key-value pairs.

2.0.2. Construct:
Sequence files include sync points after every few records; these align with record boundaries and allow a reader to resynchronize with the stream from an arbitrary position. The sync points are what make sequence files splittable for MapReduce processing. Sequence files support record-level and block-level compression.
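
To make the sync mechanism concrete, here is a minimal sketch (the file path and byte offset are illustrative) that positions a reader at the first sync point past an arbitrary offset, the same operation MapReduce uses to align an input split with a record boundary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SyncPointDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical file path, for illustration only
    Path path = new Path("/data/sample.seq");
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      // Seek to the first sync point after byte 1024 (an arbitrary offset);
      // the reader is now positioned on a record boundary
      reader.sync(1024);
      System.out.println("Next record boundary at byte " + reader.getPosition());
    }
  }
}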

Apache documentation: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html

Excerpts from Hadoop: The Definitive Guide...
"A sequence file consists of a header followed by one or more records. The first three bytes of a sequence file are the bytes SEQ, which acts as a magic number, followed by a single byte representing the version number. The header contains other fields, including the names of the key and value classes, compression details, user-defined metadata, and the sync marker. 
Structure of a sequence file with and without record compression:

[Figure: sequence file layout, uncompressed vs. record-compressed]

The format for record compression is almost identical to no compression, except the value bytes are compressed using the codec defined in the header. Note that keys are not compressed.
Structure of a sequence file with and without block compression:

[Figure: sequence file layout with block compression]

Block compression compresses multiple records at once; it is therefore more compact than and should generally be preferred over record compression because it has the opportunity to take advantage of similarities between records. A sync marker is written before the start of every block. The format of a block is a field indicating the number of records in the block, followed by four compressed fields: the key lengths, the keys, the value lengths, and the values."

Uncompressed, record-compressed, and block-compressed sequence files all share the same header.  The details below are from the Apache documentation on sequence files; a short program after the lists shows how to read these header fields back.

SequenceFile Header
  • version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName - key class
  • valueClassName - value class
  • compression - A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata - SequenceFile.Metadata for this file.
  • sync - A sync marker to denote end of the header.
Uncompressed SequenceFile Format
  • Header
  • Record
      • Record length
      • Key length
      • Key
      • Value
  • A sync-marker every few 100 bytes or so.
Record-Compressed SequenceFile Format
  • Header
  • Record
      • Record length
      • Key length
      • Key
      • Compressed Value
  • A sync-marker every few 100 bytes or so.
Block-Compressed SequenceFile Format
  • Header
  • Record Block
      • Uncompressed number of records in the block
      • Compressed key-lengths block-size
      • Compressed key-lengths block
      • Compressed keys block-size
      • Compressed keys block
      • Compressed value-lengths block-size
      • Compressed value-lengths block
      • Compressed values block-size
      • Compressed values block
  • A sync-marker every block.
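
The header fields listed above can be inspected through the reader API. A minimal sketch, assuming an existing sequence file at an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class HeaderDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/sample.seq");  // illustrative path
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      // Each getter below surfaces one of the header fields described above
      System.out.println("key class:        " + reader.getKeyClassName());
      System.out.println("value class:      " + reader.getValueClassName());
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      System.out.println("codec:            " + reader.getCompressionCodec());
      System.out.println("metadata:         " + reader.getMetadata());
    }
  }
}
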
2.0.3. Datatypes: 
The keys and values need not be instances of Writable; they just need to support serialization.

2.0.4. Creating sequence files: 
Uncompressed: Create an instance of SequenceFile.Writer and call append() to add key-value pairs in order.  For record-compressed and block-compressed files, refer to the Apache documentation.  When creating compressed files, the actual compression algorithm used for keys and/or values is specified by supplying the appropriate CompressionCodec.
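
A minimal writer sketch; the path, the IntWritable/Text key-value types, the BLOCK compression type, and the DefaultCodec are all illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class WriterDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/sample.seq");  // illustrative path
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        // Omit this option (or use CompressionType.NONE) for an uncompressed file
        SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))) {
      for (int i = 0; i < 100; i++) {
        // append() adds key-value pairs in order
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    }
  }
}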

2.0.5. Reading data in sequence files: 
Create an instance of SequenceFile.Reader and iterate through the entries using reader.next(key, value).
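
A minimal reader sketch; the key and value classes recorded in the file's header are instantiated via ReflectionUtils, so the same loop works for any Writable types (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReaderDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/sample.seq");  // illustrative path
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      // Instantiate the key/value types named in the header
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {  // returns false at end of file
        System.out.println(key + "\t" + value);
      }
    }
  }
}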

2.0.6. Usage
- Data storage for key-value data
- Container for other files (e.g., packaging many small files into one larger, splittable file)
- Efficient from a storage perspective (binary format) and from a MapReduce processing perspective (supports compression and splitting)

3.0. Creating a sequence file
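
A minimal sketch of a map-only MapReduce job that converts a text file into a block-compressed sequence file; the class names, input/output paths, and codec choice are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {

  // Identity mapper: the line's byte offset becomes the key, the line the value
  public static class IdentityMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "text-to-seqfile");
    job.setJarByClass(TextToSequenceFile.class);
    job.setMapperClass(IdentityMapper.class);
    job.setNumReduceTasks(0);                          // map-only job
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // Drop these three lines for an uncompressed sequence file
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the input and output directories as arguments; setting the number of reduce tasks to zero makes each mapper write its output directly as a sequence file.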


4.0. Reading a sequence file

Covered already in the gist under section 3; see also the reader sketch in section 2.0.5.

5.0. Any thoughts/comments

Any constructive criticism and/or additions/insights are much appreciated.

Cheers!!




