SVDS logo

Format Wars:
From VHS and Beta to Avro and Parquet

There are several data formats to choose from to load your data into the Hadoop Distributed File System (HDFS). Each of the data formats has its own strengths and weaknesses, and understanding the trade-offs will help you choose a data format that fits your system and goals.

On this page we’ll provide you with a quick overview of the some of the most popular data formats and we'll share the results of a series of tests we performed to compare them.

Overview on Data Formats

Text Sequence Avro Parquet ORC

Rather than thinking about free format text files, in the Hadoop ecosystem we are used to thinking about delimited files such as csv, and tsv. We can also think about json records as long as each line is its own jason datum.

Text files are a convenient format to use to exchange data between Hadoop and external applications and systems that produce and/or read delimited text files.

  • Text files are human readable and easily parsable
  • Text files are slow to read and write.
  • Data size is relatively bulky and not as efficient to query.
  • No metadata is stored in the text files so we need to know how the structure of the fields.
  • Text files are not splittable after compression
  • Limited support for schema evolution: new fields can only be appended at the end of the records and existing fields can never be removed.

Apache Hadoops’s SequenceFile provides a persistent data structure for binary key-value pairs. Hadoop provides with instances and methods so that we can write key-value pairs in SequenceFiles.

There are 3 different formats for SequenceFiles depending on the Compression Type specified:

  • Uncompressed format
  • Record compressed format
  • Block compressed format

The SequenceFile is the base data structure for the other types of files like MapFile, SetFile, ArrayFile, and BloomMapFile.

Some of the common use cases include: the transferring of data between Map Reduce jobs and as an archive/container to pack small Hadoop files where the metadata(filename, path, creation time) is stored as the key and the file contents are stored as the value.

  • Row-based
  • More compact than text files
  • You can’t perform specified key editing, adding, removal: files are append only
  • Encapsulated into the hadoop environment
  • Support splitting even when the data inside the file is compressed
  • The sequence file reader will read until a sync marker is reached ensuring that a record is read as a whole
  • Sequence files do not store metadata, so the only schema evolution option is appending new fields
Sequence File Structure

Apache Avro is widely used as a serialization platform, as it is interoperable across multiple languages, offers a compact and fast binary format, supports dynamic schema discovery and schema evolution, and is compressible and splittable. It also offers complex data structures like nested types.

Avro was created by Doug Cutting, creator of Hadoop, and its first stable release was in 2009. It is often compared to other popular serialization frameworks such as Protobuff and Thrift.

Avro relies on schemas. When Avro data is read, the schema used when writing is also present. This permits each datum to be written with no per-value overheads, making serialization both fast and with smaller file sizes.

Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. However, for debugging and web-based applications, the JSON encoding may sometimes be appropriate.

  • Row-based
  • Direct mapping from/to JSON
  • Interoperability: can serialize into Avro/Binary or Avro/Json
  • Provides rich data structures
  • Map keys can only be strings (could be seen as a limitation)
  • Compact binary form
  • Extensible schema language
  • Untagged data
  • Bindings for a wide variety of programming languages
  • Dynamic typing
  • Provides a remote procedure call
  • Supports block compression
  • Avro files are splittable
  • Best compatibility for evolving data schemas
Avro File Structure

Apache Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries.

Parquet came out of a collaboration between Twitter and Cloudera in 2013 and it uses the record shredding and assembly algorithm described in the Dremel paper.

Parquet is good for queries scanning particular columns within a table, for example querying “wide” tables with many columns or performing aggregation operations like AVG() for the values of a single column.

Each data file contains the values for a set of rows (“the row group”). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column.

  • Column-oriented
  • Efficient in terms of disk I/O and memory utilization
  • Efficiently encoding of nested structures and sparsely populated data.
  • Provides extensible support for per-column encodings.
  • Provides extensibility of storing multiple types of data in column data.
  • Offers better write performance by storing metadata at the end of the file.
  • Records in columns are homogeneous so it’s easier to apply encoding schemes.
  • Parquet supports Avro files via object model converters that map an external object model to Parquet’s internal data types
Parquet File Structure

Apache ORC (Optimized Row Columnar) was initially part of the Stinger intiative to speed up Apache Hive, and then in 2015 it became an Apache top-level project.

The ORC file format is columnar type format that provides a highly efficient way to store Hive data.

It was created in 2013 by Hortonworks to optimize existing RCFiles in collaboration with Microsoft. Nowadays, the Stinger initiative heads the ORC file format development to replace RCFiles

The main goals of the ORC are to improve query speed and to improve storage efficiency.

ORC stores collections of rows in one file and within the collection the row data is stored in a columnar format. This allows parallel procession of row collections across a cluster.

  • Column-oriented
  • Lightweight indexes stored within the file
  • Ability to skip row groups that don’t pass predicate filtering
  • Block-mode compression based on data type
  • Includes basic statistics (min, max, sum, count) on columns
  • Concurrent reads of the same file using separate RecordReaders
  • Ability to split files without scanning for markers
  • Metadata is stored using protobuf, which allows schema evolution by supporting addition and removal of fields
    ORC File Structure

    How to choose a Data Format?

    Choosing a data format is not always black and white, it will depend on several characteristics including:

    • Size and characteristics of the data
    • Project infrastructure
    • Use case scenarios

    We discuss the different considerations in more detail in our blogpost.


    Data Formats to the Test

    Background

    We set out to create some tests so we can compare the different data formats in terms of speed to write and speed to read a file. You can recreate the tests in your own system with the code used in this blogpost which can be found here: SVDS Data Formats Repository.

    Disclaimer: We didn’t design these tests to act as benchmarks. In fact, all of our tests ran in different platforms with different cluster sizes. Therefore, you might experience faster or slower times depending on your own system’s configuration. The main takeaway is the difference of the different data formats performance within the same system.

    The Setup

    We generated 3 different datasets to run the tests:


    Narrow Dataset

    10 columns

    10 million rows

    Resembles an Apache server log with 10 columns of information

    Wide Dataset

    1000 columns

    4 million rows

    Includes 15 columns of information and the rest of the columns resemble choices. The overall dataset might be something that keeps track of consumer choices or behaviors

    Large Dataset

    1000 columns

    302,924,000 rows

    1 TB of data

    Follows the same format as the wide dataset


    We performed tests in Hortonworks, Cloudera, Altiscale, and Amazon EMR distributions of Hadoop.

    For the writing tests we measured how long it took Hive to write a new table in the specified data format.

    For the reading tests we used Hive and Impala to perform queries and record the execution time each of the queries.

    We used snappy compression for most of the data formats, with the exception of Avro where we additionally used the deflate compression.


    The queries ran to measure read speed, were in the form of:

    SELECT COUNT(*) FROM TABLE WHERE …

    Query 1 includes no additional conditions.

    Query 2 includes 5 conditions.

    Query 3 includes 10 conditions.

    Query 4 includes 20 conditions.


    Write Results


    Hive
    Narrow Dataset
    Wide Dataset
    Large Dataset
    Narrow Dataset
    Wide Dataset

    Read Results


    Hive Impala
    Map Reduce: Narrow Dataset
    Map Reduce: Wide Dataset
    Tez: Narrow Dataset
    Tez: Wide Dataset
    Narrow Dataset
    Wide Dataset
    Large Dataset
    Narrow Dataset
    Wide Dataset
    Narrow Dataset
    Wide Dataset
    Narrow Dataset
    Wide Dataset
    Large Dataset

    Team


    Stephen O'Sullivan

    Stephen O’Sullivan

    VP of Engineering

    @steveos

    Silvia Oliveros

    Silvia Oliveros

    Data Engineer

    @soliverost