Data Formats

Overview Characteristics

Rather than thinking about free format text files, in the Hadoop ecosystem we are used to thinking about delimited files such as csv, and tsv. We can also think about json records as long as each line is its own jason datum.

Text files are a convenient format to use to exchange data between Hadoop and external applications and systems that produce and/or read delimited text files.

Text files are human readable and easily parsable
Text files are slow to read and write.
Data size is relatively bulky and not as efficient to query.
No metadata is stored in the text files so we need to know how the structure of the fields.
Text files are not splittable after compression
Limited support for schema evolution: new fields can only be appended at the end of the records and existing fields can never be removed.

Overview Characteristics Structure

Apache Hadoops’s SequenceFile provides a persistent data structure for binary key-value pairs. Hadoop provides with instances and methods so that we can write key-value pairs in SequenceFiles.

There are 3 different formats for SequenceFiles depending on the Compression Type specified:

Uncompressed format
Record compressed format
Block compressed format

The SequenceFile is the base data structure for the other types of files like MapFile, SetFile, ArrayFile, and BloomMapFile.

Some of the common use cases include: the transferring of data between Map Reduce jobs and as an archive/container to pack small Hadoop files where the metadata(filename, path, creation time) is stored as the key and the file contents are stored as the value.

Row-based
More compact than text files
You can’t perform specified key editing, adding, removal: files are append only
Encapsulated into the hadoop environment
Support splitting even when the data inside the file is compressed
The sequence file reader will read until a sync marker is reached ensuring that a record is read as a whole
Sequence files do not store metadata, so the only schema evolution option is appending new fields

Overview Characteristics Structure

Apache Avro is widely used as a serialization platform, as it is interoperable across multiple languages, offers a compact and fast binary format, supports dynamic schema discovery and schema evolution, and is compressible and splittable. It also offers complex data structures like nested types.

Avro was created by Doug Cutting, creator of Hadoop, and its first stable release was in 2009. It is often compared to other popular serialization frameworks such as Protobuff and Thrift.

Avro relies on schemas. When Avro data is read, the schema used when writing is also present. This permits each datum to be written with no per-value overheads, making serialization both fast and with smaller file sizes.

Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. However, for debugging and web-based applications, the JSON encoding may sometimes be appropriate.

Row-based
Direct mapping from/to JSON
Interoperability: can serialize into Avro/Binary or Avro/Json
Provides rich data structures
Map keys can only be strings (could be seen as a limitation)
Compact binary form
Extensible schema language
Untagged data
Bindings for a wide variety of programming languages
Dynamic typing
Provides a remote procedure call
Supports block compression
Avro files are splittable
Best compatibility for evolving data schemas

Overview Characteristics Structure

Apache Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries.

Parquet came out of a collaboration between Twitter and Cloudera in 2013 and it uses the record shredding and assembly algorithm described in the Dremel paper.

Parquet is good for queries scanning particular columns within a table, for example querying “wide” tables with many columns or performing aggregation operations like AVG() for the values of a single column.

Each data file contains the values for a set of rows (“the row group”). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column.

Column-oriented
Efficient in terms of disk I/O and memory utilization
Efficiently encoding of nested structures and sparsely populated data.
Provides extensible support for per-column encodings.
Provides extensibility of storing multiple types of data in column data.
Offers better write performance by storing metadata at the end of the file.
Records in columns are homogeneous so it’s easier to apply encoding schemes.
Parquet supports Avro files via object model converters that map an external object model to Parquet’s internal data types

Overview Characteristics Structure

Apache ORC (Optimized Row Columnar) was initially part of the Stinger intiative to speed up Apache Hive, and then in 2015 it became an Apache top-level project.

The ORC file format is columnar type format that provides a highly efficient way to store Hive data.

It was created in 2013 by Hortonworks to optimize existing RCFiles in collaboration with Microsoft. Nowadays, the Stinger initiative heads the ORC file format development to replace RCFiles

The main goals of the ORC are to improve query speed and to improve storage efficiency.

ORC stores collections of rows in one file and within the collection the row data is stored in a columnar format. This allows parallel procession of row collections across a cluster.

Column-oriented
Lightweight indexes stored within the file
Ability to skip row groups that don’t pass predicate filtering
Block-mode compression based on data type
Includes basic statistics (min, max, sum, count) on columns
Concurrent reads of the same file using separate RecordReaders
Ability to split files without scanning for markers
Metadata is stored using protobuf, which allows schema evolution by supporting addition and removal of fields

Format Wars:
From VHS and Beta to Avro and Parquet

Overview on Data Formats

How to choose a Data Format?

Data Formats to the Test

Background

The Setup

Narrow Dataset

Wide Dataset

Large Dataset

Write Results

Narrow Dataset

Wide Dataset

Large Dataset

Narrow Dataset

Wide Dataset

Read Results

Map Reduce: Narrow Dataset

Map Reduce: Wide Dataset

Tez: Narrow Dataset

Tez: Wide Dataset

Narrow Dataset

Wide Dataset

Large Dataset

Narrow Dataset

Wide Dataset

Narrow Dataset

Wide Dataset

Narrow Dataset

Wide Dataset

Large Dataset

Team

Stephen O’Sullivan

Silvia Oliveros