Rather than thinking about free format text files, in the Hadoop ecosystem we are used to thinking about delimited files such as csv, and tsv. We can also think about json records as long as each line is its own jason datum.
Text files are a convenient format to use to exchange data between Hadoop and external applications and systems that produce and/or read delimited text files.
Apache Hadoops’s SequenceFile provides a persistent data structure for binary key-value pairs. Hadoop provides with instances and methods so that we can write key-value pairs in SequenceFiles.
There are 3 different formats for SequenceFiles depending on the Compression Type specified:
The SequenceFile is the base data structure for the other types of files like MapFile, SetFile, ArrayFile, and BloomMapFile.
Some of the common use cases include: the transferring of data between Map Reduce jobs and as an archive/container to pack small Hadoop files where the metadata(filename, path, creation time) is stored as the key and the file contents are stored as the value.
Apache Avro is widely used as a serialization platform, as it is interoperable across multiple languages, offers a compact and fast binary format, supports dynamic schema discovery and schema evolution, and is compressible and splittable. It also offers complex data structures like nested types.
Avro was created by Doug Cutting, creator of Hadoop, and its first stable release was in 2009. It is often compared to other popular serialization frameworks such as Protobuff and Thrift.
Avro relies on schemas. When Avro data is read, the schema used when writing is also present. This permits each datum to be written with no per-value overheads, making serialization both fast and with smaller file sizes.
Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. However, for debugging and web-based applications, the JSON encoding may sometimes be appropriate.
Apache Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries.
Parquet came out of a collaboration between Twitter and Cloudera in 2013 and it uses the record shredding and assembly algorithm described in the Dremel paper.
Parquet is good for queries scanning particular columns within a table, for example querying “wide” tables with many columns or performing aggregation operations like AVG() for the values of a single column.
Each data file contains the values for a set of rows (“the row group”). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column.
Apache ORC (Optimized Row Columnar) was initially part of the Stinger intiative to speed up Apache Hive, and then in 2015 it became an Apache top-level project.
The ORC file format is columnar type format that provides a highly efficient way to store Hive data.
It was created in 2013 by Hortonworks to optimize existing RCFiles in collaboration with Microsoft. Nowadays, the Stinger initiative heads the ORC file format development to replace RCFiles
The main goals of the ORC are to improve query speed and to improve storage efficiency.
ORC stores collections of rows in one file and within the collection the row data is stored in a columnar format. This allows parallel procession of row collections across a cluster.
Choosing a data format is not always black and white, it will depend on several characteristics including:
We discuss the different considerations in more detail in our blogpost.
We set out to create some tests so we can compare the different data formats in terms of speed to write and speed to read a file. You can recreate the tests in your own system with the code used in this blogpost which can be found here: SVDS Data Formats Repository.
Disclaimer: We didn’t design these tests to act as benchmarks. In fact, all of our tests ran in different platforms with different cluster sizes. Therefore, you might experience faster or slower times depending on your own system’s configuration. The main takeaway is the difference of the different data formats performance within the same system.
We generated 3 different datasets to run the tests:
10 million rows
Resembles an Apache server log with 10 columns of information
4 million rows
Includes 15 columns of information and the rest of the columns resemble choices. The overall dataset might be something that keeps track of consumer choices or behaviors
1 TB of data
Follows the same format as the wide dataset
We performed tests in Hortonworks, Cloudera, Altiscale, and Amazon EMR distributions of Hadoop.
For the writing tests we measured how long it took Hive to write a new table in the specified data format.
For the reading tests we used Hive and Impala to perform queries and record the execution time each of the queries.
We used snappy compression for most of the data formats, with the exception of Avro where we additionally used the deflate compression.
The queries ran to measure read speed, were in the form of:
SELECT COUNT(*) FROM TABLE WHERE …
Query 1 includes no additional conditions.
Query 2 includes 5 conditions.
Query 3 includes 10 conditions.
Query 4 includes 20 conditions.