
DataEngConf 2017

DataEngConf features talks and workshops aimed at bridging the gap between data scientists, data engineers, and data analysts. We’ll be there, giving tips on choosing the right format for your data. You can also check out our R&D project on the topic.

Friday, April 28

Format Wars: from VHS and Beta to Avro and Parquet


You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) HDFS or Block/Object Store? Which query engine? This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.

Picking your distribution and platform is just the first of many decisions you need to make to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each data format has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, no single format is always better than the others. On top of the choice of format comes the question of which query engine works best for that format and workload. And let’s not forget the question: “Do I store that in HDFS or a block/object store?”

This talk will take a closer look at some of these trade-offs. Attendees will learn, based on a few real-world use cases, the How, Why, and When of choosing one format over another, and how the choice of query engine affects that decision. Covering the four major data formats (Plain Text, Sequence Files, Avro, and Parquet), we will provide insight into what they are and how best to use and store them in HDFS or a block/object store.
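As a rough illustration of why layout matters, the sketch below contrasts a row-oriented text layout (as in Plain Text files) with a column-oriented one (the general idea behind Parquet) when a query needs only one field. It is a toy model using only the Python standard library, not how any of these formats is actually implemented: the sample records, field names, and helper functions are all hypothetical, and real Parquet files add encoding, compression, row groups, and statistics on top of this idea.

```python
import csv
import io

# Hypothetical sample records, invented purely for illustration.
FIELDS = ["user_id", "country", "amount"]
ROWS = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "DE", "amount": 25.5},
    {"user_id": 3, "country": "US", "amount": 4.5},
]


def to_row_text(rows):
    """Row-oriented layout: one CSV line per record, like a plain text file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_columns(rows):
    """Column-oriented layout: one contiguous list of values per field."""
    return {name: [row[name] for row in rows] for name in FIELDS}


def sum_amount_from_text(text):
    """Sum one field from row text; every field of every line gets parsed."""
    total = 0.0
    fields_parsed = 0
    for record in csv.DictReader(io.StringIO(text)):
        fields_parsed += len(record)
        total += float(record["amount"])
    return total, fields_parsed


def sum_amount_from_columns(columns):
    """Sum one field from columnar data; only that column is touched."""
    values = columns["amount"]
    return sum(values), len(values)


if __name__ == "__main__":
    text_total, text_touched = sum_amount_from_text(to_row_text(ROWS))
    col_total, col_touched = sum_amount_from_columns(to_columns(ROWS))
    print(f"row text: total={text_total}, values touched={text_touched}")
    print(f"columnar: total={col_total}, values touched={col_touched}")
```

Both layouts give the same answer, but the row layout touches every field of every record while the columnar layout touches only the one column the query needs. The reverse also holds: writing or reading whole records one at a time is simpler in a row layout, which is one reason no single format wins for every workload.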