Small files can be a pain when working with Hadoop installations. When using HDFS, they can cause contention and significant memory utilization on the NameNode, which has to keep track of the metadata for every file. While dealing with small files is a common problem across all Hadoop distributions, MapR-FS has unique difficulties.
In this post, we will discuss how dealing with small files is different if you are using MapR-FS rather than the traditional HDFS installation.
Working with MapR-FS
MapR-FS is a ground-up rewrite of the Java-based HDFS in C/C++. It focuses on providing a POSIX-based file system interface with an emphasis on high performance and availability, and includes the ability to mount the filesystem directly via the standard NFS protocol. From here on out we will be talking specifically about MapR-FS and using MapR-FS-specific terminology.
A volume is a logical unit used to organize data into groups, to manage your data and apply policy all at once instead of file by file. The volume structure defines how data is distributed across the nodes in your cluster. Each MapR-FS volume that is created is associated with a name container.
The name container holds the first 64KB of any file (including metadata and file data) and is replicated by default across three nodes. Files smaller than the 64KB threshold live only in the name container. It is useful to keep track of the size of the name container, because certain operations can become inefficient when it gets large. Per the MapR API documentation, it is advisable to use a different volume if the size of the name container reaches 64 GB.
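To get a rough feel for how much data a directory tree would place in a name container, here is a minimal Python sketch. It assumes the volume is reachable as an ordinary directory (for example through the NFS mount mentioned above) and applies the 64KB rule literally: each file contributes its first 64KB. The function name and return keys are made up for illustration, and the estimate ignores metadata overhead, so treat the result as an approximation rather than what MapR itself would report.

```python
import os

# First 64KB of every file lives in the name container (per the description above).
NAME_CONTAINER_THRESHOLD = 64 * 1024


def estimate_name_container_usage(root):
    """Walk `root` and estimate the file data that would land in the name container.

    Hypothetical helper: it counts only file contents, not metadata.
    """
    total_files = 0
    small_files = 0
    estimated_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_files += 1
            if size < NAME_CONTAINER_THRESHOLD:
                small_files += 1  # this file lives entirely in the name container
            # Each file contributes at most its first 64KB.
            estimated_bytes += min(size, NAME_CONTAINER_THRESHOLD)
    return {
        "total_files": total_files,
        "small_files": small_files,
        "estimated_bytes": estimated_bytes,
    }
```

Running this against the mount point of a volume and comparing `small_files` to `total_files` gives a quick indication of whether that volume is dominated by small files.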
In a perfect world, you are able to modify your data ingestion and processing pipelines to minimize the creation of small files, but unfortunately we don’t always live in a perfect world.
There are, however, a number of actions that can alleviate bottlenecks.
Here are three potential solutions to dealing with an over-abundance of small files in MapR-FS:
- Coalesce small files into larger files. SVDS has performed this for many clients using everything from third-party utilities such as filecrush, to buffering writes within Kafka, to using Hive and Impala to read and re-process the underlying data.
- Map out your MapR-FS volumes to logically deal with the metadata and processing overhead that small files create. Creating new volumes can spread out the load of small files.
- In a worst-case scenario, a quick fix can be to increase the replication factor of the name container files. Note: The maximum replication factor for a name container is six. This would decrease any file locality issues by distributing the small files in the name container across more nodes, but comes at a significant cost in cluster storage utilization.
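As an illustration of the first option, here is a minimal Python sketch of coalescing. It is not how filecrush or any of the tools mentioned above work internally. It simply concatenates small files from one directory into larger `part-NNNNN` files in another, and it assumes the data is safe to concatenate byte for byte (for example, newline-delimited text). The function name, the output naming, and the 128 MB default target are assumptions chosen for the example. Source files are left in place, and the destination directory should be different from the source directory.

```python
import os
import shutil


def coalesce_small_files(src_dir, dst_dir, small_threshold=64 * 1024,
                         target_bytes=128 * 1024 * 1024):
    """Concatenate files smaller than `small_threshold` into larger output files.

    Hypothetical helper: returns the list of output paths it created.
    """
    os.makedirs(dst_dir, exist_ok=True)
    outputs = []
    out = None
    written = 0
    try:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories and other non-files
            size = os.path.getsize(path)
            if size >= small_threshold:
                continue  # already large enough; leave it alone
            # Start a new output file when none is open, or when adding this
            # file would push a non-empty output past the target size.
            if out is None or (written > 0 and written + size > target_bytes):
                if out is not None:
                    out.close()
                out_path = os.path.join(dst_dir, "part-%05d" % len(outputs))
                out = open(out_path, "wb")
                outputs.append(out_path)
                written = 0
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            written += size
    finally:
        if out is not None:
            out.close()
    return outputs
```

Formats with headers or footers (many binary and columnar formats, for instance) cannot be merged this way, which is one reason re-processing the data with Hive or Impala is sometimes the better route.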
In general, small files should be avoided in distributed filesystems due to the overhead in storing, processing, and maintaining them across multiple systems. If you can’t do that, then take a hard look at the execution flow of not only your data generation, but also your data processing pipelines. Doing so will help you determine which of the three options above is best for your situation. Do you have other tips? Share them in the comments.