In a previous post, my colleague Mark Mims discussed a variety of data pipeline designs using Spark and Kafka. In this post, we’ll dig a little deeper into making those pipelines manageable. It’s best if you read Mark’s post first, to have the proper foundation for this one. To get the most out of this post, you should also be familiar with Spark and Kafka architectures.
Mark noted that you shouldn’t build a custom data pipeline if you can solve your problem with database queries against data at rest; I’d like to reiterate that. Custom data pipelines carry a development cost, and they require extensive management after they are built. Here, we talk about aspects that should be designed into your development effort to make that management possible and efficient, if you absolutely must take it on.
The DevOps movement and modern tools have made developers aware of build and deployment issues, but operationalizing your software doesn’t end there. There are things we can do in our designs and code to help with the operations concerns that happen after provisioning, build, and deployment. Management of your Kafka/Spark data pipeline will involve tuning and retuning as it hits production volumes as well as changing conditions.
A common ingest pattern is an architecture with Spark Streaming reading out of source Kafka queues, and then Spark Streaming filtering the records and shunting them off to their destination Kafka queues, followed by Spark Streaming jobs that write from those queues to persistence (and in other cases to services). The multiple sets of queues separated by processing form an impedance matching network in which each layer is tuned so that data flows and is not bottled up at any layer. All the layers are horizontally scalable, so it is just a matter of getting each layer to the right size.
Kafka/Spark data pipelines are custom code based on some very stripped down system software, with all the dials available to you so that you can build exactly what you need. Building the system that matches the various impedances is not done automatically, but is something that has to be designed into your software.
Let’s start by understanding what the dials are in each of our components.
Kafka tuning knobs
Kafka is not the Ferrari of messaging middleware; rather, it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.
Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log, and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement through an optional callback once the message is committed. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. Kafka does not push the message on to a receiver; it only does pub-sub. It also does not handle back pressure for you.
The core is also more complex than other middleware because it is distributed. This allows it to scale horizontally and also, through replication, it can provide high availability guarantees. To do this, Kafka requires the application writer to decide on the number of partitions to make for a topic. Additionally, the producer has to define the partitioning code.
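To make the idea of producer-side partitioning code concrete, here is a minimal Python sketch that maps a record key to a partition with a stable hash modulo the partition count. The function name `choose_partition` and the choice of CRC32 are assumptions made for illustration. Kafka’s own default partitioner uses a different hash (murmur2), so this sketch will not reproduce its assignments.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 gives the same value in every process, unlike Python's
    # built-in hash(), which is randomized per process for bytes and str.
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    # Hypothetical keys and partition count, for illustration only.
    for key in (b"user-1", b"user-2", b"user-3"):
        print(key, "->", choose_partition(key, 6))
```

Whatever function you use, records with the same key always land on the same partition, so a badly chosen key can concentrate traffic on a few partitions.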
Unlike most messaging middleware, consumers have to manage their own place in the queue, as well as manage partitions. This makes the consumer side of Kafka more complicated, but is addressed in the case of Spark’s consumer. In Spark, there is a direct API that automatically maps Spark partitions to Kafka partitions for a topic. With a bit of extra work, it provides exactly-once semantics.
The decoupling of publishing and consuming (via subscription) means that consumers need to get to the published data before it stops being held in the queue. One of the interesting aspects of Kafka is that it is designed, and generally used, as a stream data store, not just as a queue. On the other hand, it does have a retention policy, so you have to be aware of it: the default broker retention is seven days. Retention time and ingest rate together determine the storage capacity required by the brokers (multiplied by the replication factor). Brokers have a default retention value, but you can also override it at the topic level.
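The capacity relationship above can be sketched as simple arithmetic. This is a rough estimate only: the function name and the example figures are hypothetical, and the calculation ignores compression, index files, and any headroom you would want in practice.

```python
def broker_storage_bytes(ingest_bytes_per_sec: float,
                         retention_days: float,
                         replication_factor: int) -> float:
    """Estimate total bytes held across all brokers for one topic."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Every byte ingested is kept for the retention period on each replica.
    return ingest_bytes_per_sec * retention_seconds * replication_factor


if __name__ == "__main__":
    # Hypothetical workload: 1 MB/s ingest, 7 days retention, 3 replicas.
    total = broker_storage_bytes(1_000_000, 7, 3)
    print(f"{total / 1e12:.2f} TB")  # prints "1.81 TB"
```

Running the same calculation per topic makes it easier to see which topic-level retention overrides dominate the disk budget.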
To address the issues described above, here is a list of Kafka configuration options and functions that your program should strive to surface to the operations teams that will be managing your application. While these are standard configuration options, as a developer with an eye toward making your program more manageable, you can highlight how these aspects affect the performance of your app. You can also make it clear, through documentation and run books, what choices have been made in your app on these items:
- Number of Partitions Per Topic.
- Measures of how well the Partitioning code is spreading records.
- Broker default retention and topic retention for capacity planning and read access patterns (for example, if there are readers that do weekly reports).
- Feedback on acks from producer to implement backpressure and handle dropped messages.
- The KafkaProducer also provides several settings that affect the performance of send:
  - acks: sets when the broker acknowledges receipt of the message sent.
  - retries: sets how many times the producer retries the send for recoverable errors, such as when the brokers are available but a leader election is in progress.
  - batch.size: the maximum size, in bytes, of a batch of records sent to a partition in a single request.
  - linger.ms: how long to wait for more records when a batch has not yet reached batch.size.
  - buffer.memory: limit on the size of the producer’s buffer for records waiting to be sent. If this is exhausted, send will block in the producer.
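One possible measure of how well the partitioning code spreads records is the ratio of the busiest partition’s record count to the mean count. The sketch below assumes you already collect per-partition counts somewhere (for example, from producer acknowledgements); the function name `partition_skew` is hypothetical.

```python
from typing import Sequence


def partition_skew(counts: Sequence[int]) -> float:
    """Ratio of the busiest partition's count to the mean count.

    1.0 means a perfectly even spread; larger values mean hotter partitions.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts)
    if total == 0:
        return 1.0  # nothing written yet, so treat the spread as even
    mean = total / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    print(partition_skew([100, 100, 100, 100]))  # prints 1.0
    print(partition_skew([400, 0, 0, 0]))        # prints 4.0
```

A value that drifts upward over time is a hint that the partitioning key no longer matches the shape of the data.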
Spark Streaming tuning knobs
The key thing about Spark Streaming is that it is built on the same lazy operations on distributed collections as Spark. In Spark Streaming, the collections are micro batches that are emitted from the source at a time interval.
You set the batch duration when setting up the StreamingContext, and then you create a DStream using the direct API for Kafka. You can keep this DStream from overwhelming your Spark Streaming processing by setting spark.streaming.kafka.maxRatePerPartition, which caps the number of records read per second from each Kafka partition. The number of Kafka topic partitions determines the number of Spark tasks.
Spark Streaming consumer tuning knobs include:
- The number of partitions Kafka sets for this topic: determines how many tasks the Spark Streaming direct API uses to process this topic.
- The StreamingContext’s batchDuration: determines how often a micro batch is sent for processing.
- spark.streaming.kafka.maxRatePerPartition times the batchDuration in seconds: determines the largest micro batch sent to a Spark task.
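The relationship between these three knobs can be worked out with a few lines of arithmetic. The function names and the example values below are hypothetical; the sketch only restates the multiplication described in the list above.

```python
def max_records_per_task(max_rate_per_partition: int,
                         batch_duration_sec: float) -> int:
    """Upper bound on records one Spark task receives in one micro batch."""
    return int(max_rate_per_partition * batch_duration_sec)


def max_records_per_batch(max_rate_per_partition: int,
                          batch_duration_sec: float,
                          num_partitions: int) -> int:
    """Upper bound on records in a whole micro batch (one task per partition)."""
    return max_records_per_task(max_rate_per_partition,
                                batch_duration_sec) * num_partitions


if __name__ == "__main__":
    # Hypothetical settings: 1000 records/s per partition, 5 s batches, 12 partitions.
    print(max_records_per_task(1000, 5))        # prints 5000
    print(max_records_per_batch(1000, 5, 12))   # prints 60000
```

Comparing this ceiling with what a task can actually process within one batch interval tells you whether the rate limit, the batch duration, or the partition count is the knob to turn.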
Spark Streaming performs some processing and then writes to a Kafka topic, from which the data is written to persistence, whether that is files or a database. This processing can include:
- Classifying ingested events by event type
- Decorating events with reference info
- Filtering out bad event data
Classifying and filtering involve categorizing events and then doing specific processing for subsets of the categorized events. You really have two strategies for this subset processing. One strategy is to decorate the event with a category and handle all of the stream’s categories in your consumer. The advantage is that you don’t increase the number of queues to manage. On the other hand, it makes the consumer more complex, and it mixes event types that need more processing with ones that need less.
The other option is to split out the events into multiple queues. For each of these queues you are going to have to choose the number of partitions based on the volume going to that stream and the level of parallelism you want on the Spark consumer side. Here is a code sample of writing out data from an RDD to a Kafka topic.
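The following is a minimal Python sketch of one common way to write an RDD out to a Kafka topic, not a reproduction of any particular implementation: create one producer per RDD partition and send each record through it. The function name `write_partition` and the `make_producer` factory are hypothetical; the producer is only assumed to offer `send(topic, value)`, `flush()`, and `close()`, as the kafka-python `KafkaProducer` does.

```python
from typing import Callable, Iterable


def write_partition(records: Iterable[bytes],
                    make_producer: Callable[[], object],
                    topic: str) -> int:
    """Send every record of one RDD partition to `topic`; return the count sent."""
    # One producer per partition, created on the executor rather than the
    # driver, because producers hold sockets and cannot be serialized.
    producer = make_producer()
    sent = 0
    try:
        for value in records:
            producer.send(topic, value)  # asynchronous; records are batched
            sent += 1
        producer.flush()  # wait for buffered records before the task ends
    finally:
        producer.close()
    return sent
```

With PySpark and kafka-python, a call might look like `rdd.foreachPartition(lambda part: write_partition(part, lambda: KafkaProducer(bootstrap_servers="broker:9092"), "events-filtered"))`, where the broker address and topic name are placeholders.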
For our discussion on tuning, note that the partitions of these output topics are not tied to the partitioning of the source Kafka topic. The new Kafka topics on the other side of this processing will be partitioned with their readers in mind. Additionally, we now have Kafka topics bridged by Spark Streaming. The system works by having messages flow through a set of stages along a variety of possible paths.
Finally, as I mentioned earlier in the Kafka section, Kafka does not handle back pressure for you on writes and is geared toward asynchronous inserts. What you do for this will determine what you need to tune, so let’s go through the options. Here is a code sample of KafkaProducer with both sync and async calls defined.
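As a hedged Python sketch of the two calling styles, the functions below wrap a producer whose `send` returns a future with `get(timeout=...)` and `add_errback(...)`, which is the interface the kafka-python client exposes. The helper names `send_sync` and `send_async` and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("pipeline.producer")


def send_sync(producer, topic: str, value: bytes, timeout_sec: float = 10.0):
    """Block until the broker acknowledges the record, or raise on failure."""
    future = producer.send(topic, value)
    # Chaining get() onto send turns the asynchronous call into a blocking one.
    return future.get(timeout=timeout_sec)


def send_async(producer, topic: str, value: bytes, on_error=None):
    """Return immediately; failures are reported through an error callback."""
    def log_failure(exc):
        log.error("send to %s failed: %s", topic, exc)

    future = producer.send(topic, value)
    future.add_errback(on_error or log_failure)
    return future
```

The synchronous helper trades throughput for simplicity, while the asynchronous one keeps the producer’s batching effective but needs somewhere to report failures.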
If you can’t be bothered with complexity and wish you had used something simpler instead of Kafka, you can write synchronously. You do this by chaining a get() call to your send. This will show you whether you had an error, and you can take care of it right there. This should slow you down enough so that you don’t have to worry about back pressure from Kafka. Problem solved!
The opposite strategy with similar simplicity is to send it asynchronously but set the acks to 0, which means that the broker never acknowledges that it received the message. This is very fast, but you have no feedback on failed sends.
Finally, there is the asynchronous call with a callback and acks set to 1 or “all.” In this case the callback is invoked in the producer once the broker responds. It runs on the producer’s I/O thread, so slow work in the callback delays other sends; you shouldn’t do much there routinely, but this is where you can log errors and start implementing back pressure code.
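One simple way to start on back pressure is to cap the number of unacknowledged sends and make the caller wait when the cap is reached. The sketch below is an illustration under assumptions, not a prescribed design: the class name `BoundedSender` is hypothetical, and the producer’s `send` is assumed to return a future with `add_callback` and `add_errback`, as in kafka-python. The callbacks do almost nothing, in keeping with the advice to keep work off the I/O thread.

```python
import threading


class BoundedSender:
    """Caps in-flight sends; blocks the caller when the cap is reached."""

    def __init__(self, producer, max_in_flight: int):
        self._producer = producer
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.failed = 0  # count of sends that ended in an error

    def send(self, topic: str, value: bytes):
        self._slots.acquire()  # back pressure: wait for a free slot
        try:
            future = self._producer.send(topic, value)
        except Exception:
            self._slots.release()  # send never started, so free the slot
            raise
        future.add_callback(self._on_success)
        future.add_errback(self._on_error)
        return future

    def _on_success(self, _metadata):
        self._slots.release()

    def _on_error(self, _exc):
        with self._lock:
            self.failed += 1
        self._slots.release()
```

The `failed` counter is the kind of value worth surfacing to operations, since a rising failure count together with blocked senders points at the brokers rather than at the Spark job.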
To address the issues described above, as a developer with an eye toward making your program more manageable, you can instrument your code to surface the following items concerning the Spark aspects of your application:
- Measures of micro batch processing time versus micro batch interval
- Errors in writes to the output topics
- Counts of messages sent versus messages in the destination topic
- Failure levels and implemented back pressure
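The first item in the list above can be reduced to a single ratio: processing time divided by batch interval. The sketch below is one hypothetical way to compute it and raise a flag; the function names and the threshold are assumptions, and the processing times are assumed to come from whatever timing your job already records.

```python
from typing import Sequence


def batch_lag_ratio(processing_time_sec: float,
                    batch_interval_sec: float) -> float:
    """Processing time as a fraction of the batch interval (above 1.0 is trouble)."""
    if batch_interval_sec <= 0:
        raise ValueError("batch_interval_sec must be positive")
    return processing_time_sec / batch_interval_sec


def is_falling_behind(recent_processing_times: Sequence[float],
                      batch_interval_sec: float,
                      threshold: float = 1.0) -> bool:
    """True when the average recent batch takes longer than the allowed share."""
    if not recent_processing_times:
        raise ValueError("need at least one processing time")
    average = sum(recent_processing_times) / len(recent_processing_times)
    return batch_lag_ratio(average, batch_interval_sec) > threshold


if __name__ == "__main__":
    print(batch_lag_ratio(4.0, 5.0))              # prints 0.8
    print(is_falling_behind([6.0, 7.0], 5.0))     # prints True
```

When the ratio stays above 1.0, batches queue up faster than they are processed, which is the signal to revisit the rate limit, the batch duration, or the partition count.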
Finally, there are a couple of application architecture choices that we haven’t discussed (they are both worth posts of their own) but that should be considered:
- Dealing with unparsable records and continuing to process; when to throw errors, and how to deal with bad records.
- The design of the pipeline itself is a data flow graph that you have control over. Seeing how the whole thing performs together will help in any tuning exercise.
You should now have a better idea of the tuning elements you need to make readily available to the operations team for the lifetime of your data pipeline. Hopefully this has whetted your appetite for engineering aspects of developing code that is successful operationally in production. The key to this is developing a deep understanding of the internal architecture of Spark and Kafka.
Spark/Kafka data pipelines are complex applications whose value will be measured by how performant and manageable they are. They inherently involve ongoing management and adjustment to changing conditions. When designing your data pipeline, have a clear idea of what those operational parameters are and make it clear, through runbooks and management interface features, what tuning choices were made in your application and which tuning parameters are available to the operations team.
There are some great resources for both Spark and Kafka:
- The experts. Read and see everything you can from Holden Karau (Spark) and Gwen Shapira (Kafka).
- Material from the founders. Jay Kreps (Kafka) is a great writer and speaker, and on the Spark side there are many more key contributors that have put out a lot of information on the internals. Tathagata Das, Aaron Davidson, Patrick Wendell, and Matei Zaharia can take you down the Spark rabbit hole.