Do you fully understand how your systems operate? As an engineer, there is a lot you can do to aid the person who is going to manage your application in the future. In a previous post we covered how exposing the tuning knobs of the underlying technologies to operations will go a long way to making your application successful. Your application is a unique project—it’s easier for you to learn the operational aspects of the underlying technologies than for others to learn the specifics of all the applications.
Notice I said “the person who is going to manage your application in the future” and not “operations.” The line between development and operations is blurring. The importance of engineering knowledge in operations, and the importance of getting engineers to put operational thinking into their solutions, is really what is behind Amazon CTO Werner Vogels’s famous edict:
“Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software.”
Monitoring and alerting is not traditionally seen as a development concern, but if you had to manage your application in production, you would quickly want to figure it out. In this post, we will cover some of the basics of monitoring and alerting as it relates to data pipelines in general, and Kafka and Spark in particular.
What needs to be monitored?
Besides alerting on hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.
Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.
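A minimal sketch of the ID-on-the-message approach described above, in plain Python and independent of any particular messaging library (the field names `trace_id` and `created_at` are illustrative, not a standard):

```python
import json
import time
import uuid

def stamp_message(payload):
    """Attach a work-unit ID and an origin timestamp so every stage can log it."""
    return {
        "trace_id": str(uuid.uuid4()),   # the ID each component echoes in its logs
        "created_at": time.time(),
        "payload": payload,
    }

def log_stage(message, stage):
    """Each pipeline stage emits a metric line keyed by the trace ID."""
    elapsed = time.time() - message["created_at"]
    return json.dumps({"trace_id": message["trace_id"],
                       "stage": stage,
                       "latency_s": round(elapsed, 3)})

msg = stamp_message({"order": 42})
print(log_stage(msg, "ingest"))   # one line per stage; join on trace_id downstream
```

Because every stage logs the same `trace_id`, latency for one message across the whole pipeline can be reconstructed by grouping log lines on that ID.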
Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.
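One common proxy for traffic in a Kafka pipeline is consumer lag: the gap between the latest offset produced and the offset the consumer group has committed. A sketch of the arithmetic (the offset numbers are made up; in practice they come from Kafka’s consumer-group tooling):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition: messages produced but not yet consumed.
    Rising lag means incoming traffic is outpacing consumption."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

lag = consumer_lag({0: 1500, 1: 980}, {0: 1400, 1: 980})
print(lag)  # {0: 100, 1: 0}
```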
Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out on the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline, but need to be fed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.
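The error-capture logic described above might look like the following sketch: tolerated errors are counted by category and periodically flushed as a structured event. The class and event names are hypothetical; in production the flush would go to your metrics sink (statsd, Graphite, etc.) rather than a string:

```python
import json

class ErrorEmitter:
    """Count tolerated errors so they still surface in the monitoring system."""
    def __init__(self):
        self.counts = {}

    def record(self, category):
        # Called wherever the pipeline corrects or skips a bad record.
        self.counts[category] = self.counts.get(category, 0) + 1

    def flush(self):
        # Emit one structured event the monitoring system can capture.
        return json.dumps({"event": "pipeline_errors", "counts": self.counts})

emitter = ErrorEmitter()
emitter.record("bad_timestamp")   # corrected in-pipeline, but still reported
emitter.record("bad_timestamp")
print(emitter.flush())
```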
Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.
In each of these cases, imagine an operations person trying to determine what these are for your app, from only the deployment information. As a developer of the data pipe, you have unique information about the functioning of the application, and also have an opportunity to produce and expose this key information in your code.
Kafka and Spark monitoring
On the Kafka side, per-broker dials and status checks aren’t enough for a pipeline; we need end to end visibility. You owe it to yourself to look at the Kafka Control Center. This tool adds monitors at each producer and consumer, and gives end to end metrics explorable by time frame, topic, partition, consumer, and producer. If you are not getting the enterprise license, study the Kafka Control Center feature set and replicate what you need in your custom efforts.
Another framework to look into for monitoring Kafka is the Kafka Monitor, which provides a framework for building end to end tests. You will want tests that exercise the elements listed in the paragraph above.
Spark and Kafka can both be monitored via JMX, Graphite, and/or Ganglia. JMX makes sense if you are integrating with an existing system. Graphite excels at ad hoc dashboards, and Ganglia is a solid monitoring infrastructure with deep roots in the Hadoop ecosystem. These systems will give you a good feel for overall throughput and the efficiency of reads and writes, and provide the values needed to match batch duration against the amount of compute per batch.
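As one concrete example, Spark can push its metrics to Graphite through its metrics configuration file. A sketch of a `metrics.properties` fragment is below; the sink class is Spark’s bundled `GraphiteSink`, but the host, port, and prefix are placeholders you would replace, and the exact property names are worth checking against your Spark version’s monitoring documentation:

```properties
# conf/metrics.properties — send all Spark instances' metrics to Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.internal
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=datapipeline.spark
```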
Finally, if you really want a picture of the data graph as a whole, you can implement distributed tracing. Open Zipkin is an implementation of the Google Dapper paper and is a whole other level of tracing. With distributed tracing, your monitoring/tracing system pulls your event stream together into a single trace: a tree of every event that happened to a packet of data as it moved through the system, with information from every node it touched along the way.
While this provides all the information you might need about a single message, it produces a lot of data, so Dapper-based systems rely on sampling. Since not every event is sampled, tracing can only be used effectively for persistent problems.
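The sampling decision in Dapper-style systems is typically made once per trace, at the head, so that every node keeps or drops the same trace consistently. A minimal sketch of that idea (the hashing scheme is illustrative, not Zipkin’s actual implementation):

```python
import hashlib

def should_sample(trace_id, rate=0.01):
    """Head-based sampling: decide deterministically from the trace ID,
    so every service in the pipeline keeps or drops the same trace."""
    h = int(hashlib.md5(trace_id.encode()).hexdigest(), 16)
    return (h % 10000) < rate * 10000

print(should_sample("trace-abc"))  # same answer on every node for this trace
```

Because the decision is a pure function of the trace ID, no coordination between nodes is needed to assemble complete traces for the sampled subset.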
Alerting for data pipelines
Alerting systems should alert at the business service level (“customer data is not getting to the database”), not at the error event level (“process on node 3 failed”).
A business service is made up of one or more jobs; those jobs are composed of one or more instances of code running on some infrastructure. When errors start happening on various instances of code, or some key computers go down, an avalanche of error events starts accumulating. These events need to be easy to find when the conditions kick off a service failure alert. A good alerting system will record many events, but raise only one alert per service-level problem. Additionally, events should be indexable by the services they impact.
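The “many events, one alert” shape can be sketched simply: each low-level event carries the business service it impacts, and alerts are raised per service, with the raw events kept as indexed evidence. The event fields and alert wording below are hypothetical:

```python
from collections import defaultdict

def aggregate(events):
    """Collapse many low-level events into one alert per impacted service,
    keeping the raw events attached as searchable evidence."""
    by_service = defaultdict(list)
    for event in events:
        by_service[event["service"]].append(event)
    return [{"alert": f"service '{svc}' degraded", "evidence": evs}
            for svc, evs in by_service.items()]

events = [
    {"service": "customer-ingest", "msg": "process on node 3 failed"},
    {"service": "customer-ingest", "msg": "kafka partition 7 offline"},
]
alerts = aggregate(events)
print(len(alerts))  # one alert, backed by two supporting events
```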
For your pipeline to succeed in production, you must create documentation of what to look for, as well as the hierarchy of events—from instances, to jobs, to services. When an alert triggers incident management, the person handling it should have enough information to know what to look for, and then be able to quickly look for the problem. For example, they may ask themselves:
- Are any of the hosts associated with this service down?
- Is data flowing end to end? Are any queues backed up?
- Which jobs in this service are showing error events?
The systems identified in monitoring will also have alerting mechanisms. Your infrastructure team likely already has an alerting system, so find out what is being used in your organization and conform to that. Generally, these systems have simple ways for you to submit your alerts; that is the easy part. You will also need to work out what happens when someone gets alerted.
The reality of alerting without developer input
What I’ve described thus far is what would happen in an ideal world (and what you should work toward in your own systems). The reality, however, is that we often encounter alerting systems that do not take these precautions. Without the critical information about the specifics of the applications, operations builds a system based on trial and error.
Most of us are in organizations somewhere between these two extremes, and hopefully this post has encouraged you to learn more about how to instrument your system (or has validated you if you’re already doing so). Not sure where to start? Contact us to learn more about how we can help.
A particularly inspirational talk on adopting an operational development outlook is from Theo Schlossnagle: A Career in Web Operations.
Below, for ease, is a list of the links used in this post (along with a few extras).
- Site Reliability Engineering
- Kafka Control Center
- Kafka Monitor
- Open Zipkin
- JConsole will get you started on JMX, but it’s just a start.
- Prometheus is a newer monitoring and alerting system with some interesting ideas. Here is a presentation on Monitoring Kafka with Prometheus.