In a previous post, I discussed why people trying to stand up new data infrastructure in any organization, regardless of industry, should be cognizant of their development best practices. I called out three concepts that would guide your thinking: automation, validation, and instrumentation.
While there is a good amount of literature (and tooling) out there related to these concepts for software development in general, it’s worthwhile to understand how data systems differ from most software. In this post, I will talk about the four characteristics that differentiate data infrastructure development from traditional development, and highlight key issues to look out for.
We’ll get this one out of the way: yes, it’s “Big” Data. The ramifications of scale on the design of data-intensive systems and applications are typically known ahead of time and (hopefully) accounted for. But let me ask you this: are you really going to kick off a full test suite for a batch process that takes 10 hours to complete every time you make a change? The sheer size of the data we are dealing with often prohibits “ideal” continuous delivery practice. It’s important to understand where and when to decompose problems into testable units and where and when to run full integration tests.
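To make the idea of a testable unit concrete, here is a minimal sketch in Python. The record fields and the function names `normalize_record` and `run_batch` are hypothetical, chosen only for illustration. The point is that the per-record logic lives in a pure function you can test in milliseconds on a handful of rows, while the full-scale run is left to less frequent integration tests.

```python
from typing import Iterable, Iterator


def normalize_record(record: dict) -> dict:
    """Pure per-record transformation: no I/O, so it is cheap to unit test."""
    return {
        "user_id": int(record["user_id"]),
        "country": record["country"].strip().upper(),
        "amount_cents": round(float(record["amount"]) * 100),
    }


def run_batch(records: Iterable[dict]) -> Iterator[dict]:
    """Thin batch wrapper around the unit; the real job would read from storage."""
    for record in records:
        yield normalize_record(record)


# A tiny hand-written sample stands in for the 10-hour input.
sample = [{"user_id": "42", "country": " us ", "amount": "19.99"}]
result = list(run_batch(sample))
```

A test against `result` runs on every change, and the expensive end-to-end run can be reserved for a nightly or pre-release schedule.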
Data re-architecture efforts are often focused around standing up an enterprise data platform. What follows are multiple consumers, multiple sources, multiple pipelines, multiple languages, multiple tools, etc. This heterogeneity issue breaks down into two classes:
- Organizational: You’re building a platform. Congratulations! Get ready for everyone to be mad at you. In all seriousness, providing services to many different consumers across many different business functions is a tall order. The person responsible for the platform needs to prioritize the competing needs of stakeholders to provide the best aggregate result, and to build a technological roadmap that satisfies those needs in the short term while maintaining a long-term view of feature sets and capabilities. In other words, the owner needs to take a product-like view of the platform.
- Technological: The cross-functional nature of these efforts means that technological heterogeneity is pervasive at every level of the stack. This manifests itself in infrastructure, languages, tooling, and required skills. While not inherently a problem, heterogeneity means that the development and operational architecture supporting the platform must address all the use cases. In addition, a platform strategy means that failures at the platform level affect the entire organization and can be costly. Having appropriate development and testing practices is imperative when the impact can be enterprise-wide.
As the Big Data ecosystem has matured, two trends have emerged to make data development more accessible: 1) execution frameworks and higher-level query engines that abstract away some of the specifics of distributed programming, and 2) commercially backed platforms that integrate multiple distributed subsystems such as databases, processing frameworks, and tools to target a wide range of workloads. While these two trends have made development easier, they bring issues of their own that you still need to account for.
To begin with, we’re still talking about distributed systems. They are notoriously difficult to reason about, and even though these frameworks and technologies have abstracted away some of the more difficult aspects, data engineers still need to understand the underlying fundamentals of how these systems are implemented and the ramifications for the solution, physical infrastructure, debugging, operations, and instrumentation. Even in a local environment, developers must cover their bases by testing the failure modes that only emerge with distributed or concurrent execution. That takes deliberate effort, because things like partial failures, network issues, and just plain having too much data to fit on one machine are hard to simulate locally.
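One way to exercise a failure mode locally is to inject it on purpose. The sketch below is only an illustration under my own assumptions: `FlakySource` and `with_retries` are hypothetical names, not part of any framework, and a real pipeline would see far messier failures than a simulated `ConnectionError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FlakySource:
    """Hypothetical stand-in for a remote source that fails a few times first."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self) -> list:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated network failure")
        return [1, 2, 3]


def with_retries(operation: Callable[[], T], attempts: int = 3,
                 delay_seconds: float = 0.0) -> T:
    """Call operation until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")


source = FlakySource(failures_before_success=2)
rows = with_retries(source.fetch, attempts=3)
```

This does not replace testing on a real cluster, but it lets you check on a laptop that the retry path behaves the way you expect, including the case where the retries run out.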
In addition, while the need for programming complexity has been lessened, the overall complexity has shifted to operations and maintenance. These frameworks and platforms are often integrations of several distributed systems. Application layering and integration mean there is a preponderance of overlapping, dependent configurations. While configuration management is a basic tenet of any deployment/development process, under these circumstances it becomes even more important to establish versionable, validated, and automated configuration management.
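As a rough sketch of what “validated” configuration can mean, the Python snippet below checks two cross-setting constraints before anything is deployed. The keys (`executor_memory_mb`, `container_memory_mb`, `replication_factor`, `node_count`) are hypothetical and do not correspond to any particular platform’s settings; the idea is simply that dependent values are checked together, from a file kept under version control, as an automated step.

```python
from typing import List

# Hypothetical keys, used here only to illustrate dependent settings.
REQUIRED_INT_KEYS = (
    "executor_memory_mb",
    "container_memory_mb",
    "replication_factor",
    "node_count",
)


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [
        f"{key} must be an integer"
        for key in REQUIRED_INT_KEYS
        if not isinstance(config.get(key), int)
    ]
    if errors:
        return errors  # skip cross-checks until the basic types are right

    # Cross-setting checks: each value is fine alone but invalid in combination.
    if config["executor_memory_mb"] > config["container_memory_mb"]:
        errors.append("executor_memory_mb exceeds container_memory_mb")
    if config["replication_factor"] > config["node_count"]:
        errors.append("replication_factor exceeds node_count")
    return errors


good = {"executor_memory_mb": 4096, "container_memory_mb": 8192,
        "replication_factor": 3, "node_count": 5}
bad = dict(good, executor_memory_mb=16384)
```

Running a check like this in the deployment pipeline turns a late-night operational surprise into a failed build.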
Data infrastructure applications are unique. I don’t mean to say you won’t need to build traditional applications, but data infrastructure often puts a heavy emphasis on data integration and pipelining efforts versus traditional application development. Furthermore, there’s a strong emphasis on validation. When it comes to data ingestion and data science applications, you can’t trust your inputs. Or rather, you shouldn’t trust your inputs without validation. Modern data infrastructure enables a number of use cases that are prone to data instability, such as signal instability for model execution and schema evolution for data integration pipelines. Key development practices are those that account for metadata validation and allow you to roll back deployments of entire pipelines.
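To illustrate not trusting your inputs, here is a small sketch of a schema check that could run at the front of an ingestion pipeline. The schema and field names are made up for the example, and a production system would more likely rely on a schema registry or a validation library than on a hand-rolled function like this one.

```python
from typing import Dict, List

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA: Dict[str, type] = {
    "event_id": str,
    "timestamp": str,
    "value": float,
}


def schema_violations(record: dict, schema: Dict[str, type]) -> List[str]:
    """Compare one record against the schema and list every mismatch."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Unexpected fields are often the first visible sign of schema evolution.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


good_record = {"event_id": "e1", "timestamp": "2020-01-01T00:00:00Z", "value": 1.5}
drifted_record = {"event_id": "e2", "timestamp": "2020-01-01T00:00:00Z",
                  "value": "1.5", "source": "mobile"}
```

Records that fail the check can be routed to a quarantine location instead of flowing downstream, which makes a bad upstream change much easier to spot and roll back.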
This post presented an overview of the unique characteristics of modern data systems. In the next post, I’ll talk about specific development capabilities and practices that will help ensure the success of data infrastructure development and operations, as well as mitigate some common issues.