Building Data Systems: What Do You Need?

May 24th, 2016

In previous posts, we’ve looked at why organizations transforming themselves with data should care about good development practices, and the characteristics unique to data infrastructure development. In this post, I’m going to share what we’ve learned at SVDS over our years of helping clients build data-powered products—specifically, the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure. If you haven’t looked at the previous posts, I would encourage you to do so before reading this post, as they’ll provide a lot of context as to why we care about the capabilities discussed below. Please view this post as a guide, laid out in easily-visible bullet points for quick scanning.

A few key points before we start:

  • To reiterate what was discussed in earlier blog posts, the points discussed in the sections below are shaped by automation, validation, and instrumentation: the concepts that drive successful development architecture.
  • Much of what is covered in this post comes from continuous delivery concepts. For a more detailed overview of continuous delivery, please take a look at existing sources.
  • Whether you hold these capabilities explicitly in house or not depends on the constraints and aspirations of your organization. They can (and often do) manifest themselves as managed services, PaaS, or otherwise externally provided.
  • The implementation specifics of these capabilities will be determined by the constraints and goals unique to your organization. My hope is that the points below highlight what you should focus on.

Infrastructure

Data engineers must be as conscious of the specifics of the physical infrastructure as they are of the applications themselves. Though modern frameworks and platforms make the process of writing code faster and more accessible, the volume, velocity, and variety of modern data processing mean that abstracting away the scheduling and distribution of computation is difficult.

Put another way, engineers need to understand the mechanics of how the data will be processed, even when using frameworks and platforms. SSD vs. disk, attached storage or not, how much memory, how many cores: these are decisions that data engineers have to make in order to design the best solution for the targeted data and workloads. All of this means reducing friction between developers and infrastructure deployment is imperative. Below are some important things to remember when thinking about how to enable this:

  • Infrastructure monitoring and log aggregation are imperative as the number of nodes used increases throughout your architecture.
  • The focus should be on repeatable, automated deployments.
  • Infrastructure-as-code allows for configuration management through a toolchain similar to that used for application code, and provides consistency across environments (see the sketch after this list).
  • For a number of reasons, many organizations will not allow developers to directly provision infrastructure, and will typically have a more specialized operations/network team to handle those responsibilities. If this is the case, it's necessary to give developers a clear, direct process for requesting infrastructure deployments, along with the ability to validate the result.
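As an illustration of infrastructure-as-code, here is a minimal sketch in Python using boto3 to provision worker nodes from a versioned script rather than by hand. The AMI ID, instance type, and node count are placeholder assumptions; substitute whatever your workloads require.

    # provision_workers.py -- minimal infrastructure-as-code sketch
    # Assumes boto3 is installed and AWS credentials are already configured.
    import boto3

    # Placeholder values, not recommendations.
    AMI_ID = "ami-00000000"
    INSTANCE_TYPE = "m4.xlarge"
    WORKER_COUNT = 3

    def provision_workers():
        """Launch a fixed set of worker nodes. Because this lives in version
        control, the same script produces the same environment every run."""
        ec2 = boto3.resource("ec2")
        instances = ec2.create_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=WORKER_COUNT,
            MaxCount=WORKER_COUNT,
        )
        for instance in instances:
            print("launched", instance.id)
        return instances

    if __name__ == "__main__":
        provision_workers()

In practice this role is usually filled by dedicated tooling (Terraform, CloudFormation, Ansible, and the like), but the principle is the same: the environment is described in code that can be reviewed, versioned, and re-run consistently across environments.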

Testing/QA

As the scope under test widens in the testing sequence, testing for data infrastructure applications begins to deviate from that of traditional applications. While unit testing and basic sanity checks will look the same, the distributed nature of many data applications makes traditional testing methodologies difficult to fully replicate. Below are specific issues:

  • As with any application development, code reviews, code coverage checks, code quality checks, and unit tests are imperative.
  • Using sample data and sample schemas becomes more important, since with pipelines the data itself becomes the integration point.
  • Having as much information as possible about the data and associated metadata is critical for developers to reason fully about how to build and test the application.
  • Developers should be able to access a subset, a sample, or at worst a schema sample (in order to generate representative fake data).
  • Metadata validation is as important as data validation.
  • Duplicating environments in order to establish an appropriate code promotion process is equally important but harder: locally duplicating distributed systems takes multiple processes, and actually replicating the distributed nature of the cluster setup is tough without a lot of work. Typically, this issue can be mitigated by investing in infrastructure automation, so that the underlying platforms can be deployed rapidly for testing in multiple environments.
  • The scale of the data often prohibits complete integration tests outside of smoke tests in the production cluster; testing a 10-hour batch job end to end, for example, is not practical.
  • Performance testing will be iterative. The cost of duplicating the entire production environment is often prohibitive, so performance tuning will need to take place in something close to a prod environment. An alternative is a push-button/automated system to spin up instances just for performance tuning.
  • Further complicating performance testing is the fact that resource schedulers are often involved.
  • Running distributed applications often means multiple processes are creating logs. It’s important to enable your developers to diagnose issues by implementing log aggregators and search tools for logs.
  • Since certain issues only manifest themselves when concurrency is introduced, testing in concurrent mode should be done as early as possible. This means making sure developers are able to test concurrency in their local environments; some frameworks allow for this, e.g., using Spark's local mode with multiple threads (see the sketch after this list).
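To make the local testing points above concrete, here is a minimal sketch of a test that generates representative fake data for a simple, hypothetical event schema and runs the logic under test in Spark's local mode with multiple threads, so concurrency is exercised on a developer's workstation. The schema, the action_counts function, and the data values are illustrative assumptions, not part of any particular pipeline.

    # test_action_counts.py -- sketch of a local, concurrent Spark test
    # Assumes pyspark is installed on the developer's machine.
    import random

    from pyspark.sql import SparkSession

    def fake_events(n):
        """Generate representative fake records matching a hypothetical (user, action) schema."""
        users = ["alice", "bob", "carol"]
        actions = ["view", "click", "purchase"]
        return [(random.choice(users), random.choice(actions)) for _ in range(n)]

    def action_counts(df):
        """The hypothetical logic under test: count events per action."""
        return df.groupBy("action").count()

    def test_action_counts():
        # local[4] runs Spark with four worker threads, so concurrency issues
        # have a chance to surface before the code reaches a real cluster.
        spark = (SparkSession.builder
                 .master("local[4]")
                 .appName("action-counts-test")
                 .getOrCreate())
        try:
            df = spark.createDataFrame(fake_events(1000), ["user", "action"])
            counts = {row["action"]: row["count"] for row in action_counts(df).collect()}
            assert sum(counts.values()) == 1000
        finally:
            spark.stop()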

Build

Dependency management for distributed applications is HARD. It's necessary not only to maintain consistency across promotion environments (dev, test, QA, prod), but also across the clustered machines within each of those environments. The distributed nature of many of the base technologies, coupled with the prevalence of frameworks in the Big Data ecosystem, means that when it comes to dependency management, organizations have to decide to either 1) cede management of shared libraries to the platform (usually the operations team) and make sure that developers can maintain version parity; or 2) cede control to developers to manage their own dependencies. Some more specifics below:

  • While the polyglot nature of data infrastructure development will tempt developers toward manual packaging and manual deployment (e.g. on an edge node), packaging standards should be enforced regardless of language or runtime. Choose a packaging strategy for the set of technologies at hand and establish an automated build process.
  • Understand the impact that maintaining multiple languages and runtimes has on your build process.
  • Pipelines themselves need to be managed either with something like Oozie (in the Hadoop ecosystem) or reliably through automated scripting (e.g., driven by cron); see the sketch after this list.
  • For a traditional application you can version all configurable elements (source, scripts, libs, OS configs, patch levels, etc.), but with the current state of the technology in enterprise Big Data, multiple applications are running on a specific software stack (e.g. the CDH distro). This means the change set for an app's configurations cannot lie entirely within that app's repo. At best, the configurations are split across separate repos, but with managed stacks like CDH, configuration versioning is typically handled internally to the platform software itself.
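When a workflow manager like Oozie isn't in play, the automated scripting mentioned above should still fail loudly and log clearly rather than silently half-running. Below is a minimal sketch of a cron-driven pipeline driver; the stage commands (transform_job.py, publish_results.py, the HDFS path) are hypothetical placeholders for your actual jobs.

    # run_pipeline.py -- sketch of a pipeline driver meant to be invoked by cron
    import logging
    import subprocess
    import sys

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    # Hypothetical stages; each entry is the command for one step of the pipeline.
    STAGES = [
        ["hdfs", "dfs", "-test", "-d", "/data/incoming"],          # precondition check
        ["spark-submit", "--master", "yarn", "transform_job.py"],  # main transformation
        ["python", "publish_results.py"],                          # downstream publish step
    ]

    def run():
        for cmd in STAGES:
            logging.info("starting stage: %s", " ".join(cmd))
            result = subprocess.run(cmd)
            if result.returncode != 0:
                # Stop on the first failure so later stages never run
                # against incomplete data.
                logging.error("stage failed with exit code %d: %s",
                              result.returncode, " ".join(cmd))
                sys.exit(result.returncode)
            logging.info("finished stage: %s", " ".join(cmd))

    if __name__ == "__main__":
        run()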

Deployment

As with testing and build, automated deployments and release management processes are crucial. Below are some things to consider:

  • The use of resource managers in modern data systems means that whatever deployment process is in place must account for resource requests or capacity scheduling, as well as related feedback to the development team.
  • The cost of maintaining multiple large clusters often prohibits fully duplicating the prod environment. Therefore, you will have to deal with mismatched cluster sizes or logically separate multiple “environments” on the same cluster (see the sketch after this list).
  • Performance and capacity testing will be iterative. It’s often difficult to deterministically judge ahead of time the exact resource configuration needed to optimize performance. Therefore, even in prod the rollout will require multiple steps.
  • It’s appropriate for packaged artifacts that maintain jobs or workflows to live on edge nodes. Source code should not.
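As one way of handling logically separated environments on a shared cluster, the sketch below wraps spark-submit so that each environment maps to its own YARN queue and resource profile; the same artifact is promoted unchanged, and only the target queue and sizing differ. The queue names and resource numbers are assumptions to adapt to your scheduler configuration.

    # deploy_job.py -- sketch of an environment-aware submission wrapper
    import subprocess
    import sys

    # Hypothetical per-environment settings: YARN queue plus a resource profile.
    ENVIRONMENTS = {
        "dev":  {"queue": "dev",  "executors": "2",  "memory": "2g"},
        "test": {"queue": "test", "executors": "4",  "memory": "4g"},
        "prod": {"queue": "prod", "executors": "20", "memory": "8g"},
    }

    def submit(env, application):
        settings = ENVIRONMENTS[env]
        cmd = [
            "spark-submit",
            "--master", "yarn",
            "--queue", settings["queue"],
            "--num-executors", settings["executors"],
            "--executor-memory", settings["memory"],
            application,
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        environment, app = sys.argv[1], sys.argv[2]
        sys.exit(submit(environment, app))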

Operations

Commercial data infrastructure technologies (Hadoop distros and the like) are often complex integrations of multiple distributed systems. Operations processes must account for a multitude of configurations and monitoring of distributed processes. Some additional points:

  • Upgrade strategy (i.e., whether or not to stay on the latest supported versions of technologies) is important: in a production environment we have to verify that regression tests pass and validate that developers are in sync with the new versions, preferably through explicit dependency management automation. Complicating things, the fact that many platforms are multi-system integrations makes it even more important to test against, and stay on, supported version sets. In general, the bias should be toward the latest supported version set, since the feature set expands and evolves quickly, but the most important thing is to establish a verifiable, versionable process that can be rolled back if necessary.
  • With Hadoop distros or other frameworks, operations infrastructure often incorporates execution time reporting (e.g. Spark dashboard), in order to validate and diagnose applications.
  • Many platforms, especially commercial offerings, provide UIs for configuration management. While these are useful for getting information about the system, configurations should be managed through an automated, versionable process (a sketch follows this list).
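One minimal way to keep configuration changes out of the UI and inside a versionable process is to hold them in files in a repo and apply them through the platform's management API. The sketch below assumes a purely hypothetical REST endpoint and config file layout; real platforms (Cloudera Manager, Ambari, and so on) expose their own APIs, and the point is only that the change flows from version control rather than from manual clicks.

    # apply_config.py -- sketch: push versioned configuration through a management API
    # The endpoint, payload shape, and file path are hypothetical placeholders.
    import json

    import requests
    import yaml

    MANAGER_URL = "https://cluster-manager.example.com/api/v1/config"  # hypothetical

    def apply_config(path):
        # The YAML file lives in version control, so every change is reviewed and diffable.
        with open(path) as f:
            desired = yaml.safe_load(f)
        response = requests.put(MANAGER_URL,
                                data=json.dumps(desired),
                                headers={"Content-Type": "application/json"})
        response.raise_for_status()
        print("applied", path)

    if __name__ == "__main__":
        apply_config("config/cluster.yml")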

Runtime

Having an explicit strategy for both resource management and dependency management is critical for reconciling developer productivity, performance at scale, and operational ease. Below you’ll find some of the key points to consider:

  • The feature set of specific distributions and component versions affects functionality. Modern data infrastructure development and Big Data are emerging practices, so the feature sets of specific frameworks expand and evolve quickly, while commercially supported distributions move at a slower pace. It’s important to explicitly define and communicate the versions of software you will be using, so that developers understand the capabilities they can leverage from certain frameworks (see the sketch after this list).
  • Never begin development against a newer version of a framework or API unless you have a plan for rolling out the update in production. Since you should be deploying to production as early as possible, the two will need to happen almost concurrently.
  • Data platforms often aim for an org-wide, cross-functional audience set. This means that you will need to have a way to manage authorization and performance constraints/bounds for different groups. Resource managers and schedulers typically provide the mechanisms for doing this, but it will be up to your organization to make the decision as to who gets what (Have fun!).
  • Using resource managers will be your best bet from an operations standpoint for aligning your business needs to the performance profile of specific jobs/apps. The simplest model for this is essentially a queue, but it can (and most times likely should) take the form of group-based capacity scheduling. The trade-off here is that organizational coordination/prioritization is a must.
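One lightweight way to define and communicate the version set explicitly is to keep a manifest under version control and have developers (or the CI system) verify their environment against it. A minimal sketch, with placeholder package names and versions:

    # check_versions.py -- sketch: verify the local environment against a declared version manifest
    import pkg_resources

    # Hypothetical manifest; in practice this would live in a versioned file in the repo.
    EXPECTED = {
        "pyspark": "1.6.1",
        "kafka-python": "1.2.0",
    }

    def check():
        problems = []
        for package, expected in EXPECTED.items():
            try:
                installed = pkg_resources.get_distribution(package).version
            except pkg_resources.DistributionNotFound:
                problems.append("%s: not installed (expected %s)" % (package, expected))
                continue
            if installed != expected:
                problems.append("%s: %s installed, %s expected" % (package, installed, expected))
        return problems

    if __name__ == "__main__":
        for problem in check():
            print(problem)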

Developers

Having available developers with deep knowledge of your technology stack is critical to success with data efforts. In addition, it’s necessary to provide those developers with the tools to do their job:

  • Desktop virtualization/containerization tools like VirtualBox, Vagrant, or Docker will allow developers to more easily “virtualize” the production environment locally.
  • Developers should be able to duplicate the execution environments locally as closely as possible. If the execution machines are Linux-based and your workstations are Windows, then tools like those above will be needed by necessity (see the sketch after this list).
  • You must align training plans with the roadmap of the platform/product. For example, if you plan on using modern data infrastructure, it is essential for developers to understand distributed systems principles.
  • You must allocate the appropriate development and operation personnel (or equivalent managed services) for the lifespan of the product/platform.
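As a concrete illustration of the containerization point above, the sketch below starts a local Docker container from the same base image the execution machines use, mounting the project directory so code edited on the workstation runs inside a production-like environment. The image name is a hypothetical placeholder.

    # dev_env.py -- sketch: start a local container that mirrors the production runtime
    import os
    import subprocess

    IMAGE = "yourorg/hadoop-dev:latest"  # hypothetical base image shared with the cluster

    def start():
        project_dir = os.getcwd()
        cmd = [
            "docker", "run", "--rm", "-it",
            # Mount the project so local edits are visible inside the container.
            "-v", project_dir + ":/workspace",
            "-w", "/workspace",
            IMAGE,
            "bash",
        ]
        subprocess.run(cmd)

    if __name__ == "__main__":
        start()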

Product Management

The platform aspect of data infrastructure development calls for strong coordination of consumer needs, developer enablement, and operational clarity. As such, it’s important to engage all relevant parties as early as possible in the process. Validate results against the consumer early and begin establishing the deployment process as soon as the first iteration of work begins. Other tips:

  • Address operations and maintenance concerns as early as possible by beginning to deploy ASAP during development in order to iterate and refine any issues.
  • Much of data work is inherently iterative, reinforcing the need for feedback support systems such as monitoring tools at the operational level and bug/issue trackers at the organizational level.
  • Like most modern software systems, getting feedback from end users is essential.
  • Establishing a roadmap for the data platform is imperative. Data platforms often aim for an organization-wide, cross-functional audience set. It is imperative to have a strategy on prioritizing consumer needs and onboarding new users/LoB/groups. The feature requests of the consumers must be treated as a prioritized set.
  • Availability, performance, and similar requirements have a large impact on design, due to the complexities of distributed systems in general and current Big Data technologies especially, so these things should be thought about as early as possible.
  • Any technology platforms or distribution used in production must have licensed support.

There are a lot of capabilities to think about, and it can seem daunting. Remember that at the highest level, you are aiming to give your teams as much visibility as possible into what’s happening through instrumentation, implementing processes that provide validation at every step, and automating the tasks that make sense. Of course, we are here to help.

Hopefully, the points highlighted above will be useful as you are establishing your development and operations practices. Anything you think we missed? Let us know your thoughts below in the comments.