Configuring a data platform and data science environment can be a tedious, error-prone process. As an answer to that challenge, we created a “push button” infrastructure tool.
In this post, Principal Engineer Mark Mims and Senior Solution Architect Heather Nelson provide more context around the project. Stay tuned for future posts that dive deeper into the technology.
What prompted the investigation into repeatable data platforms?
Mark: We help our clients with various kinds of data problems, from data strategy to highly specialized data science or engineering problems.
If clients have their own infrastructure in place, we work within that framework. However, some client engagements are “greenfield” from an infrastructure standpoint. This can happen for a couple of reasons: either we’re brought in to do a skunkworks project (often a client’s first foray into, say, cloud infrastructure), or we’re helping that client put a production data pipeline in place from day one.
During several of these greenfield engagements, we noticed a pattern. We were assembling, even for clients in different industries, some of the same basic building blocks of a platform. Rather than re-do these from scratch each time, we wanted to put together a toolkit we could use to build data pipelines.
Heather: As we investigated the opportunity, we realized that the benefits extended well beyond reuse.
For example, by industrializing our firm-wide knowledge we would also de-risk future engagements by establishing common, tried-and-true patterns for data workloads. Codifying our knowledge would also ensure a smooth transition to the client team at project conclusion, so that they could learn from and carry forward our work.
Plus, the ability to easily spin up new platforms on demand creates a whole new world of opportunity to address data science workloads. One new use case is the ability to easily create ephemeral “Data Labs” that allow data scientists to experiment in their own environments during the model development process. They can do this in a manner that is cost-effective and does not impact the work of others.
How could someone sell this to leadership? Can you share any real-world examples of the challenges this will help with?
Mark: This toolkit was started to scratch our own itch. The immediate benefit was a shorter “ramp-up” for engineering engagements, in both resources and time.
We soon realized that our clients benefit from this in a number of ways we didn’t initially anticipate. Certainly, it’s great to reduce delivery time for any data platform project. But there are also other benefits to being able to easily add new data platform instances to your infrastructure. Two main ones offhand are:
- infrastructure-level testing
- separation of ingestion from analytics
What groups or teams does an organization need to start building their own platforms?
Heather: Certainly the organization should employ data engineering and infrastructure experts. However, what really makes an organization successful is a willingness to be open-minded and to experiment with new technologies and ways of working.
Even in our case, had we known ahead of time about the unexpected benefits we would achieve, we would have started on this journey even sooner! Hindsight is 20/20 in that regard. In addition, breaking down silos between data scientists, data engineers, and operations is an important part of building platforms that work. The healthy dialog we have between people of all different skills and talents at our company allowed us to discover better ways of doing things that benefited everyone.
How would that group or team start?
Heather: Don’t reinvent the wheel! There are many great DevOps resources and thought leaders who can provide inspiration for organizations that are on the journey to build their own platforms. Also, SVDS will be releasing an open source version of our platform very soon, which is another great place to start.
In the future, we’ll be taking a deeper look at our “push button” tool. In the meantime, share your experiences in the comments—would this tool help you?