Many project teams have found themselves seemingly “so close” to completing a product rollout, yet never quite able to get there. It’s as if an invisible canyon lies between development and production, one that only heroic feats can cross (images of tightropes, rickety bridges, and simply leaping for it all come to mind). When they finally do cross, they have several bumps and bruises to commemorate the journey.
IT practitioners may find some comfort in the fact that they are not alone. A recent study by the Standish Group showed that 71% of projects are unsuccessful due to cost overruns, time overruns, or failure to deliver expected results. These are often symptoms of a lack of proper development practices (if you’re looking for a real-life example, just ask the team behind the famously botched HealthCare.gov rollout). And, unfortunately, the advent of distributed computing as the backbone of big data solutions has only made this problem more challenging.
Trust me, I’ve been there too. I know what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades to feature deployments to data migrations, where each step threatens to derail the plan. In this post I’ll give an overview of the obstacles I’ve faced (you may be able to relate) and talk about ways to overcome them.
Causes of Unsuccessful Deployments
Often, I’ve found that the worst offenders are the seemingly innocent, small snags, such as invalid configurations or outdated software versions that break code at runtime. In isolation, each snag isn’t a big deal; together, they can add up to a measurable overrun.
I have often seen that these small snags are rooted in two fundamental issues: big-bang deployments, where tasks cannot be worked incrementally, and development environments that do not match production. While the DevOps community is already addressing these issues, they become especially troublesome and noticeable when working with distributed big data systems.
Often, especially when working with big data, development environments are intentionally small to save on cost. If your production environment requires 50 data nodes to house your 100 TB of data, you probably aren’t going to replicate the exact same setup in a development environment. The consequences can be many. For example:
- Services are physically located in different places than they were in development
- Data processing underperforms once scaled to production needs
- Network latency and bandwidth constraints become noticeable at larger scales and as the number of nodes increases
- Polyglot architectures, while useful for ensuring the right tools are available for the right problems, increase the complexity of managing multiple technologies at once
And so on.
Thus, each time one of these issues crops up, you can expect last-minute fixes, configuration changes, and rework, causing your deployment to take much longer than anticipated.
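One way to catch mismatched configurations and software versions before deployment day is to compare the two environments mechanically. What follows is only a minimal sketch of that idea, not a description of any particular tool: the `find_drift` helper, the setting names, and all of the values are made up for illustration, and it assumes each environment can be summarized as a flat dictionary of settings.

```python
# Hypothetical sketch: report every setting that differs between two environments.
MISSING = "<missing>"  # placeholder for a setting that one environment lacks


def find_drift(dev, prod, ignore=()):
    """Return {setting: (dev_value, prod_value)} for each setting that differs."""
    drift = {}
    for key in sorted(set(dev) | set(prod)):
        if key in ignore:
            continue  # differences we accept on purpose, such as cluster size
        dev_value = dev.get(key, MISSING)
        prod_value = prod.get(key, MISSING)
        if dev_value != prod_value:
            drift[key] = (dev_value, prod_value)
    return drift


# Illustrative manifests; the names and values are assumptions, not real data.
dev_env = {"runtime_version": "1.8", "replication_factor": 1,
           "data_nodes": 3, "compression": "snappy"}
prod_env = {"runtime_version": "1.7", "replication_factor": 3,
            "data_nodes": 50, "compression": "snappy"}

drift = find_drift(dev_env, prod_env, ignore={"data_nodes"})
for key, (dev_value, prod_value) in drift.items():
    print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```

The `ignore` argument reflects the point above: a development cluster is intentionally smaller, so the node count is an expected difference, while a version mismatch is the kind of small snag worth flagging early.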
Solution: Architect for Agility
Some may believe that more thorough planning is the key to overcoming unsuccessful production deployments. However, based on my experience, I can tell you that no amount of planning will overcome issues that are simply not apparent until you actually see how services perform in the real world.
Instead, one must architect for agility. One of the most important steps in this pursuit is to move production deployment earlier in the process and to deploy in small increments, so that you get early and actionable feedback about how your services are going to perform. This is in keeping with our belief in the agile method: moving deployment activities up allows you to iterate much more quickly and frequently. It also reduces the risk of any individual task having a big impact on the timeline, since you do not need to worry about releasing everything in “one big bang.”
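To make the contrast with a big-bang release concrete, here is a rough sketch of deploying in increments and checking each one before moving on. It is an illustration under stated assumptions rather than a real deployment tool: `deploy_incrementally`, `fake_deploy`, and `fake_health_check` are hypothetical names, and the increment labels are invented stand-ins for whatever units of work a team actually ships.

```python
# Hypothetical sketch: ship one increment at a time and stop at the first problem.
def deploy_incrementally(increments, deploy, is_healthy):
    """Deploy increments in order; return (deployed, failed).

    `failed` is the first increment whose health check did not pass,
    or None if every increment succeeded.
    """
    deployed = []
    for increment in increments:
        deploy(increment)
        if not is_healthy(increment):
            return deployed, increment  # early feedback: stop here
        deployed.append(increment)
    return deployed, None


# Stand-ins for real deployment and health-check steps (assumptions).
def fake_deploy(name):
    print(f"deploying {name}")


def fake_health_check(name):
    return name != "data-migration"  # pretend the migration misbehaves


done, failed = deploy_incrementally(
    ["infrastructure-upgrade", "data-migration", "new-feature"],
    fake_deploy,
    fake_health_check,
)
print("deployed:", done, "stopped at:", failed)
```

Because the loop stops at the first unhealthy increment, a problem in one task surfaces on its own instead of being tangled up with everything else in the release.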
Opportunities for Further Improvement
Agile deployment depends on giving teams the capabilities they need to deploy early and often.
At Silicon Valley Data Science we are investing in several ongoing initiatives that are designed to enable faster deployments and shorter feedback cycles:
- “Push-button” infrastructure builds will allow us to quickly spin up and tear down “production-like” clusters at will, enabling us to test capabilities without fear of running up costs
- Monitoring frameworks will allow us to refine data storage and performance requirements throughout the development process, as well as proactively diagnose and take action on production issues in real time
- Automated test suites will allow us to deploy new features and automatically regression test existing ones to ensure we do not impact stable code
These capabilities will allow our developers to immediately test features in production-like environments as they are developed and to obtain advance notice about areas where processes may have defects or may bottleneck on performance. If problems are discovered, they can be addressed immediately, before they become a real issue in production.
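As a small illustration of what “advance notice” about a performance bottleneck could look like, the sketch below compares measured stage durations against agreed budgets. It is not a description of our monitoring work: the `find_bottlenecks` helper, the stage names, and the numbers are all assumptions chosen for the example.

```python
# Hypothetical sketch: flag pipeline stages that run over their time budget.
def find_bottlenecks(stage_seconds, budgets):
    """Return [(stage, seconds_over_budget)], largest overrun first."""
    over = []
    for stage, seconds in stage_seconds.items():
        budget = budgets.get(stage)
        if budget is not None and seconds > budget:
            over.append((stage, seconds - budget))
    return sorted(over, key=lambda item: item[1], reverse=True)


# Illustrative measurements from a production-like test run (made-up values).
measured = {"ingest": 120.0, "transform": 540.0, "load": 95.0}
budgets = {"ingest": 180.0, "transform": 300.0, "load": 60.0}

bottlenecks = find_bottlenecks(measured, budgets)
for stage, overrun in bottlenecks:
    print(f"{stage} is {overrun:.0f}s over budget")
```

Running a check like this after every test deployment gives a simple, repeatable signal about which stage to look at first.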
We are very excited about the development of these capabilities and look forward to sharing our progress as our work unfolds. We also invite the community to share its ideas on other best practices and initiatives in the comments below.