As a buzzword, the phrase “big data” summons many things to mind, but to understand its real potential, look to the businesses creating the technology. Google, Facebook, Microsoft, and Yahoo are driven by very large customer bases, a focus on experimentation, and a need to put data science into production. They need the ability to be agile, while still handling diverse and sizable data volumes.
The resulting set of technologies, centered on the cloud and big data, has brought us a set of capabilities that can equip any business with the same flexibility. That is why the real benefit of big data is agility.
We can break this agility down into three areas: purchasing and resource acquisition, architecture, and development.
Linear scale-out cost. A significant advantage of big data technologies such as Hadoop is that they are scalable. That is, when you add more data, the extra cost of compute and storage is approximately linear with the increase in capacity.
Why is this a big deal? Architectures that don’t have this capability will max out at a certain capacity, beyond which costs get prohibitive. For example, NetApp found that in order to implement telemetry and performance monitoring on their products, they needed to move to Hadoop and Cassandra, because their existing Oracle investment would have been too expensive to scale with the demand.
This scalability means that you can start small, but you won’t have to change the platform when you grow.
Opex vs capex. Many big data and data science applications use cloud services, which offer a different cost profile from owning dedicated hardware. Rather than being lumbered with a large capital investment, using the cloud turns compute into an operational cost. This opens up new flexibility. Many tasks, such as large, periodic Extract-Transform-Load (ETL) jobs, simply don't require compute power 24/7, so why pay for it? Additionally, data scientists can now leverage the elasticity of cloud resources: following up a hypothesis might need 1,000 compute nodes, but only for a day. Before the cloud, that was never possible without a huge investment: certainly not one anybody would have made for a single experiment.
Ease of purchase. A little while ago I was speaking to a CIO of a US city, and we were discussing his use of Amazon’s cloud data warehouse, Redshift. Curious, I inquired which technical capability had attracted him. It wasn’t a technical reason: it turned out he could unblock a project he had by using cloud services, rather than wait three months for a cumbersome purchase process from his existing database company.
And it’s not just the ability to use cloud services that affects purchase either: most big data platforms are open source. This means you can get on immediately with prototyping and implementation, and make purchase decisions further down the line when you’re ready for production support.
Schema on read. Hadoop turned the traditional way of using analytic databases on its head. When compute and storage are at a premium, the traditional Extract-Transform-Load way of importing data made sense. You optimized the data for its application—applied a schema—before importing it. The downside there is that you are stuck with those schema decisions, which are expensive to change.
The plentiful compute and storage characteristics of scale-out big data technology changed the game. Now you can pursue Extract-Load-Transform strategies, sometimes called “schema on read.” In other words, store data in its raw form, and optimize it for use just before it’s needed. This means that you’re not stuck with one set of schema decisions forever, and it’s easier to serve multiple applications with the same data set. It enables a more agile approach, where data can be refined iteratively for the purpose at hand.
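To make schema-on-read concrete, here is a minimal sketch in plain Python (the event data and field names are hypothetical, and a real system would use a platform such as Hadoop or Spark): the raw records are stored exactly as they arrived, and each application applies its own schema only at read time.

```python
import json

# Hypothetical raw event log, stored exactly as it arrived (extract, load).
raw_events = [
    '{"user": "ana", "action": "click", "ts": "2015-11-30T10:00:00", "page": "/home"}',
    '{"user": "ben", "action": "purchase", "ts": "2015-11-30T10:05:00", "amount": 19.99}',
]

def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: project only the fields one
    application needs, leaving the raw data untouched for others."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Two applications, two schemas, one raw data set.
clickstream = list(read_with_schema(raw_events, ["user", "page"]))
revenue = list(read_with_schema(raw_events, ["user", "amount"]))
```

Because the transformation happens on the way out rather than on the way in, changing a schema decision later means changing only the read-side projection, not re-importing the data.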
Rapid deployment. The emergence of software running across large clusters of commodity hardware has also necessitated tools that deploy to many nodes at once. Pioneered by early web companies such as Flickr, the DevOps movement has ensured that we have the technology to safely bring new versions of software into service many times a day, should we wish. No longer must we bet three months ahead on a software release: new models and ways of processing data can be introduced, and backed out, in a very flexible manner.
Faithful development environments. One vexing aspect of development, exacerbated by deploying to large server clusters, is the disparity between the production environment software runs in and the environment in which a developer or data scientist creates it. It's a source of continuing deployment risk. Advances in container and virtualization technologies mean that it's now much easier for developers to use a faithful copy of the production environment, reducing bugs and deployment friction. Additionally, technologies such as notebooks make it easier for data scientists to operate on an entire data set, rather than just a subset that will fit on their laptop.
Fun. Human factors matter a lot. Arcane or cumbersome programming models take the fun out of developing. Who enjoys SQL queries that run to over 300 lines? Or waiting an hour for a computation to return? One of the key advantages of the Spark analytical project is that it is an enjoyable environment to use. Its predecessor, Hadoop's MapReduce, was a lot more tedious to use, despite the advances that it brought. The best developers gravitate to the best tools.
Concision. As big data technologies advance, the amount of code required to implement an algorithm has shrunk. Early big data programs needed a lot of boilerplate code, and their structures obscured the key transformations that the program implemented. Concise programming environments mean code is faster to write, easier to reason about, and easier to collaborate over.
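As an illustration of the shift (a sketch using plain Python rather than any particular big data framework): the classic word-count example, which in early MapReduce required a mapper class, a reducer class, and driver boilerplate, now reduces to a couple of expressions that keep the key transformations in plain view.

```python
from collections import Counter

lines = ["big data is agile", "agile data wins"]

# The entire word-count pipeline is one short expression:
# split each line into words, then tally them.
counts = Counter(word for line in lines for word in line.split())
```

Modern APIs such as Spark's follow the same pattern: the transformation chain is the program, with no scaffolding to obscure it.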
Easier to test. When code moves from targeting a single machine to a scaled computing environment, testing becomes difficult. The focus on testing that has come from the last decade of agile software engineering is now catching up to big data, and Spark in particular incorporates testing capabilities. Better testing is vital as data science finds its way as part of production systems, not just as standalone analyses. Tests enable developers to move with the confidence that changes aren’t breaking things.
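The same discipline applies whether the code targets a laptop or a cluster: keep each transformation a pure function, then assert on its behavior. A minimal sketch (the pipeline step and its field names are hypothetical):

```python
def normalize_amounts(records):
    """Hypothetical pipeline step: drop records that lack an amount
    and convert the rest to integer cents for downstream use."""
    return [
        {**record, "cents": int(round(record["amount"] * 100))}
        for record in records
        if record.get("amount") is not None
    ]

def test_converts_to_cents():
    out = normalize_amounts([{"id": 1, "amount": 19.99}])
    assert out[0]["cents"] == 1999

def test_drops_missing_amounts():
    assert normalize_amounts([{"id": 2}]) == []

# Run the checks directly; in practice a test runner would collect them.
test_converts_to_cents()
test_drops_missing_amounts()
```

Because the function has no dependency on where it runs, the same tests protect it whether it is applied to ten records locally or to billions on a cluster.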
Any technology is only as good as the way in which you use it. Successfully adopting big data isn’t just about large volumes of data, but also about learning from its heritage—those companies which are themselves data-driven.
The legacy of big data technologies is an unprecedented business agility: for creating value with data, managing costs, lowering risk, and being able to move quickly into new opportunities.
Editor’s note: Our CTO John Akred will be at Strata + Hadoop World Singapore next week, talking about the business opportunities inherent in the latest technologies, such as Spark, Docker, and Jupyter Notebooks.