It’s been hard to miss Apache Spark in the last year. The hottest star in the big data world has brought startup Databricks to the public eye, and has been quickly embraced by Hadoop distributors Cloudera and Hortonworks. Many systems integrators, including ourselves, have also been enthusiastic about it. Here at Silicon Valley Data Science, we’re actively using Spark on multiple customer projects.
Why has this technology taken off so fast, and what does it mean for businesses using big data?
1. Spark enables use cases “traditional” Hadoop can’t handle
Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce: stream processing, machine learning, graph computing, and interactive analytics. This opens the big data world to applications of data science that were previously too expensive or too slow to run on massive data sets.
2. Spark is fast
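For workloads that fit in memory, such as iterative algorithms and interactive queries, Spark can run analytics orders of magnitude faster than Hadoop MapReduce, largely because it avoids writing intermediate results to disk between steps. This means more interactivity, faster experimentation, and increased productivity for analysts.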
3. Spark can use your existing big data investment
When Hadoop came along, businesses invested in new compute clusters to use the technology. Spark doesn’t have that type of barrier: it can be used on top of existing Hadoop investments to quickly realize new features. What’s more, Spark is highly compatible with the Hadoop universe: it can use data in HDFS, and run under Hadoop 2.0’s YARN. As well as Hadoop, Spark can work with Cassandra and Amazon’s S3 storage.
4. Spark speaks SQL
SQL is the lingua franca of the structured data world. Spark’s SQL module means that existing data sources, including Hive, can be brought into a computation, and that existing investments in BI tools can be leveraged over big data. Spark SQL is less mature than other big data SQL implementations, but it is improving quickly.
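To make this concrete, here is a minimal sketch of running a SQL query through Spark from Python. It assumes a recent PySpark release that provides `SparkSession`; the `inventory` view, the sample rows, and the function names are hypothetical and exist only for illustration.

```python
# Hypothetical sample data: (store, sku, units on hand).
SAMPLE_ROWS = [
    ("store-a", "sku-1", 12),
    ("store-a", "sku-2", 3),
    ("store-b", "sku-1", 7),
]
COLUMNS = ["store", "sku", "units"]

# Plain SQL, as an analyst or BI tool might issue it.
QUERY = """
    SELECT store, SUM(units) AS total_units
    FROM inventory
    GROUP BY store
    ORDER BY store
"""


def units_per_store(spark, rows=SAMPLE_ROWS):
    """Register the rows as a temporary view and query them with SQL."""
    df = spark.createDataFrame(rows, COLUMNS)
    df.createOrReplaceTempView("inventory")  # "inventory" is an illustrative name
    return [(r["store"], r["total_units"]) for r in spark.sql(QUERY).collect()]


def run_locally():
    """Start a local Spark session (assumes PySpark is installed) and run the query."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        return units_per_store(spark)
    finally:
        spark.stop()
```

Calling `run_locally()` on a machine with PySpark installed runs the query on a single-machine Spark instance. In a real deployment the view would more likely be backed by a Hive table or files in HDFS than by in-line sample rows.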
5. Spark is developer-friendly
Never underestimate the power of a technology that developers find easy to use. Although Spark is written in a relatively new programming language, Scala, developers enjoy the concise and fluid way it can be programmed. Java, the language of Hadoop, is also supported, as is Python, the darling language of the data scientist.
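As a rough illustration of that conciseness, the classic word count fits in a few chained calls in Python. This is a sketch rather than production code: it assumes a recent PySpark release with `SparkSession`, and the helper names and sample lines are our own invention.

```python
import re

# Hypothetical input; in practice this would come from HDFS, S3, or similar.
SAMPLE_LINES = [
    "Spark is fast",
    "Spark speaks SQL",
]


def tokenize(line):
    """Split a line into lowercase words (pure Python, no Spark required)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(spark, lines):
    """Count word occurrences across `lines` using Spark's RDD API."""
    counts = (
        spark.sparkContext.parallelize(lines)
        .flatMap(tokenize)                 # one record per word
        .map(lambda word: (word, 1))       # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)   # sum the counts for each word
    )
    return dict(counts.collect())


def run_locally():
    """Start a local Spark session (assumes PySpark is installed) and count words."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("word-count-sketch")
        .getOrCreate()
    )
    try:
        return word_count(spark, SAMPLE_LINES)
    finally:
        spark.stop()
```

The same pipeline written against Hadoop MapReduce in Java typically needs separate mapper, reducer, and driver classes, which is a large part of why developers find Spark's API pleasant to work with.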
Spark at Silicon Valley Data Science
Spark’s power means it has quickly found multiple use cases in large enterprises despite its being a relatively new technology. At SVDS, we’ve implemented Spark in several situations, including:
- managing a major retailer’s inventory across a diverse network of entities in near real time;
- managing and processing event streams for online gaming; and
- supporting data science initiatives across massive data sets at a media analytics company.
We’re happy to be a sponsor of Spark Summit in New York this week. If you’re going, please visit us at table 27 and say hello: we have plenty of real-world Spark experience, and we’d love to share our thoughts.