5 Reasons Why Spark Matters to Business

March 17th, 2015

It’s been hard to miss Apache Spark in the last year. The hottest star in the big data world has brought startup Databricks to the public eye, and has been quickly embraced by Hadoop distributors Cloudera and Hortonworks. Many systems integrators, including ourselves, have also been enthusiastic about it. Here at Silicon Valley Data Science, we’re actively using Spark on multiple customer projects.

Why has this technology taken off so fast, and what does it mean for businesses using big data?

1. Spark enables use cases “traditional” Hadoop can’t handle

Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce: stream processing, machine learning, graph computing, and interactive analytics. This opens up applications of data science that were previously too expensive or too slow to run on massive data sets.

2. Spark is fast

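Because it keeps working data in memory rather than writing intermediate results to disk, Spark can run some analytics orders of magnitude faster than MapReduce jobs on existing Hadoop deployments. This means more interactivity, faster experimentation, and increased productivity for analysts.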

3. Spark can use your existing big data investment

When Hadoop came along, businesses invested in new compute clusters to use the technology. Spark doesn’t have that type of barrier: it can be used on top of existing Hadoop investments to quickly realize new features. What’s more, Spark is highly compatible with the Hadoop universe: it can use data in HDFS, and run under Hadoop 2.0’s YARN. As well as Hadoop, Spark can work with Cassandra and Amazon’s S3 storage.

4. Spark speaks SQL

SQL is the lingua franca of the structured data world. Spark’s SQL module means existing data sources, including Hive, can be brought into a computation, and existing investments in BI tools can be put to work on big data. Spark SQL is less mature than other big data SQL implementations, but it is coming on strong.

5. Spark is developer-friendly

Never underestimate the power of a technology that developers find easy to use. Although Spark is written in a relatively new programming language, Scala, developers enjoy the concise and fluid way it can be programmed. Java, the language of Hadoop, is also supported, as is Python, the darling language of the data scientist.
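To give a feel for that concise style, here is a minimal sketch of a word count written as a short chain of transformations. It is plain Python rather than Spark itself, so it runs without a cluster, and the sample lines are made up for illustration. In Spark's Python API, the same pipeline would typically be expressed with the RDD methods flatMap, map, and reduceByKey.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data standing in for lines read from a large file.
lines = [
    "spark is fast",
    "spark speaks sql",
    "sql is everywhere",
]

# Split every line into words and flatten the result into one stream
# (the role flatMap plays in a Spark pipeline).
words = chain.from_iterable(line.split() for line in lines)

# Tally occurrences per word (the role of mapping each word to a
# (word, 1) pair and then reducing by key in a Spark pipeline).
counts = Counter(words)

for word, n in sorted(counts.items()):
    print(word, n)
```

The point is the shape of the program: a few declarative steps, with no explicit job setup or boilerplate around them.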

Spark at Silicon Valley Data Science

Spark’s power means it has quickly found multiple use cases in large enterprises despite its being a relatively new technology. At SVDS, we’ve implemented Spark in several situations, including:

managing a major retailer’s inventory across a diverse network of entities in near real time;
managing and processing event streams for online gaming; and
supporting data science initiatives across massive data sets at a media analytics company.

We’re happy to be a sponsor of Spark Summit in New York this week. If you’re going, please visit us at table 27 and say hello: we have plenty of real-world Spark experience, and we’d love to share our thoughts.
