Spark Summit 2017

Name: Spark Summit 2017
Start: 2017-06-05T00:00:00-07:00
End: 2017-06-07T23:59:59-07:00
Location: Moscone Center

Join us at Spark Summit in San Francisco, where we’ll be giving a tutorial on data platforms as well as sessions on PySpark and graph algorithms. To talk more, find CTO John Akred, VP of Engineering Stephen O’Sullivan, or Principal Data Engineer and Spark contributor Andrew Ray.

Monday, June 5

Architecting a Data Platform

9:00am-6:00pm in Room 2008

John Akred, Stephen O’Sullivan, Andrew Ray

What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including:

Acquisition: from internal and external data sources
Ingestion: offline and real-time processing
Storage
Analytics: batch and interactive
Providing data services: exposing data to applications

We’ll also give advice on:

tool selection
the function of the major Hadoop components and other big data technologies such as Spark and Kafka
integration with legacy systems

Wednesday, June 7

Data Wrangling with PySpark for Data Scientists Who Know Pandas

11:40am-12:10pm in Room 2020

Andrew Ray

Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.

In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance considerations, and debugging.

Write Graph Algorithms Like a Boss

3:20pm-3:50pm in Room 2020

Andrew Ray

Graph-parallel algorithms such as PageRank operate on an entire graph at once. Efficient distributed implementations of these algorithms are important at scale. This session will introduce the two main abstractions for these types of algorithms: Pregel and PowerGraph.

Explore how GraphX combines the best of both abstractions and walk through multiple example algorithms. Note: Familiarity with Apache Spark and basic graph concepts is expected.
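For readers new to the vertex-centric model, here is a minimal single-machine sketch of a Pregel-style PageRank in plain Python. It is not GraphX code and not taken from the session; the function name `pregel_pagerank` and its parameters are hypothetical, and it assumes every vertex has at least one outgoing edge (rank held by dangling vertices would otherwise be dropped).

```python
from collections import defaultdict


def pregel_pagerank(edges, num_supersteps=20, damping=0.85):
    """Illustrative vertex-centric PageRank over a list of (src, dst) edges.

    Each superstep has two phases, loosely following the Pregel model:
    every vertex sends rank / out_degree along its out-edges, then every
    vertex sums its incoming messages and updates its own rank.
    """
    out_neighbors = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_neighbors[src].append(dst)
        vertices.update((src, dst))

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(num_supersteps):
        # Message phase: each vertex splits its rank evenly among its out-neighbors.
        inbox = defaultdict(float)
        for v in vertices:
            targets = out_neighbors.get(v, [])
            if targets:
                share = ranks[v] / len(targets)
                for dst in targets:
                    inbox[dst] += share

        # Compute phase: each vertex combines its messages into a new rank.
        # Assumption: no dangling vertices, so total rank stays at 1.
        ranks = {v: (1 - damping) / n + damping * inbox[v] for v in vertices}

    return ranks


if __name__ == "__main__":
    example_edges = [("a", "b"), ("b", "a"), ("c", "a")]
    for vertex, rank in sorted(pregel_pagerank(example_edges, 100).items()):
        print(vertex, round(rank, 4))
```

In a distributed setting the message and compute phases would run in parallel across partitions of the graph; this sketch only shows the per-vertex logic.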