Wednesday, July 27
What’s New in Spark 2.0: Structured Streaming and Datasets
Spark 2.0 includes many exciting new features including Structured Streaming, and the unification of Datasets (new in 1.6) with DataFrames. Structured Streaming allows one to define recurrent queries on a stream of data that is handled as an infinite DataFrame. This query is incrementally updated with new data. This allows for code reuse between batch and streaming and an easier logical model to reason about. Datasets, an extension of DataFrames, were added as an experimental feature in Spark 1.6. They allow us to manipulate collections of objects in a type-safe fashion. In Spark 2.0 the two abstractions have been unified and now DataFrame = Dataset[Row]. We will discuss both of these new features and look at practical real world examples.
To get the most out of this talk some familiarity with Spark is expected.