Strata + Hadoop World San Jose 2016

Many of us will be at the Strata Conference + Hadoop World 2016 in San Jose, and we’d love to see you there!

Get the slides!

If you’re not able to attend the conference, or if you miss one of our sessions at Strata + Hadoop World, sign up with your email here, and we’ll send you copies of all our slides after the conference is over. If you sign up before the conference, we’ll also send you our Data Strategy position paper, which is great companion reading if you’re attending any of our talks.

Visit us at booth 737 in the Expo Hall to meet us and check out some of the R&D we are doing. And please join us for our tutorial and presentation sessions:

Tuesday, March 29

Architecting a Data Platform

9:00am–12:30pm in Room LL21 C/D
John Akred, Stephen O’Sullivan & Gary Dusbabek

What are the essential components of a data platform? John Akred and Stephen O’Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, John and Stephen explore the options and considerations for components, including:

Acquisition: from internal and external data sources
Ingestion: offline and real-time processing
Storage
Analytics: batch and interactive
Providing data services: exposing data to applications

The Business Case for Spark, Kafka, and Friends

9:30–10:00am in Room LL20 B
Edd Dumbill

Spark is white-hot at the moment, but why does it matter? The secret power of big data technologies is that they promote flexible development patterns and economic scaling and are ready to adapt to business needs. But years of focusing on the label “big” have obscured much of the value to those approaching the topic. Skepticism and hype-fatigue are understandable reactions.

Developers are usually the first to understand why some technologies cause more excitement than others. Edd Dumbill relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2016 to explain why they’re exciting in terms of both new capabilities and the new economies they bring. Edd explores the emerging platforms of choice and explains where they fit into a complete data architecture and what they have to offer in terms of new capabilities, efficiencies, and economies of use.

Developing a Modern Enterprise Data Strategy

1:30–5:00pm in Room LL21 C/D
Edd Dumbill & Scott Kurth

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technical solutions? Fundamentally, data should serve the strategic imperatives of a business—those key strategic aspirations that define the future vision for an organization. A data strategy should guide your organization in two key areas: what actions your business should take to get started with data and where to realize the most value.

Edd Dumbill and Scott Kurth explain how to solve real business challenges with data.

Topics include:

Why have a data strategy?
Connecting data with business
Devising a data strategy
The data value chain
New technology potentials
Project development style
Organizing to execute your strategy

How to Eat Change for Breakfast: Building an Experimental Enterprise

4:10–4:55pm in Room 211 C
Sanjay Mathur
NOTE: This talk is part of the concurrent Cultivate conference.

The world of data—from practitioner skill sets to consumer assumptions and even employee expectations—is changing. Our current and potential consulting clients recognize this: the recruits they’re talking to are eager to employ data in their decision-making processes. To truly take advantage of this fact—to thrive—your business must adapt.

An experimental enterprise is, fundamentally, an organization that thrives on change and uses data as a catalyst. Becoming an experimental enterprise means reshaping the way you and your company understand things like failure, the role of technology, and your own gut instinct. But the benefits are fantastic learning and growth. Sanjay Mathur offers three key questions to ask yourself and three pitfalls to avoid along the way.

Thursday, March 31

Format Wars: From VHS and Beta to Avro and Parquet

1:50–2:30pm in Room 230 A
Silvia Oliveros & Stephen O’Sullivan

Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance.

Silvia Oliveros and Stephen O’Sullivan cover the four major data formats (plain text, SequenceFile, Avro, and Parquet) and provide insight into what they are and how to best use and store them in HDFS. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, Silvia and Stephen have observed performance differences on the order of 25x between Parquet and plain text files for certain workloads. However, it isn’t the case that one is always better than the others.

Drawing from a few real-world use cases, Silvia and Stephen cover the hows, whys, and whens of choosing one format over another and take a closer look at some of the tradeoffs each offers.
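
To give a feel for why the choice of format matters, here is a minimal sketch contrasting a row-oriented text layout with a column-oriented one. It is not taken from the talk and does not use Parquet, Avro, or HDFS: the sample records, field names, and helper functions are all hypothetical, and the in-memory "columns" only stand in for the general idea behind columnar formats. It makes no attempt to reproduce the performance figure mentioned above.

```python
import csv
import io

# Hypothetical sample records; the field names are illustrative only.
ROWS = [
    {"user": "alice", "country": "US", "amount": 12.5},
    {"user": "bob", "country": "IE", "amount": 7.0},
    {"user": "carol", "country": "US", "amount": 30.25},
]


def to_row_text(rows):
    """Serialize records row by row, as a plain text (CSV) file would."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["user", "country", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_columns(rows):
    """Group values by column: a toy stand-in for a columnar layout."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_from_text(text):
    """Row layout: every field of every row is parsed to read one column."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(float(record["amount"]) for record in reader)


def sum_amount_from_columns(columns):
    """Column layout: only the requested column is touched."""
    return sum(columns["amount"])


text = to_row_text(ROWS)
columns = to_columns(ROWS)
total_from_rows = sum_amount_from_text(text)
total_from_columns = sum_amount_from_columns(columns)
```

Both functions return the same total, but the row-oriented version has to parse every field to get there. A query that reads most fields of each record would not see the same benefit, which is one reason no single format wins for every workload.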

Ask Us Anything: Developing a Modern Enterprise Data Strategy

4:20–5:00pm in Room 211 A–C
John Akred, Scott Kurth & Colette Glaeser

The team behind the tutorial “Developing a Modern Enterprise Data Strategy” fields a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.