Enterprise Data World 2017

Enterprise Data World focuses on data-driven business. Several of us will be there this year, talking about data platforms and enterprise data science. If you can’t make it to Atlanta, you can sign up on this page to receive our slides.

Monday, April 3

Architecting a Big Data Platform


What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark, and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including:

  • Acquisition: from internal and external data sources
  • Ingestion: offline and real-time processing
  • Storage
  • Analytics: batch and interactive
  • Providing data services: exposing data to applications

We’ll also give advice on:

  • Tool selection
  • The function of the major Hadoop components and other big data technologies such as Spark and Kafka
  • Integration with legacy systems

Managing Data Science in the Enterprise


Organizing around data is a concern for the whole business. The lone-ranger data scientist is a myth: using data effectively requires cross-functional collaboration, organizational adaptation, and a shared understanding of what it takes to create business value from data.

In this tutorial, we will share our methods and observations from three years of deploying data science in enterprise organizations. Attendees will learn how to build, run, and get the most value from data science teams, and how to work with and plan for the needs of the business. We will cover:


  • Data science in the enterprise
  • Building a data-driven culture
  • Organizational concerns for data science
  • Data science techniques
  • Methods for running a data science project
  • Hiring and managing data scientists
  • Tools and platforms
  • Deploying data science: from the lab to the factory
  • Data science maturity models

Wednesday, April 5

Instant and Repeatable Data Platforms


Configuring a data platform and data science environment can be a tedious, error-prone process. It spans development, continuous integration, QA, staging, and production environments, and the setup often has to be done from scratch. By combining cloud platforms such as AWS or Azure with Terraform and Ansible, we can create repeatable data science infrastructure.
In this talk, we’ll discuss our “push button” infrastructure tool and how attendees can use it in their own projects to create a cloud-agnostic environment that spins up quickly and is easy to configure as required.

We will cover:

  • Use cases, such as bringing up the same cluster repeatedly or recovering from a disaster
  • How to parameterize your cloud environment
  • Creating a data lab for the data scientist, with all the tools they require for their exploration
  • The development and release process, including integration testing
  • How to model costs in real time to weigh price against desired performance
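
To make the ideas of parameterizing an environment and modeling its cost more concrete, here is a minimal Python sketch. It is not the tool discussed in the talk. The `ClusterSpec` class, the `cheapest` helper, and the hourly prices are all hypothetical, chosen only for illustration; a real setup would take prices from the cloud provider and hand the chosen parameters to Terraform and Ansible.

```python
from dataclasses import dataclass

# Hypothetical hourly prices in USD, for illustration only.
# These are NOT real AWS or Azure prices.
HOURLY_PRICE_USD = {
    ("aws", "small"): 0.10,
    ("aws", "large"): 0.40,
    ("azure", "small"): 0.11,
    ("azure", "large"): 0.42,
}


@dataclass(frozen=True)
class ClusterSpec:
    """Parameters describing one cluster (hypothetical structure)."""
    provider: str
    node_size: str
    node_count: int

    def hourly_cost(self) -> float:
        # Cost is the per-node price multiplied by the number of nodes.
        return HOURLY_PRICE_USD[(self.provider, self.node_size)] * self.node_count


def cheapest(specs):
    """Return the spec with the lowest estimated hourly cost."""
    return min(specs, key=lambda spec: spec.hourly_cost())


# Candidate configurations for the same workload (assumed, not measured).
candidates = [
    ClusterSpec("aws", "small", 8),
    ClusterSpec("aws", "large", 3),
    ClusterSpec("azure", "large", 2),
]

best = cheapest(candidates)
print(best, round(best.hourly_cost(), 2))
```

Because every environment is described by the same small set of parameters, the same description can be reused for development, QA, and production, and comparing costs becomes a matter of evaluating each candidate. This sketch uses a static price table; a live cost model would refresh prices and factor in measured performance.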