Data Day Seattle 2016

Several of us will be at Data Day Seattle this year, and we’d love a chance to say hi!

Saturday, July 23

What's Your Data Worth?


The unique properties of data make its value difficult to assess with the traditional approaches to intangible asset valuation. In this talk, John Akred will discuss a number of alternative approaches to valuing data within an organization for specific purposes, including informing decisions to purchase third-party data and monitoring data’s value internally to manage and increase that value over time. Data is difficult to value in large part because, economically, it does not adhere to the three main conditions of a traditional market system, and the traditional methods for valuing intangible assets do not carry over to it.

John will give several examples of how to use methods such as the Value of Information (VOI) framework and A/B testing to assess whether a third-party data source should be purchased, or should continue to be purchased. He will also show how mutual information (MI) can be used to assess the value of a data source once it is in use within the organization. Lastly, he will discuss the qualities that make data more valuable within an organization and provide a range of concrete, straightforward metrics for monitoring the value of data internally, so that business decisions can be made to maximize that value over time.
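
For readers who have not met mutual information before, here is a minimal sketch of the idea, not the method from the talk. It assumes the data source and the business outcome can both be treated as discrete variables, and it estimates MI in bits from simple counts. The function name `mutual_information` and the toy lists (`churned`, `vendor_segment`, `vendor_noise`) are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")

    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: an outcome we care about and two candidate fields
# from a purchased data source.
churned = [1, 1, 0, 0, 1, 0, 1, 0]
vendor_segment = ["a", "a", "b", "b", "a", "b", "a", "b"]  # tracks the outcome exactly
vendor_noise = ["x", "y", "x", "y", "x", "x", "y", "y"]    # independent of the outcome

print(mutual_information(vendor_segment, churned))  # 1.0 bit
print(mutual_information(vendor_noise, churned))    # 0.0 bits
```

In this toy case the first field removes all uncertainty about the outcome (one full bit), while the second removes none. With real data, count-based estimates on small samples tend to overstate MI, so a score like this is better read as a relative ranking of fields than as an absolute measure of value.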

Catching trains: Iterative model development with Jupyter Notebook


Jupyter notebooks have become a highly valued medium for data scientists and researchers to explore data and develop models. They provide immediate feedback and act as an easy, integrated record of what was done, which makes them an effective way to communicate work and results to a variety of audiences. However, the ad hoc nature of the notebook can often result in a code free-for-all, leading to poor-quality code that cannot be reused or reproduced without serious rework.

This talk will step through the development of a multi-step algorithm that detects a passing train and its direction from video, in order to demonstrate best practices and tips for developing and iterating on models within Jupyter notebooks. These best practices include maintaining a productive workflow for data scientists, one that both maximizes reproducibility and allows for effective communication.

Data Pipelines with Kafka and Spark


Spark and Kafka have emerged as a core part of distributed data processing pipelines. This tutorial will explain how Spark, Kafka, and the rest of the big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads. By examining use cases and architectures, we’ll trace the flow of data from source to output, and explore the options and considerations for each stage of the pipeline.