Enough Data Engineering for a Data Scientist

Name: Enough Data Engineering for a Data Scientist
Start: 2017-06-23T09:00:00+01:00
End: 2017-06-23T13:00:00+01:00
Location: The Balmoral Hotel

Enough Data Engineering for a Data Scientist is a special event being hosted by The Data Lab at the beautiful Balmoral Hotel in Edinburgh. If you’re not able to attend, please download the slides!

Friday, June 23

How I Learned to Stop Worrying and Love the Data Scientists

9:30–11:00am

Stephen O'Sullivan

So how much data engineering should a Data Scientist know?

Before a Data Scientist can get to the fun part of their job (the data science!), they normally have to do a bit of data engineering, like on-boarding data or doing a little bit of “wrangling”. In most cases, this preparation is 50–80% of the work.

The work is then handed over to the Data Engineering team to put into production (via dev, test, and QA, of course). This is when a “little bit” of contention arises: because Data Scientists and Data Engineers develop code and models differently, the Data Engineering team will, in most cases, have to do “some” modification/re-write/head-shaking/hand-wringing to make the code production-ready and meet the SLAs defined by the business.

In this talk, Stephen will take the Data Scientist on a journey: from on-boarding data, and how different data/object stores can help; to understanding and choosing the right data format for the data assets; exploring some different query engines, and some basic query tuning for each; explaining how a distributed streaming platform works, and how you can take advantage of it; and lastly covering some good coding practices. This will give the Data Scientist new skills to help them be more productive, so that they can get to the fun part faster.

It should also reduce the contention with the Data Engineering team, and have them saying, “How I Learned to Stop Worrying and Love the Data Scientists”!

The topics covered include:

On-Boarding Data

Load into Data/Object Stores
Load into Memory
Partition Strategies

Data Formats

Text
Avro
Parquet
ORC
Schema Evolution

Query Engines

Initial “Create Table”
Hive
Impala
Presto
Spark SQL
Explain Plan on SQL / SQL Tuning

Distributed Message Bus/Streaming Platform

Stream Processing
Partition Strategies

Good Coding Practices

Source Control
Unit Tests
Continuous Integration
Catching Errors
Alerting & Monitoring
