Enough Data Engineering for a Data Scientist

Enough Data Engineering for a Data Scientist is a special event being hosted by The Data Lab at the beautiful Balmoral Hotel in Edinburgh. If you’re not able to attend, please download the slides!

Friday, June 23

How I Learned to Stop Worrying and Love the Data Scientists

So how much data engineering should a Data Scientist know?

For a Data Scientist to get to the fun part of their job, they normally have to do a bit of data engineering first, such as on-boarding data or doing a little bit of “wrangling”, before they reach the data science itself. In most cases, this preparatory work is 50–80% of the job.

Then comes handing it over to the Data Engineering team to put it into production (via dev, test, and QA, of course). This is when a “little bit” of contention happens: because Data Scientists and Data Engineers develop code and models differently, the Data Engineering team will, in most cases, have to do “some” modification/re-write/head-shaking/hand-wringing to make the code production ready and meet the SLAs defined by the business.

In this talk, Stephen will take the Data Scientist on a journey: from on-boarding data, and how different data/object stores can help; to understanding and choosing the right data format for the data assets; exploring some different query engines, and some basic query tuning for each; explaining how a distributed streaming platform works, and how you can take advantage of it; and lastly covering some good coding practices. This will give the Data Scientist new skills to help them be more productive, so that they can get to the fun part faster.

Plus, it should reduce the contention with the Data Engineering team, and make them say, “How I Learned to Stop Worrying and Love the Data Scientists”!

The topics covered include:

On-Boarding Data

  • Load into Data/Object Stores
  • Load into Memory
  • Partition Strategies

Data Formats

  • Text
  • Avro
  • Parquet
  • ORC
  • Schema Evolution

Query Engines

  • Initial “Create Table”
  • Hive
  • Impala
  • Presto
  • Spark SQL
  • Explain Plan on SQL / SQL Tuning

Distributed Message Bus/Streaming Platform

  • Stream Processing
  • Partition Strategies

Good Coding Practices

  • Source Control
  • Unit Tests
  • Continuous Integration
  • Catching Errors
  • Alerting & Monitoring