Scaling Data Science: Dream Big, Start Medium-ish

Meetup Recap | August 11th, 2016

On July 13th we welcomed the Open Data Science Conference meetup series to our HQ for the second time. ODSC believes that open source software (OSS) principles can accelerate data science knowledge, and we think pretty highly of OSS here at SVDS. We’ll be at ODSC’s next conference this November in Santa Clara.

How much data do you have?

This meetup featured a talk by Dr. Brian Spiering, of Galvanize¹.

His thesis is that you probably don’t have big data, so don’t invite headaches. What does “big data” mean here, though? Spiering presented a simplified definition: “It doesn’t fit on a single machine.” While clearly an oversimplification (we still remember cluster computing), for a data scientist used to single-machine computing, it’s a simple metric to help you judge if your dataset is starting to tip the scale.

With this definition in mind, we can ask two common questions:

What are the implications for doing data science when you have a lot of data?
What technologies and techniques do you need?

Because these are practical questions and the answers are always evolving, they are difficult to answer without timely and direct experience working on large-scale data platforms. Spiering sought to give our audience some rules of thumb to help them orient themselves among the various “scales” of data science.

He asserted that most data science problems involve small or medium-sized data: something that fits on a large cloud node (perhaps using multiple cores with ipyparallel) and/or can be processed out-of-core (e.g., with something like Dask). We’d also remind you that a better algorithm, if one is available, will beat better hardware.
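To make “processed out-of-core” concrete, here is a minimal sketch using only the Python standard library. It is not Dask's API; it only illustrates the underlying idea of streaming a file in fixed-size chunks so that memory use stays bounded no matter how large the file is. The function name `streaming_mean`, the column name, and the chunk size are assumptions made for illustration.

```python
import csv
from itertools import islice


def streaming_mean(path, column, chunk_size=10_000):
    """Mean of a numeric CSV column, reading chunk_size rows at a time."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # Only one chunk of rows is held in memory at any moment.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    if count == 0:
        raise ValueError("no rows found")
    return total / count
```

A library like Dask applies the same chunk-at-a-time idea behind a pandas-like interface, so you rarely need to write this loop by hand.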

When it comes to using larger amounts of data, put some time into thoughtful infrastructure design and establish good data hygiene. At the beginning, this can be as simple as a Makefile and scripts that can reproducibly collect, clean, preprocess, and otherwise prepare a manageable dataset for modeling. This approach stops working as data gets larger, but it does make the transition to workflow tools like Luigi and Airflow easier.
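The Makefile idea can also be sketched in Python: each step reads one file and writes another, and a step reruns only when its output is missing or older than its input. This illustrates the pattern and is not a substitute for make, Luigi, or Airflow. The helper names (`needs_rebuild`, `clean`, `run_step`) and the `amount` column are hypothetical.

```python
import csv
import os


def needs_rebuild(target, source):
    """Mirror make's rule: rebuild if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def clean(source, target):
    """Example step (hypothetical): drop rows with an empty 'amount' field."""
    with open(source, newline="") as src, open(target, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if (row.get("amount") or "").strip():
                writer.writerow(row)


def run_step(step, source, target):
    """Run step(source, target) only if the target is stale; report whether it ran."""
    if needs_rebuild(target, source):
        step(source, target)
        return True
    return False
```

Calling `run_step(clean, "raw.csv", "clean.csv")` twice in a row does the work only once, which is the same property that makes a Makefile-driven prep pipeline cheap to rerun.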

But what happens when the training set is still too big? This can be a big jump—scikit-learn is likely no longer an option—which is why you should be deliberate in your data usage choices. Do you need three years’ worth of transaction-level purchase data in memory to analyze this season’s style trends? Maybe the answer is yes, but make sure it’s a conscious choice. If you need those larger sets, look to tools like Ibis (from the creator of pandas) and Spark to make your transition to working with larger data sets easier.

The promise of Ibis is that it lets you interact with the Hadoop-scale query engine Impala directly from Python: you build queries and calculations that run on Impala, with fast read/write access and pandas support. Spark is a scalable compute framework with its own library of machine learning algorithms, which means you’ll need to learn a new API and workflow. On the plus side, you’ll be even more capable of tackling whatever problem comes your way.

Have you encountered any issues scaling your data science? Let us know in the comments, or get in touch on Twitter.

What’s next?

ODSC has several conferences in the coming months; we’d love to say hello in San Francisco this November.

¹ Our own Eric White also spoke, and his talk will be covered in a future post.