Ben Everson

Ben has a background in computational biophysics, where he leveraged complex bioinformatic data to build functional proteins from scratch. He is excited about applying advanced mathematical modeling and machine learning techniques to test hypotheses and deliver elegant solutions to the most difficult problems.

Recent Posts

Analyzing Sentiment in Caltrain Tweets

As a first step to using Twitter activity as one of the data sources for train prediction, we start with a simple question: How do Twitter users currently feel about Caltrain?

Getting Started with Predictive Maintenance Models

In this post, we’ll start to develop an intuition for how to approach the remaining useful life (RUL) estimation problem and take the first steps in modeling RUL.

Predictive Maintenance for IoT

In this post, we’ll cut through some of the ambiguity around IoT applications, and introduce an example data science problem relevant to the IoT world.

Driving Product Engagement with User Behavior Analytics

In this post, we will look at driving product engagement with behavioral data, as well as building an integrated analytical environment.

Data-Driven User Engagement

The promise of data and analytics for product companies is that they can help you understand usage, and improve your ability to build, deploy, and service products to customers much more accurately and efficiently. In this post, we look at understanding the customer life cycle.

Jeffrey Yau

An expert in quantitative modeling, Jeffrey brings over 17 years of experience applying econometric, statistical, and mathematical modeling techniques to real-world challenges. He brings a strong background in finance as well as a passion for leading data science teams in finding innovative solutions to challenging business problems.

Past Events

2017

  • Strata + Hadoop World CA 2017

    San Jose, CA

    Many of us will be at Strata in San Jose, and we’d love to see you there! Come learn more about data platforms, data strategy, business tools, and more.

2016

  • TDWI San Diego 2016

    San Diego, CA

    CTO John Akred, VP of Data Science Jeffrey Yau, and Senior Data Scientist Cindi Thompson will teach a three-hour tutorial in which they will share our methods and observations from three years of effectively deploying data science in enterprise organizations. Attendees will learn how to be an effective member or manager of a data science team.

  • Enterprise Dataversity 2016

    Chicago, IL

    Several of us will be in Chicago this year, presenting tutorials on data strategy, data platforms, and how to manage data science in the enterprise. CTO John Akred will also be taking part in a panel about how to strengthen your data strategy skills.

  • PyData San Francisco 2016

    San Francisco, CA

    We’ll be at PyData, looking to learn more about how data scientists are using Python. Have a cool story, or questions of your own? Be sure to come find us.

  • DataEDGE 2016

    Berkeley, CA

    VP of Data Science Jeffrey Yau, along with Data Scientists Chloe Mawer and Daniel Margala, will be presenting on predicting train delays. See more about our train work here.

2015

  • Data Dialogs Conference

    Berkeley, CA

    Come hear from practitioners and scholars who are designing new data science tools and applications and using data in meaningful and beneficial ways, including CTO John Akred and VP of Data Science Jeffrey Yau.

  • DataPalooza

    San Francisco, CA

    Join us for a demo and talk on our Caltrain Rider app, which presents an intuitive view of the Caltrain systems using data from our own sensors (video, audio) combined with publicly available data from Twitter and the Caltrain API.

Stephen O’Sullivan

A leading expert on big data architecture, infrastructure, and technical operations, Stephen brings over 20 years of experience creating scalable, high-availability data and application solutions. He has deep experience in Hadoop usage and architecture and other cutting-edge open source solutions.

Prior to SVDS, Stephen created and led the next-generation data platform team at Walmart Labs as a Senior Director. He and his team architected and designed the data platform used by all of Walmart’s e-commerce business units. Stephen evaluated, made recommendations and built solutions to address Walmart’s needs in security, high availability, scalability and performance. Prior to Walmart Labs, he spent over three years as Director of System Engineering at LiveOps, running data, reporting, and API teams; three years in data management at Yahoo’s search group; and eleven years as a Senior Engineer at Sun Microsystems.

Stephen is a patent-holder and a sought-after presenter at major industry conferences.

Recent Posts

The Data Platform Puzzle

Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. But that doesn’t mean you have to guess and use your gut.

SVDS at Strata San Jose 2016

Several of our presenters were interviewed at Strata San Jose. If you missed the conference, check out these interviews below to catch up on some of the topics that were on our minds.

The Data Platform Puzzle

Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. This post aims to help.

Data Day and Graph Day Texas Slides

Check out the slides from our recent presentations at Data Day TX and Graph Day.

Storing and Visualizing Time Series with Graphite

Graphite is a tool that does two things rather well: storing numeric time-series data (metric, value, epoch timestamp), and rendering graphs of this data on demand.

Past Events

2017

  • Data Architecture Summit 2017

    Chicago, IL

    The Data Architecture Summit provides in-depth education from leading experts specializing in data architecture. We will be there discussing data platform and data governance. Let us know if you’ll be attending and would like to chat.

  • Strata Data Conference New York 2017

    New York, NY

    The Strata Data Conference is where cutting-edge science and new business fundamentals intersect—and merge. Several of us will be there in September, discussing platforms, strategy, and tools. Let us know if you’ll be attending and would like to chat.

  • Enough Data Engineering for a Data Scientist

    Scotland

    Stephen O’Sullivan, VP of Data Engineering at SVDS, is the featured guest speaker at this event hosted by The Data Lab at the Balmoral Hotel.

  • Spark Summit 2017

    San Francisco

    Join us at our Spark Summit sessions in San Francisco, where we’ll be giving a tutorial on data platforms, as well as sessions on PySpark and Graph Algorithms. Find CTO John Akred, VP of Engineering Stephen O’Sullivan, or Principal Data Engineer and Spark Contributor Andrew Ray to talk more.

  • Strata + Hadoop World London 2017

    London

    Strata + Hadoop World London focuses on how to make data-driven decisions across industries. Several of us will be there in May, discussing platforms, strategy, and tools. Let us know if you’ll be attending and would like to chat.

  • Red Hat Summit 2017

    Boston, MA

    This year we’re teaming up with Red Hat to discuss analytics frameworks, and how you can get the most out of your data.

  • DataEngConf 2017

    San Francisco, CA

    DataEngConf features talks and workshops aimed at bridging the gap between data scientists, data engineers, and data analysts. We’ll be there, giving tips on choosing the right format for your data.

  • Enterprise Data World 2017

    Atlanta, GA

    Enterprise Data World focuses on data-driven business. Several of us will be there this year, talking about data platforms and enterprise data science. Let us know if you’ll be there, or you can sign up to receive our slides.

  • Strata + Hadoop World CA 2017

    San Jose, CA

    Many of us will be at Strata in San Jose, and we’d love to see you there! Come learn more about data platforms, data strategy, business tools, and more.

  • Data Day Texas 2017

    Austin, TX

    We’re heading to Texas in January to talk about data pipelines with Kafka and Spark.

2016

  • Strata + Hadoop World New York 2016

    New York, NY

    The SVDS crew will be in New York this year, talking about data platforms, data strategy, and making the business case for Spark. Come by our talks, or catch us in the hallway track.

  • Enterprise Dataversity 2016

    Chicago, IL

    Several of us will be in Chicago this year, presenting tutorials on data strategy, data platforms, and how to manage data science in the enterprise. CTO John Akred will also be taking part in a panel about how to strengthen your data strategy skills.

  • Data Day Seattle 2016

    Seattle, WA

    Join us as CTO John Akred gives a talk on alternative approaches to valuing data within an organization, and Data Scientist Chloe Mawer demonstrates the power of Jupyter notebooks using a real-world train-detection problem. We’ll also present a tutorial on building data pipelines with Kafka and Spark.

  • Strata + Hadoop World London 2016

    London

    Several of us will be presenting, talking about platforms, strategies, and tools. We’d love to see you there! Join us for our tutorials and sessions, or come ask questions during our Office Hour.

  • Enterprise Data World San Diego 2016

    San Diego, CA

    Several of us will be at Enterprise Data World 2016 in San Diego. We’d love to say hi, and hear your thoughts.

  • Strata + Hadoop World

    San Jose, CA

    Many of us will be at the Strata Conference + Hadoop World 2016 in San Jose, and we’d love to see you there!

  • Data Day Texas

    Austin, TX

    Join CTO John Akred for a talk on Running Agile Data Science Teams, and VP of Engineering Stephen O’Sullivan for a talk on Choosing an HDFS data storage format (Avro vs. Parquet). Principal Engineer Mark Mims will hold Office Hours.

2015

  • DataPalooza

    San Francisco, CA

    Join us for a demo and talk on our Caltrain Rider app, which presents an intuitive view of the Caltrain systems using data from our own sensors (video, audio) combined with publicly available data from Twitter and the Caltrain API.

  • Enterprise Dataversity

    Chicago, IL

    SVDS presents two tutorials: one on Data Strategy and one on building a Data Platform. In addition, Edd Dumbill participates in a panel, “The Data-Driven Organization – New Roles and Relationships.”

  • Strata + Hadoop World NY 2015

    New York, NY

    Several of us will be at the Strata + Hadoop World 2015 Conference in New York in September and we’d love to see you there. Join us for our tutorials and sessions, or come visit us at our booth in the Expo Hall.

  • NoSQL Now!

    San Jose, CA

    What are the essential components of a data platform? SVDS presents a tutorial that will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and real-time analytical workloads.

  • StampedeCon

    St. Louis, MO

    SVDS presents two sessions at StampedeCon: one that examines the benefits of using multiple persistence strategies to build an end-to-end predictive engine; and a look at how to choose an HDFS data storage format: Avro vs. Parquet and more.

  • Hadoop Summit

    San Jose, CA

    SVDS presents two sessions at Hadoop Summit: one that maps the central concepts in Spark to those in the SAS language, including datasets, queries, and machine learning; and a look at how to choose an HDFS data storage format: Avro vs. Parquet and more.

  • Strata + Hadoop World 2015

    San Jose, CA

    Several of us will be presenting and we’d love to see you there. Join us for our tutorials and sessions, or come visit us at our booth in the Expo Hall.

Serena Cheng

Responsible for architecting solutions and driving them through to completion, Serena draws from over a decade of experience in engineering and architecture work, focusing particularly on complex, high-volume, low-latency, enterprise-wide systems. Her previous work spans a variety of industries including financial services, retail, telecommunications, transportation, and gaming.

At SVDS, she has led architecture advisory, agile data science, and agile data engineering projects. Most recently, she has led an effort that built an enterprise-wide distributed data platform for a large retailer that enables use cases across the business, from customer recommendations for loyalty, to meeting regulatory requirements for the bank division. For another global client, she led multiple data science teams driving insights on how products, trends, labels, seasons, release cycles, and other features affect the behavior of customers both online and in stores.

Prior to joining SVDS, Serena was a Senior Manager at Accenture, managing large technical projects with on- and off-shore teams against tight timelines.

Past Events

2015

  • Data Lakes in the Real World: Ask Us Anything

    Online

    Modern data architectures look radically different as we move towards a new idea of data platforms. During this “ask us anything” webinar we will discuss our experiences building new data architectures and take your questions.

Rick Drushal

Rick has extensive experience with a wide variety of technical and execution architectures, including big data platforms, data services and SOA, virtualization, infrastructure, security, integration, distributed computing, and data transformation. He has led pioneering client delivery engagements including the delivery of NetApp’s Next Generation Phone Home data platform utilizing Hadoop, HBase, and Cassandra. He also led the delivery of a multi-tenant, hybrid use case big data platform for a large datacenter co-location provider. That platform was built using Cassandra and Storm to facilitate multiple different use cases in a centralized manner.

Rick has deep experience in the creation of vertical solutions that include everything from cloud-based infrastructure and big data storage techniques to predictive analytics and custom visualization solutions. He has built solutions utilizing tools such as Cassandra, Hadoop, HBase, Hive, Pig, Storm, Solr, Flume, Pentaho, and custom Java development.

Recent Posts

Handling Small Files in MapR-FS

In this post, we will discuss how dealing with small files is different if you are using MapR-FS rather than the traditional HDFS installation.

Past Events

2017

  • Data Dialogues: Data Strategy

    Online

    The Data Strategy track of our webinar series focuses on creating and continuously updating your data strategy. Register now!

Gary Dusbabek

An Apache Cassandra committer and PMC member, Gary specializes in building distributed systems. Recent experience includes creating an open source high-volume metrics processing pipeline and building out several geographically distributed API services in the cloud.

Recent Posts

Data Ingestion with Spark and Kafka

In this tutorial, we will walk you through some of the basics of using Kafka and Spark to ingest data.

The Data Platform Puzzle

Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. But that doesn’t mean you have to guess and use your gut.

Materialized Views with Cassandra

In this screencast, Principal Engineer and Cassandra committer Gary Dusbabek provides an overview of Materialized Views.

The Data Platform Puzzle

Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. This post aims to help.

Evaluating Microservices: Real World Lessons

Microservices are a popular topic in developer circles, because they are a means of solving problems that have plagued monolithic software projects for decades. Namely, tardiness and bugs, both caused by complexity.

Past Events

2017

  • OSCON Texas 2017

    Austin, TX

    OSCON is a long-running conference focused on open source technology and communities. We’ll be there talking about our “push button” infrastructure tool.

2016

  • Enterprise Data World San Diego 2016

    San Diego, CA

    Several of us will be at Enterprise Data World 2016 in San Diego. We’d love to say hi, and hear your thoughts.

  • Strata + Hadoop World

    San Jose, CA

    Many of us will be at the Strata Conference + Hadoop World 2016 in San Jose, and we’d love to see you there!

2015

  • Enterprise Dataversity

    Chicago, IL

    SVDS presents two tutorials: one on Data Strategy and one on building a Data Platform. In addition, Edd Dumbill participates in a panel, “The Data-Driven Organization – New Roles and Relationships.”

  • Strata + Hadoop World NY 2015

    New York, NY

    Several of us will be at the Strata + Hadoop World 2015 Conference in New York in September and we’d love to see you there. Join us for our tutorials and sessions, or come visit us at our booth in the Expo Hall.

  • Cassandra Summit

    Santa Clara, CA

    SVDS presents two sessions at the Cassandra Summit: a look at the migration of our client Allant’s CDI-keying engine from Oracle to Cassandra; and a how-to on using Cassandra as a platform for building a custom distributed system.

  • NoSQL Now!

    San Jose, CA

    What are the essential components of a data platform? SVDS presents a tutorial that will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and real-time analytical workloads.

Tom Fawcett

Tom has over 20 years of experience applying machine learning and data science across five different companies. He is co-author of the highly regarded and top-selling book Data Science for Business (O’Reilly, 2013), which is now used in over 140 universities around the world.

Prior to joining SVDS, as a senior architect at Proofpoint, Tom applied machine learning techniques, including social network analysis and probabilistic inference, to email analysis and filtering. While at Stanford’s Center for the Study of Language and Information, he led a DARPA-sponsored project on Transfer Learning. He has also held senior research scientist positions at HP Labs, NYNEX, and GTE Labs.

Tom holds a Ph.D. in Computer Science (Machine Learning) from the University of Massachusetts, Amherst. He is an action editor of Machine Learning Journal; he also serves on the editorial boards of the journals Data Mining and Knowledge Discovery and Big Data, as well as on the advisory board of the Berkeley Extension Data Science Program.

Recent Posts

Learning from Imbalanced Classes

For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.

Merging Data Science and Business

Business leaders cannot afford to ignore their organization’s data—rather, that data should be used to make informed decisions. In this post, Principal Data Scientist Tom Fawcett and Professor of Data […]

Evaluating Data Science Projects: A Case Study Critique

You should understand whether the right things have been measured and whether the results are suitable for the business problem.

Machine Learning vs. Statistics

We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.

Driving Product Engagement with User Behavior Analytics

In this post, we will look at driving product engagement with behavioral data, as well as building an integrated analytical environment.

Data-Driven User Engagement

The promise of data and analytics for product companies is that they can help you understand usage, and improve your ability to build, deploy, and service products to customers much more accurately and efficiently. In this post, we look at understanding the customer life cycle.

Analyzing Caltrain Delays

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.

Avoiding Common Mistakes with Time Series Analysis

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

Imbalanced Classes FAQ

Here we share some further thoughts on imbalanced classes, and offer more resources.

Learning from Imbalanced Classes

This post gives insight and concrete advice on how to tackle imbalanced data.

Analyzing Caltrain Delays: What We Can Learn

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.

The Basics of Classifier Evaluation: Part 2

A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.

The Basics of Classifier Evaluation: Part 1

If it’s easy, it’s probably wrong.

Avoiding Common Mistakes with Time Series

A basic mantra in statistics and data science is correlation is not causation. This is a lesson worth learning.

Listening to Caltrain: Analyzing Train Whistles with Data Science

As an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain system.

Drew Hardin

With an extensive background in both computer science and statistics, Drew enjoys exploring data and finding insights. He is equally interested in the engineering required to build working, maintainable software solutions.

Recent Posts

Machine Learning vs. Statistics

We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.

Paul Ho

Paul is an experienced engineering leader, with a strong history in Silicon Valley. Bringing a background in extremely large big data clusters, he blends scale and operations with the know-how to deliver business-driven analytics and applications.

Paul is an experienced engineering leader, with a demonstrated track record at many of Silicon Valley’s leading engineering firms. He has extensive experience building end-to-end data platforms for numerous Fortune 100 companies. Prior to joining SVDS, Paul built, stood up, and supported one of the two Hadoop clusters at Walmart.com and Walmart Labs, with up to 1500 machines. He architected a complete Hadoop ecosystem including Hue, HDFS, MapReduce, Hive, HBase, Pig, and Oozie components. He implemented best practices for cluster administration, including MapReduce jobs for fraud detection, investigation of site problems through server logs, and analysis of DoS attack sources. He also built the first recommendation engine for Walmart.com. Prior to Walmart, Paul held engineering positions at Sun Microsystems and Silicon Graphics.

Paul holds a Bachelor of Science in Electrical Engineering and Computer Science from the Illinois Institute of Technology.