Posts Tagged ‘Data science’

Crossing the Development to Production Divide

In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.

Analyzing Sentiment in Caltrain Tweets

As a first step to using Twitter activity as one of the data sources for train prediction, we start with a simple question: How do Twitter users currently feel about Caltrain?

Learning from Imbalanced Classes

For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.

Exploring the Possibilities of Artificial Intelligence

In this interview, Paco Nathan discusses making life more livable, AI fears, and more.

Merging Data Science and Business

Business leaders cannot afford to ignore their organization’s data—rather, that data should be used to make informed decisions. In this post, Principal Data Scientist Tom Fawcett and Professor of Data Science Foster Provost discuss how businesses can make the most of their analytical teams. Tom and Foster are the authors of Data Science for Business. What aspect […]

Strata Data Conference New York 2017

The Strata Data Conference is where cutting-edge science and new business fundamentals intersect—and merge. Several of us will be there in September, discussing platforms, strategy, and tools. Let us know if you’ll be attending and would like to chat.

Themes from JupyterCon 2017

This past August was the first JupyterCon—an O’Reilly-sponsored conference around the Jupyter ecosystem, held in NYC. In this post we look at the major themes from the conference, and some top talks from each theme.

Evaluating Data Science Projects: A Case Study Critique

You should understand whether the right things have been measured and whether the results are suitable for the business problem.

Data Dialogues: Data Strategy

The Data Strategy track of our webinar series focuses on creating and continuously updating your data strategy. Register now!

Data Dialogues: Data in Practice

The Data in Practice track focuses on modern techniques for efficient execution of your data strategy. Register now!

Machine Learning vs. Statistics

We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.

Understanding AI Toolkits

As well as developing familiarity with AI techniques, practitioners must choose their technology platforms wisely.

Enough Data Engineering for a Data Scientist

Stephen O’Sullivan, VP of Data Engineering at SVDS, is the featured guest speaker at this event hosted by The Data Lab at the Balmoral Hotel.

Chatbots in Banking

This post explains why chatbots are rising in popularity with banks, the opportunities and challenges presented, and how data science fits into the puzzle.

Getting Started with Predictive Maintenance Models

In this post, we’ll start to develop an intuition for how to approach the remaining useful life (RUL) estimation problem and take the first steps in modeling RUL.

From Data Managers to Platform Providers

We are seeing evidence of an important pattern: the creation of internal service platforms to meet the data science and analytic needs of organizations.

Noteworthy Links: Artificial Intelligence

Interested in how AI is being applied out in the real world? Check out these stories, ranging from fighting food insecurity to a very low-level version of a butler.

SVDS Data Strategy: New Video Available

We’re happy to announce that we have produced Developing a Modern Enterprise Data Strategy as a video product, available from O’Reilly Media and Safari Books Online.

How Mature Are Your Data Capabilities?

In a previous post on data maturity, we discussed a company that was just embarking on a transformation: launching a new services business and building data capabilities to support that business. But what if you’re not starting from the beginning?

Minding Your Data Gaps

In this post we look at how to visualize data gaps, and engage senior leadership.

Understanding Your Data Maturity

No two situations are the same, but we have found one truism: making a data transformation successful requires much more than simply getting the technology right.

Building Connections

In this interview, we talk about extending the concept of interoperability between multiple libraries in Python into other programming languages, and the pain points this will address.

Is Your Data Holding You Back?

In this post, we will discuss what “real” gaps in data look like and how to find them in your organization.

Easily Spinning up Data Platforms

A quick overview of the motivation behind our instant and repeatable data platform tool.

Predictive Maintenance for IoT

In this post, we’ll cut through some of the ambiguity around IoT applications, and introduce an example data science problem relevant to the IoT world.

DeepDream: Accelerating Deep Learning With Hardware

The DeepGramAI Hackathon has concluded; check out the project that Data Engineer Matthew Rubashkin worked on.

TensorFlow RNN Tutorial

In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and an accompanying GitHub repository. The software we’re using is a mix of borrowed and inspired code from existing open source projects.

Spark Summit: Ignition in the Enterprise

We are excited to announce that Edd Wilder-James will be joining Reynold Xin as co-chair of the Spark Summit program for Spark Summit 2017 in San Francisco.

Four Data Capabilities for Telecommunications

This post looks at four business analysis capabilities that connect the dots between promising applications of data assets for telecommunications companies.

Building Tech Communities

In this interview, Travis talks about how to balance enterprise and open source, as well as what it takes to build a community.

The Value of Exploratory Data Analysis

In this post, we will give a high-level overview of what EDA typically entails and then describe three of the major ways EDA is critical to successfully building a model and interpreting its results.

How to Navigate the Jupyter Ecosystem

In this post, we’ll be talking through a few tools that help make data science teams more productive.

Open Source Toolkits for Speech Recognition

This article reviews the main options for free speech recognition toolkits that use traditional HMM and n-gram language models.

Breaking Down Communication Barriers in Tech

Travis and I discuss breaking down silos, the importance of effectively communicating about cutting-edge technology, and where Anaconda is going next.

Analyzing Caltrain Delays

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.

Getting Started with Deep Learning

One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer this analysis, along with explanations of the various criteria upon which we based our decisions.

TensorFlow Image Recognition on a Raspberry Pi

In this post, Matt talks about using TensorFlow to detect true and false positives in our Caltrain work.

Data Opportunities in Insurance

In this post we explore how data is changing the insurance industry, through the lens of auto insurance underwriting.

Avoiding Common Mistakes with Time Series Analysis

A basic mantra in statistics and data science is that correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

Imbalanced Classes FAQ

Here we share some further thoughts on imbalanced classes, and offer more resources.

Techniques and Technologies: Topology and TensorFlow

In early December we hosted a meetup, featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.

Agile Data Science Teams Deliver Real World Results

In this post, CTO John Akred looks at the practical ingredients of managing agile data science.

Embracing Experimentation at AstroHackWeek 2016

Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.

The Venn Diagram of Data Strategy

Data strategy matters to both business and tech. It’s a problem that sits in the center of a Venn diagram, and if we get stuck thinking of those two domains as existing solely in completely separate silos, we’ll lock ourselves out of that key middle ground where the really important problems get solved.

The One Key Skill of the CDO

In this post, Julie talks about the necessary skills for a CDO, as learned through her research for the “Understanding the Chief Data Officer” report.

With Data, Ask “What” Before “How”

At the Strata + Hadoop World conference in New York last week, there were an impressive 16 tracks of session talks. A lot of them focused on the tools that everyone is excited about, but I focused on the goals people are using data science to accomplish. Here are a few of the sessions that stood out.

Noteworthy Links: September 22 2016

We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.

Jupyter Notebook Best Practices for Data Science

We present some best practices that we implemented after working with the Notebook—and that might help your data science teams as well.

Image Processing in Python

In this post, we use a Jupyter Notebook to go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.

Introduction to Trainspotting

In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible: image processing, video analysis, and image recognition.

Learning from Imbalanced Classes

This post gives insight and concrete advice on how to tackle imbalanced data.

Scaling Data Science: Dream Big, Start Medium-ish

On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.

How I Learned to Stop Worrying and Love Ephemeral Storage

This post will show architects and developers how to set up Hadoop to communicate with S3, use Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and use distcp to run regular updates based only on differences.

Brain Monitoring with Kafka, OpenTSDB, and Grafana

A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.

Noteworthy Links: Social Media Edition

In this post we share some links to interesting work being done with social media data.

Hadooponomics Interview: The Evolution of Data

VP of Strategy Edd Dumbill was recently interviewed by James Haight on the Hadooponomics podcast. Find the audio and transcript here.

One Year Later, Observations on the Big Data Market

Back in 2014, we discussed what the market looked like on our first birthday. As we hit three years, it seems like an appropriate time to look back on those observations, and see where we are now.

Noteworthy Links: Hadoop Edition

Hadoop is 10 years old! Check out these related links.

Talking About the Caltrain

On May 6th, SVDS hosted an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.

Working Effectively in Data Science Teams

On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.

IoT and Resilient Systems

We believe there are clearly some compelling value propositions that come from integrating the visibility from the IoT into applications that help understand and manage the state of complex systems. With the internet of things, the more things, really, the merrier.

Jupyter Notebook for Data Science Teams

Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.

Successful Data Teams are Agile and Cross-Functional

I was always struck by how the Silicon Valley startups I worked with could do so much more, with so much less. I’ve come to learn, sometimes the hard way, that there are critical elements of the “who” and the “how,” particular to those startup teams, that contribute to their success. It’s why we named our company for Silicon Valley: a lightweight, agile approach to data-driven product development was pioneered here.

SVDS at Strata San Jose 2016

Several of our presenters were interviewed at Strata San Jose. If you missed the conference, check out these interviews below to catch up on some of the topics that were on our minds.

Why Notebooks Are Super-Charging Data Science

There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.

Analyzing Caltrain Delays: What We Can Learn

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.

Crossing the Development to Production Divide

We know what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.

The Basics of Classifier Evaluation: Part 2

A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.

Advanced Spark Meetup Recap

Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.

How Do You Build a Data Product?

Data products are the reason data scientists are lately treated like rockstars. Along the way at SVDS, we’ve learned a few things about data products, which we shared as we told the story of the Caltrain Rider app.

Jupyter Notebook Best Practices for Data Science

We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.

Zero to Kaggle in 30 Minutes

We’ll walk through the steps for competing in Kaggle’s “Digit Recognizer” contest using SQL-based machine learning tools to identify hand-written digits.

Better Know the Districts

One might reasonably judge how well Congress reflects the views of the citizenry by examining the proportion of those citizens who think Congress is doing a good job.

Avoiding Common Mistakes with Time Series

A basic mantra in statistics and data science is that correlation is not causation. This is a lesson worth learning.

Listening to Caltrain: Analyzing Train Whistles with Data Science

As an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain system.

Railroad Modeling at Hadoop Scale

Data Scientist Tatsiana Maskalevich and CTO John Akred presented at this year’s Hadoop Summit in San Jose.

Data Strategy in a World of Big Data

Silicon Valley Data Science has designed a new method to create a data strategy to overcome limitations of conventional approaches.

Successful Data Teams are Agile and Cross-Functional

A key element of Silicon Valley software product delivery teams is an agile delivery methodology—oil to the water of firm fixed price contracts.

Storing and Visualizing Time Series with Graphite

Graphite is a tool that does two things rather well: storing numeric time-series data (metric, value, epoch timestamp), and rendering graphs of this data on demand.

When Fair Isn’t Predictable: The Law of Averages

When making decisions with data, the idea that things will “even out” may ring true, but it’s not always helpful.