
Crossing the Development to Production Divide
In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.
As a first step to using Twitter activity as one of the data sources for train prediction, we start with a simple question: How do Twitter users currently feel about Caltrain?
For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.
In this interview, Paco Nathan discusses making life more livable, AI fears, and more.
Business leaders cannot afford to ignore their organization’s data—rather, that data should be used to make informed decisions. In this post, Principal Data Scientist Tom Fawcett and Professor of Data Science Foster Provost discuss how businesses can make the most of their analytical teams. Tom and Foster are the authors of Data Science for Business. What aspect […]
The Strata Data Conference is where cutting-edge science and new business fundamentals intersect—and merge. Several of us will be there in September, discussing platforms, strategy, and tools. Let us know if you’ll be attending and would like to chat.
This past August was the first JupyterCon—an O’Reilly-sponsored conference around the Jupyter ecosystem, held in NYC. In this post we look at the major themes from the conference, and some top talks from each theme.
You should understand whether the right things have been measured and whether the results are suitable for the business problem.
The Data Strategy track of our webinar series focuses on creating and continuously updating your data strategy. Register now!
The Data in Practice track focuses on modern techniques for efficient execution of your data strategy. Register now!
We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.
As well as developing familiarity with AI techniques, practitioners must choose their technology platforms wisely.
Stephen O’Sullivan, VP of Data Engineering at SVDS, is the featured guest speaker at this event hosted by The Data Lab at the Balmoral Hotel.
This post explains why chatbots are rising in popularity with banks, the opportunities and challenges presented, and how data science fits into the puzzle.
In this post, we’ll start to develop an intuition for how to approach the remaining useful life (RUL) estimation problem and take the first steps in modeling RUL.
We are seeing evidence of an important pattern: the creation of internal service platforms to meet the data science and analytic needs of organizations.
Interested in how AI is being applied out in the real world? Check out these stories, ranging from fighting food insecurity, to a very low-level version of a butler.
We’re happy to announce that we have produced Developing a Modern Enterprise Data Strategy as a video product, available from O’Reilly Media and Safari Books Online.
In a previous post on data maturity, we discussed a company that was just embarking on a transformation: launching a new services business and building data capabilities to support that business. But what if you’re not starting from the beginning?
In this post we look at how to visualize data gaps, and engage senior leadership.
No two situations are the same, but we have found one truism: making a data transformation successful requires much more than simply getting the technology right.
In this interview, we talk about extending the concept of interoperability between multiple libraries in Python into other programming languages, and the pain points this will address.
In this post, we will discuss what “real” gaps in data look like and how to find them in your organization.
A quick overview of the motivation behind our instant and repeatable data platform tool.
In this post, we’ll cut through some of the ambiguity around IoT applications, and introduce an example data science problem relevant to the IoT world.
The DeepGramAI Hackathon has concluded; check out the project that Data Engineer Matthew Rubashkin worked on.
In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and an accompanying GitHub repository. The software we’re using is a mix of borrowed and inspired code from existing open source projects.
We are excited to announce that Edd Wilder-James will be joining Reynold Xin as co-chair of the Spark Summit program for Spark Summit 2017 in San Francisco.
This post looks at four business analysis capabilities that connect the dots between promising applications of data assets for telecommunications companies.
In this interview, Travis talks about how to balance enterprise and open source, as well as what it takes to build a community.
In this post, we will give a high-level overview of what EDA typically entails and then describe three of the major ways EDA is critical to successfully building a model and interpreting its results.
In this post, we’ll be talking through a few tools that help make data science teams more productive.
This article reviews the main options for free speech recognition toolkits that use traditional HMM and n-gram language models.
Travis and I discuss breaking down silos, the importance of effectively communicating about cutting-edge technology, and where Anaconda is going next.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.
One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer this analysis, along with explanations of the various criteria upon which we based our decisions.
In this post, Matt talks about using TensorFlow to detect true and false positives in our Caltrain work.
In this post we explore how data is changing the insurance industry, through the lens of auto insurance underwriting.
A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.
Here we share some further thoughts on imbalanced classes, and offer more resources.
In early December we hosted a meetup, featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.
In this post, CTO John Akred looks at the practical ingredients of managing agile data science.
Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.
Data strategy matters to both business and tech. It’s a problem that sits in the center of a Venn diagram, and if we get stuck thinking of those two domains as existing solely in completely separate silos, we’ll lock ourselves out of that key middle ground where the really important problems get solved.
In this post, Julie talks about the necessary skills for a CDO, as learned through her research for the “Understanding the Chief Data Officer” report.
At the Strata + Hadoop World conference in New York last week, there were an impressive 16 tracks of session talks. A lot of them focused on the tools that everyone is excited about, but I focused on the goals people are using data science to accomplish. Here are a few of the sessions that stood out.
We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.
We present some best practices that we implemented after working with the Notebook—and that might help your data science teams as well.
In this post, we use a Jupyter Notebook to go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.
In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible: image processing, video analysis, and image recognition.
This post gives insight and concrete advice on how to tackle imbalanced data.
On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.
This post will show architects and developers how to set up Hadoop to communicate with S3, use Hadoop commands directly against S3, use distcp to perform transfers between Hadoop and S3, and how distcp can be used to update on a regular basis based only on differences.
A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.
In this post we share some links to interesting work being done with social media data.
VP of Strategy Edd Dumbill was recently interviewed by James Haight on the Hadooponomics podcast. Find the audio and transcript here.
Back in 2014, we discussed what the market looked like on our first birthday. As we hit three years, it seems like an appropriate time to look back on those observations, and see where we are now.
Hadoop is 10 years old! Check out these related links.
On May 6th, SVDS hosted an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.
On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.
We believe there are clearly some compelling value propositions that come from integrating the visibility from the IoT into applications that help understand and manage the state of complex systems. With the internet of things, the more things, really, the merrier.
Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.
I was always struck by how the Silicon Valley startups I worked with could do so much more, with so much less. I’ve come to learn, sometimes the hard way, that there are critical elements of the “who” and the “how,” particular to those start-up teams, that contribute to their success. It’s why we named our company for Silicon Valley: a lightweight, agile approach to data-driven product development was pioneered here.
Several of our presenters were interviewed at Strata San Jose. If you missed the conference, check out these interviews below to catch up on some of the topics that were on our minds.
There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.
We know what it’s like to deal with complex production deployments that cover the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.
A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.
Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.
Data products are the reason data scientists are lately treated like rockstars. Along the way at SVDS, we’ve learned a few things about data products, which we shared as we told the story of the Caltrain Rider app.
We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.
We’ll walk through the steps for competing in Kaggle’s “Digit Recognizer” contest using SQL-based machine learning tools to identify hand-written digits.
One might reasonably judge how well Congress reflects the views of the citizenry by examining the proportion of citizens who think Congress is doing a good job.
A basic mantra in statistics and data science is correlation is not causation. This is a lesson worth learning.
As an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain system.
Data Scientist Tatsiana Maskalevich and CTO John Akred presented at this year’s Hadoop Summit in San Jose.
Silicon Valley Data Science has designed a new method to create a data strategy to overcome limitations of conventional approaches.
A key element of Silicon Valley software product delivery teams is an agile delivery methodology—oil to the water of firm fixed price contracts.
Graphite is a tool that does two things rather well: storing numeric time-series data (metric, value, epoch timestamp), and rendering graphs of this data on demand.
When making decisions with data, the idea that things will “even out” may ring true, but it’s not always helpful.