Posts Tagged ‘Technical’
In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and discuss solutions for overcoming them.
For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.
In this post, we will discuss how dealing with small files is different if you are using MapR-FS rather than the traditional HDFS installation.
This past August was the first JupyterCon—an O’Reilly-sponsored conference around the Jupyter ecosystem, held in NYC. In this post we look at the major themes from the conference, and some top talks from each theme.
In this tutorial, we will walk you through some of the basics of using Kafka and Spark to ingest data.
Earlier this year, YCombinator-backed startup DeepGram hosted a deep learning hackathon. This post describes the winning project.
We summarize the objectives and contents of our PyCon tutorial, and then provide instructions for following along so you can begin developing your own EDA skills.
In this post, we’ll start to develop an intuition for how to approach the remaining useful life (RUL) estimation problem and take the first steps in modeling RUL.
In this post, we will cover some of the basics of monitoring and alerting as it relates to data pipelines in general, and Kafka and Spark in particular.
We are seeing evidence of an important pattern: the creation of internal service platforms to meet the data science and analytic needs of organizations.
Interested in how AI is being applied out in the real world? Check out these stories, ranging from fighting food insecurity, to a very low-level version of a butler.
In this post we provide a framework for choosing a data format, and provide some example use cases.
If you are on the path to being a data-driven company, you have to be on the path to being a development-enabled company.
In this post we’ll look at some real world examples of managing headaches while moving to Hadoop.
A quick overview of the motivation behind our instant and repeatable data platform tool.
In this post, we’ll cut through some of the ambiguity around IoT applications, and introduce an example data science problem relevant to the IoT world.
The DeepGramAI Hackathon has concluded; check out the project that Data Engineer Matthew Rubashkin worked on.
In this post, we’ll walk you through how to use tuning to make your Spark/Kafka pipelines more manageable.
In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and an accompanying GitHub repository. The software we’re using is a mix of borrowed and inspired code from existing open source projects.
Deploying a model without a rigorous process in place has consequences. We go over techniques for successful deployment and management.
In this post, we’ll be talking through a few tools that help make data science teams more productive.
This article reviews the main options for free speech recognition toolkits that use traditional HMM and n-gram language models.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.
One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer this analysis, along with explanations of the various criteria upon which we based our decisions.
In this post, Matt talks about using TensorFlow to detect true and false positives in our Caltrain work.
Being data-driven means breaking down silos within organizations, promoting communication, and being deliberate about the data you collect and use. Here are five articles that illustrate how modern organizations are tackling this challenge.
A basic mantra in statistics and data science is “correlation is not causation”: just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.
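The lesson is easy to demonstrate: any two quantities that each happen to trend over time will correlate strongly even with no causal link between them. A minimal sketch in plain Python (the series names and values here are invented purely for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two fictional yearly series that both simply trend upward:
ice_cream_sales = [100, 110, 125, 130, 150, 160]
shark_sightings = [12, 14, 13, 16, 18, 19]

r = pearson(ice_cream_sales, shark_sightings)
print(round(r, 2))  # high (about 0.94) -- yet neither causes the other
```

A shared driver (here, the passage of time; in the classic example, summer weather) is enough to produce the correlation.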
Here we share some further thoughts on imbalanced classes, and offer more resources.
In early December we hosted a meetup, featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.
Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.
In this post, we discuss our Raspberry Pi streaming video analysis software, which we use to better predict Caltrain delays.
We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.
We present some best practices that we implemented after working with the Notebook—and that might help your data science teams as well.
In this post, we use a Jupyter Notebook to go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.
In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible: image processing, video analysis, and image recognition.
We detail insights learned while attending the recent Predix Transform conference.
This post gives insight and concrete advice on how to tackle imbalanced data.
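One of the simplest tactics in the imbalanced-data toolbox is random oversampling of the minority class. Here is a toy sketch in plain Python (the labels and counts are invented for illustration; in real work you would reach for a library such as scikit-learn or imbalanced-learn):

```python
import random

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        resampled = group + [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(resampled)
        out_labels.extend([y] * target)
    return out_samples, out_labels

# 9 negatives vs. 2 positives -> balanced to 9 of each
X = list(range(11))
y = [0] * 9 + [1] * 2
Xb, yb = oversample(X, y)
print(yb.count(0), yb.count(1))  # 9 9
```

Oversampling is only a starting point: it can encourage overfitting to the duplicated minority examples, which is why the post also weighs alternatives.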
On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.
This post will show architects and developers how to set up Hadoop to communicate with S3, run Hadoop commands directly against S3, use distcp to transfer data between Hadoop and S3, and use distcp on a regular basis to update based only on differences.
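As a taste of the shape of those commands, here is a sketch using the s3a filesystem connector (the bucket name, paths, and credentials are placeholders; the real credentials normally belong in core-site.xml rather than on the command line):

```shell
# List a bucket directly with Hadoop's filesystem shell
hadoop fs -Dfs.s3a.access.key=YOUR_KEY -Dfs.s3a.secret.key=YOUR_SECRET \
    -ls s3a://my-bucket/data/

# Copy a directory from HDFS to S3 with distcp
hadoop distcp hdfs:///warehouse/events s3a://my-bucket/backup/events

# On later runs, -update transfers only files that changed
hadoop distcp -update hdfs:///warehouse/events s3a://my-bucket/backup/events
```

These commands require a Hadoop installation with the S3 connector on the classpath; the post walks through that setup.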
This post gives you a quick overview of the new structured streaming feature in Spark 2.0, illustrating why it’s an exciting addition.
A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, we explore their project: streaming and visualizing EEG data.
In this post, we cover what’s needed to understand user activity, and we look at some pipeline architectures that support this analysis.
This post walks you through a simple failure recovery mechanism, as well as a test harness that allows you to make sure this mechanism works as expected.
In this post we share some links to interesting work being done with social media data.
In this screencast, Principal Engineer and Cassandra committer Gary Dusbabek provides an overview of Materialized Views.
In this post, we’re going to go over the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.
On May 6th, SVDS hosted an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.
On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.
Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.
In this post, Richard walks you through a demo based on the Meetup.com streaming API to illustrate how to predict demand in order to adjust resource allocation.
Here are some links from around the internet to get you in a Strata state of mind.
There is little limit to what can be done with notebooks. Beyond the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.
It’s easy to become overwhelmed when it comes time to choose a data format. In this post Silvia gives you a framework for approaching this choice, and provides some example use cases.
We know what it’s like to deal with complex production deployments that cover the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post she gives an overview of obstacles she’s faced (you may be able to relate) and discusses solutions for overcoming them.
The Ethereum network is a distributed economy like Bitcoin, except it is much, much more powerful. Rick Seeger dives into why you should be paying attention to its popularity.
Andrew gives you a deep dive into pivoting data with SparkSQL. This piece was originally posted on the Databricks blog.
Check out the slides from our recent presentations at Data Day TX and Graph Day.
Andrew Ray, Senior Data Engineer, contributed to the most recent release of Spark. This post gives examples of how to use his pivot commit in PySpark.
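The pivot operation turns the distinct values of one column into columns of their own; in PySpark it reads roughly as `df.groupBy("year").pivot("quarter").sum("amount")`. The reshaping itself can be pictured with a plain-Python sketch (the column names and rows below are invented for illustration, not taken from the post):

```python
from collections import defaultdict

def pivot(rows, index, columns, values):
    """Group rows by `index`, spread distinct values of `columns` into
    new columns, and sum `values` in each cell -- the shape of a pivot."""
    table = defaultdict(lambda: defaultdict(int))
    col_names = set()
    for row in rows:
        table[row[index]][row[columns]] += row[values]
        col_names.add(row[columns])
    return {idx: {c: cells.get(c, 0) for c in sorted(col_names)}
            for idx, cells in table.items()}

sales = [
    {"year": 2015, "quarter": "Q1", "amount": 10},
    {"year": 2015, "quarter": "Q2", "amount": 20},
    {"year": 2016, "quarter": "Q1", "amount": 30},
]
print(pivot(sales, "year", "quarter", "amount"))
# {2015: {'Q1': 10, 'Q2': 20}, 2016: {'Q1': 30, 'Q2': 0}}
```

The Spark version does the same reshaping, but distributed and with the full range of aggregate functions; see the post for the real API.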
A previous blog post made the point that classification accuracy shouldn’t be used as a performance metric for classifiers. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.
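The core problem with accuracy is easy to reproduce: on a skewed dataset, a classifier that has learned nothing can still post an impressive score. A quick sketch (the class counts are invented for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 990 negatives and 10 positives -- think fraud or rare-event detection
y_true = [0] * 990 + [1] * 10

# A "classifier" that always predicts the majority class
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99 -- yet it catches zero positives
```

That 99% score is exactly why evaluation needs to move to the techniques this series goes on to cover.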
Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.
While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole. This post provides hints to make the transition easier.
Today, the currency supply supported by the Bitcoin blockchain is worth four billion dollars. So, what have we learned? There are five essential properties any good blockchain must have.
We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.
Microservices are a popular topic in developer circles, because they are a means of solving problems that have plagued monolithic software projects for decades: namely, tardiness and bugs, both caused by complexity.
Since the blockchain is both easily accessible and immutable, it is incredibly useful for other purposes. Issuing a tiny fraction of a Bitcoin (called dust) with embedded data allows anyone to easily store data permanently and publicly.
When it comes to data strategy, many people read the “data” part and automatically dump the topic in the “technical” bucket.
Understanding a major problem set and then synthesizing that information into actionable design tasks is the key to turning data into useful visualizations.
This event’s strongly recurring themes included the Pandas library and a fascinating host of wetware issues woven into otherwise technical sessions.