Posts Tagged ‘Technical’

Crossing the Development to Production Divide

In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.

Learning from Imbalanced Classes

For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.

Handling Small Files in MapR-FS

In this post, we will discuss how dealing with small files is different if you are using MapR-FS rather than the traditional HDFS installation.

Themes from JupyterCon 2017

This past August was the first JupyterCon—an O’Reilly-sponsored conference around the Jupyter ecosystem, held in NYC. In this post we look at the major themes from the conference, and some top talks from each theme.

Data Ingestion with Spark and Kafka

In this tutorial, we will walk you through some of the basics of using Kafka and Spark to ingest data.

Mind Reading: Using Artificial Neural Nets to Predict Viewed Image Categories From EEG Readings

Earlier this year, YCombinator-backed startup DeepGram hosted a deep learning hackathon. This post describes the winning project.

Exploratory Data Analysis in Python

We summarize the objectives and contents of our PyCon tutorial, and then provide instructions for following along so you can begin developing your own EDA skills.

Getting Started with Predictive Maintenance Models

In this post, we’ll start to develop an intuition for how to approach the remaining useful life (RUL) estimation problem and take the first steps in modeling RUL.

Managing Spark and Kafka Pipelines

In this post, we will cover some of the basics of monitoring and alerting as it relates to data pipelines in general, and Kafka and Spark in particular.

From Data Managers to Platform Providers

We are seeing evidence of an important pattern: the creation of internal service platforms to meet the data science and analytic needs of organizations.

Noteworthy Links: Artificial Intelligence

Interested in how AI is being applied out in the real world? Check out these stories, ranging from fighting food insecurity, to a very low-level version of a butler.

How to Choose a Data Format

In this post we provide a framework for choosing a data format, and provide some example use cases.

Realize the Business Power of Your Data with DevOps

If you are on the path to being a data-driven company, you have to be on the path to being a development-enabled company.

Data Pipelines in Hadoop

In this post we’ll look at some real world examples of managing headaches while moving to Hadoop.

Easily Spinning up Data Platforms

A quick overview of the motivation behind our instant and repeatable data platform tool.

Predictive Maintenance for IoT

In this post, we’ll cut through some of the ambiguity around IoT applications, and introduce an example data science problem relevant to the IoT world.

DeepDream: Accelerating Deep Learning With Hardware

The DeepGramAI Hackathon has concluded; check out the project that Data Engineer Matthew Rubashkin worked on.

Making Spark and Kafka Data Pipelines Manageable with Tuning

In this post, we’ll walk you through how to use tuning to make your Spark/Kafka pipelines more manageable.

TensorFlow RNN Tutorial

In this post, we’ll provide a short tutorial for training an RNN for speech recognition; we’re including code snippets throughout, and an accompanying GitHub repository. The software we’re using is a mix of borrowed and inspired code from existing open source projects.

Models: From the Lab to the Factory

Deploying a model without a rigorous process in place has consequences. We go over techniques for successful deployment and management.

How to Navigate the Jupyter Ecosystem

In this post, we’ll be talking through a few tools that help make data science teams more productive.

Open Source Toolkits for Speech Recognition

This article reviews the main options for free speech recognition toolkits that use traditional HMM and n-gram language models.

Analyzing Caltrain Delays

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.

Getting Started with Deep Learning

One way to give back to the open source community that provides us with tools is to help others evaluate and choose those tools in a way that takes advantage of our experience. We offer this analysis, along with explanations of the various criteria upon which we based our decisions.

TensorFlow Image Recognition on a Raspberry Pi

In this post, Matt talks about using TensorFlow to detect true and false positives in our Caltrain work.

Noteworthy Links: Using Data Creatively

Being data-driven means breaking down silos within organizations, promoting communication, and being deliberate about the data you collect and use. Here are five articles that illustrate how modern organizations are tackling this challenge.

Avoiding Common Mistakes with Time Series Analysis

A basic mantra in statistics and data science is correlation is not causation, meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.

Imbalanced Classes FAQ

Here we share some further thoughts on imbalanced classes, and offer more resources.

Techniques and Technologies: Topology and TensorFlow

In early December we hosted a meetup featuring Dr. Alli Gilmore discussing topological data analysis, and Dr. Andrew Zaldivar covering practical usage of TensorFlow.

Embracing Experimentation at AstroHackWeek 2016

Senior Data Scientist Jonathan Whitmore talks about experimentation and agility, based on his time at the unconference.

Streaming Video Analysis in Python

In this post, we discuss our Raspberry Pi streaming video analysis software, which we use to better predict Caltrain delays.

Noteworthy Links: September 22 2016

We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.

Jupyter Notebook Best Practices for Data Science

We present some best practices that we implemented after working with the Notebook—and that might help your data science teams as well.

Image Processing in Python

In this post, we use a Jupyter Notebook to go over the steps for creating a proof of concept for the image processing piece of our Caltrain work.

Introduction to Trainspotting

In this post we’ll start looking at the nuts and bolts of making our Caltrain work possible: image processing, video analysis, and image recognition.

Predix Transform 2016

We detail insights learned while attending the recent Predix Transform conference.

Learning from Imbalanced Classes

This post gives insight and concrete advice on how to tackle imbalanced data.

Scaling Data Science: Dream Big, Start Medium-ish

On July 13th we welcomed the Open Data Science Conference meetup series to our HQ—our speaker talked about thinking critically about the size of your data.

How I Learned to Stop Worrying and Love Ephemeral Storage

This post shows architects and developers how to set up Hadoop to communicate with S3, run Hadoop commands directly against S3, use distcp to transfer data between Hadoop and S3, and use distcp to run regular updates that copy only the differences.

Structured Streaming in Spark

This post gives you a quick overview of the new structured streaming feature in Spark 2.0, illustrating why it’s an exciting addition.

Brain Monitoring with Kafka, OpenTSDB, and Grafana

A team of our data scientists recently won 2nd place in Confluent’s Kafka Hackathon. In this post, explore their project—streaming EEG data and visualizing it.

Building Pipelines to Understand User Behavior

In this post, we cover what’s needed to understand user activity, and we look at some pipeline architectures that support this analysis.

Kafka Simple Consumer Failure Recovery

This post walks you through a simple failure recovery mechanism, as well as a test harness that allows you to make sure this mechanism works as expected.

Noteworthy Links: Social Media Edition

In this post we share some links to interesting work being done with social media data.

Materialized Views with Cassandra

In this screencast, Principal Engineer and Cassandra committer Gary Dusbabek provides an overview of Materialized Views.

Building Data Systems: What Do You Need?

In this post, we’re going to go over the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.

Noteworthy Links: Hadoop Edition

Hadoop is 10 years old! Check out these related links.

Talking About the Caltrain

On May 6th, SVDS hosted an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on our Caltrain project.

Working Effectively in Data Science Teams

On April 21st, SVDS hosted the WWCode Silicon Valley chapter in our Mountain View office; we gave a talk titled Working Effectively in Data Science Teams.

Jupyter Notebook for Data Science Teams

Data Scientist Jonathan Whitmore has just released a screencast tutorial for Jupyter Notebooks.

Building a Prediction Engine using Spark, Kudu, and Impala

In this post, Richard walks you through a demo based on the Meetup.com streaming API to illustrate how to predict demand in order to adjust resource allocation.

Noteworthy Links: Strata Edition

Here are some links from around the internet to get you in a Strata state of mind.

Why Notebooks Are Super-Charging Data Science

There is little limit to what can be done with a notebook. As well as the data science work you might expect, such as manipulating and graphing data, we’ve used them for sharing work on analytical tasks such as motion detection in video. In this post Edd takes a look at why we’re seeing notebooks everywhere.

Analyzing Caltrain Delays: What We Can Learn

In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.

How to Choose a Data Format

It’s easy to become overwhelmed when it comes time to choose a data format. In this post Silvia gives you a framework for approaching this choice, and provides some example use cases.

Crossing the Development to Production Divide

We know what it’s like to deal with complex production deployments that run the gamut from infrastructure upgrades, to feature deployments, to data migrations, where each step threatens to derail the plan. In this post we’ll give an overview of obstacles we’ve faced (you may be able to relate) and talk about solutions to overcome these obstacles.

Ethereum: Rise of the World Computer

The Ethereum network is a distributed economy like Bitcoin, except it is much, much more powerful. Rick Seeger dives into why you should be paying attention to its popularity.

Reshaping Data with Pivot in Spark

Andrew gives you a deep dive into pivoting data with SparkSQL. This piece was originally posted on the Databricks blog.

Data Day and Graph Day Texas Slides

Check out the slides from our recent presentations at Data Day TX and Graph Day.

Pivoting Data in SparkSQL

Andrew Ray, Senior Data Engineer, contributed to the most recent release of Spark. This post gives examples of how to use his pivot commit in PySpark.

The Basics of Classifier Evaluation: Part 2

A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.

Advanced Spark Meetup Recap

Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.

From Impala to Hive with Love

While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole. This post provides hints to make the transition easier.

5 Things a Blockchain Needs to Succeed

Today, the currency supply supported by the Bitcoin blockchain is worth four billion dollars. So, what have we learned? There are five essential properties any good blockchain must have.

Jupyter Notebook Best Practices for Data Science

We present here some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.

Evaluating Microservices: Real World Lessons

Microservices are a popular topic in developer circles, because they are a means of solving problems that have plagued monolithic software projects for decades. Namely, tardiness and bugs, both caused by complexity.

The Basics of Classifier Evaluation: Part 1

If it’s easy, it’s probably wrong.

Dust in the Blockchain

Since the blockchain is both easily accessible and immutable, it is incredibly useful for other purposes. Issuing a tiny fraction of a Bitcoin (called dust) with embedded data allows anyone to easily store data permanently and publicly.

The Venn Diagram of Data Strategy

When it comes to data strategy, many people read the “data” part and automatically dump the topic in the “technical” bucket.

Getting from Data to Visualization

Understanding a major problem set and then synthesizing that information into actionable design tasks is the key to turning data into useful visualizations.

Thoughts from Euro PyData

This event’s strongly recurring themes included the Pandas library and a fascinating host of wetware issues woven into otherwise technical sessions.

Visualizing the Evolution of Rock Music

Rock ’n’ roll is one of the most popular music genres today, but that wasn’t always the case.