Archive for the ‘Tools’ Category

Reshaping Data with Pivot in Spark

Andrew gives you a deep dive into pivoting data with SparkSQL. This piece was originally posted on the Databricks blog.

Data Day and Graph Day Texas Slides

Check out the slides from our recent presentations at Data Day TX and Graph Day.

Space Shuttle Problems: Long-term Planning Amid Changing Technology

How can you manage your implementation so that you take full advantage of technology innovation as you go, rather than freezing your view of technology at today’s state and designing something that will be outdated when it launches? Start by deciding which pieces are necessary now and which can wait.

Pivoting Data in SparkSQL

Andrew Ray, Senior Data Engineer, contributed to the most recent release of Spark. This post gives examples of how to use his pivot commit in PySpark.

Advanced Spark Meetup Recap

Our audience of engineers got right into the guts of Spark’s GraySort benchmark win last year with Chris Fregly from IBM Spark Technology Center. Here are a few highlights from the meetup.

From Impala to Hive with Love

While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole. This post provides hints to make the transition easier.

Develop Spark Apps on YARN Using Docker

To avoid getting bitten at deploy time by the idiosyncrasies of running Spark on YARN vs. standalone, here’s a way to set up a Spark development environment that more closely mimics how it’s used in the wild.

Jupyter Notebook Best Practices for Data Science

Here we present some best practices that SVDS has implemented after working with the Jupyter Notebook in teams and with our clients.

Use Cases for Apache Spark

The Apache Spark big data processing platform has been making waves in the data world, and for good reason.