In previous posts, I shared conversations with Travis Oliphant, co-founder of Continuum Analytics, where we talked about new developments in tech and building communities. In this final installment, I talk with Travis about his current project to extend the concept of interoperability between multiple libraries in Python into other programming languages, and the pain points this will address.
We’ve talked about a number of projects that you have done, or are working on now. Do you have anything in mind next?
Yes, I do. To me it is the culmination of my SciPy work. So I’m super excited about it. I feel like with all of the experience gained by going through different aspects of bringing SciPy, NumPy, Numba, and Conda to life, and the community aspect and organization—it is breaking down barriers.
Are you familiar with the buffer protocol? Not many people are in Python, but you know the concept of fixing it twice? That is a general concept of, hey, if you find the problem, don’t just fix the proximate reason, fix the ultimate reason. Fix it twice. So NumPy became a necessity because there was a split with Numarray and Numeric and all these libraries were growing up with these different silos inside of Python itself—that is what drove me to stop trying to get tenure and jump off a cliff into writing NumPy. But I was driven mostly by a sense of duty and love for this community of people I had been interacting with for about eight years at the time.
So I wrote NumPy, but I didn’t just write NumPy. I also worked with the Python world to create a buffer protocol. This is a way for different kinds of structures in Python to talk about data in the same way. NumPy is a common array structure, but we also created a common underlying interface so that even if there are new array objects sometime in the future, they can talk to each other easily without copying data. So we created an array interface: a buffer protocol to cover Python (and also an array interface that is kind of outside of the Python core). All this was done with PEP 3118. I intentionally worked with Guido and folks in the Python community, and got Python core dev rights to do it. I don’t think they took them away yet, but that’s why I was a Python core committer for a while: I wrote this stuff to make it into the Python core. This was intentionally a “fix it twice” thing, and it got things to the point where the Python imaging library could talk to NumPy arrays. It took another six years from the creation of that interface for people to start writing the code that takes advantage of it. Now when people write a plug-in for any data-centric library, they can use memoryviews and the buffer protocol so that NumPy and other array-like structures can just see the data immediately and not have to copy it around. This work broke down barriers in how people interpret data together.
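To make the zero-copy idea concrete, here is a minimal sketch that is not from the interview. It uses only the standard library (`array` as the data producer, `memoryview` as the consumer) so that it runs anywhere; the variable names are illustrative. NumPy arrays export the same PEP 3118 buffer protocol, so the same pattern applies to them.

```python
from array import array

# array.array exposes its memory through the buffer protocol (PEP 3118).
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview is a consumer: it wraps the same memory without copying it.
view = memoryview(samples)

# The view carries a description of the data (type code, item size, shape)
# alongside the raw bytes.
description = (view.format, view.itemsize, view.shape)

# Writing through the view changes the original object, because both
# names refer to a single block of memory.
view[0] = 10.0
first_sample = samples[0]

# Slicing a memoryview yields another view, not a copy.
tail = view[2:]
tail[0] = 30.0
third_sample = samples[2]

# Release the views so the underlying array is free to be resized again.
tail.release()
view.release()
```

The point of the sketch is that the consumer never needs to know it is talking to an `array.array`; it only relies on the shared description of the bytes.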
That buffer protocol is the thing I’m most proud of. You know, I love what happened with NumPy. I’m so happy that it became a useful thing that could benefit a lot of people, but I was also proud of the Python work, even though there were so many things happening that I didn’t feel like I finished it. A few other people actually helped finish it: Antoine Pitrou and Stefan Krah. They work with me now because I was so impressed with how they finished this foundational effort I had started.
The next thing is actually to extend that concept of interoperability between multiple libraries in Python to every language. We want to allow Scala, R, Python, Ruby, Lua, everybody to talk about data in the same way. You can have a pipeline of interoperating function calls, where you don’t have to copy data to different language run-times. If you want to use a library from another language right now, you have to copy data out of whatever you are doing and put it into whatever they are doing, because we don’t have common ways to talk about data in a shared memory segment. Languages have abstracted that away in different ways.
So it is undoing an oversight. Or an issue that wasn’t an issue then but is now. It was a good idea at the time, but subsequent events have turned that idea into a problem. The oversight is encapsulating type (what something is and how it sits in memory) and embedding that implicitly in the language. So somebody wants to say I have a dictionary, or I have an array, or I have a hash table, and there is no uniform notion of that. There is a semantic concept of it, but no shared notion of how it is actually implemented in memory. Even though we may be on the same hardware (you know, I have the same MacBook and I have got applications running and each has a hash table), those applications, one in Scala, one in Python, one in R, have no clue what it means to be a hash table together. What it means is embedded implicitly in the specification of each language. That is fundamentally what creates the siloization. The fix is to create a common data description language for type and shape (a data-shape).
One of the pieces of the buffer protocol was the type system, and we used the struct module’s crude definitions of type (it wasn’t complete). One of the things I am working on next is figuring out, OK, how should array-oriented computing work? It is actually three things. The first is a type system, or data description language: a way to talk about how things are laid out in memory as bytes, whether that is a hash table, an array, or a data frame. What does it mean to be one of those, all the way down to the bits?
You basically need a concrete way, shared across multiple language silos, of describing how things are laid out and represented in memory. Then you have a function infrastructure: a generalized (multiple-dispatch) pipeline infrastructure that is simply a way to stitch together function calls whose arguments are defined by this type system that is shared across languages. And then you have a backing store where all the data lives in shared memory accessible by many run-times. Internally, this concept goes under the name of a memory fabric or data fabric.
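As a small illustration of what a type description "down to the bits" looks like today, the sketch below uses the standard library `struct` format strings that the buffer protocol borrowed. It is not from the interview, and the record layout and variable names are made up for the example. It only shows the existing Python mechanism, not the cross-language data description language discussed here.

```python
import struct

# A format string is a compact description of a memory layout.
# "<idd" means: little-endian, no padding, one 32-bit int, two 64-bit floats.
record_format = "<idd"
record_size = struct.calcsize(record_format)  # 4 + 8 + 8 bytes

# Any producer that follows the description can write the bytes...
packed = struct.pack(record_format, 7, 1.5, -2.25)

# ...and any consumer that knows the description can read them back.
unpacked = struct.unpack(record_format, packed)

# The same idea applies to arrays: a buffer of raw bytes becomes a typed
# sequence once a format is attached to it. "i" is the native C int.
raw = bytearray(struct.pack("@4i", 1, 2, 3, 4))
typed_view = memoryview(raw).cast("i")
as_ints = typed_view.tolist()
typed_view.release()
```

The format strings cover fixed-size scalars and simple records, which is part of why they are described above as crude: they have no vocabulary for things like hash tables or data frames.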
So type, function, and data objects—those are computer science things that can be re-used in many, many ways, and then the array container, which is already in Python (the buffer protocol) and then NumPy itself is just a library with computation on top. But that library of computation on top could also be another client. It could be in Scala, it could be in R, it could be in whatever, but you are using the same fundamental notion of what a data object is, and therefore you can have a system where your Scala code loads this thing from that silo into a shared memory segment and then I call out to this R library, which just points to that same shared memory segment and does the transformation, and I pull another Python library that does the same thing, and they are all talking to each other without having to copy data back and forth unnecessarily.
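The shared-memory idea can be sketched within Python alone, using the standard library `multiprocessing.shared_memory` module (Python 3.8 or later). This is not the data fabric prototype mentioned in the interview; it is an assumed, simplified stand-in in which two handles, named `producer` and `consumer` for illustration, attach to one named segment and agree on the layout through a shared format code. Here both handles live in one process to keep the example short; in practice the second handle would be opened from a different process or run-time.

```python
import struct
from multiprocessing import shared_memory

ITEM_FORMAT = "d"  # agreed-upon element type: 64-bit float
COUNT = 4
size = struct.calcsize(ITEM_FORMAT) * COUNT

# The first client creates a named shared memory segment and fills it.
producer = shared_memory.SharedMemory(create=True, size=size)
# The OS may round the segment up to a page, so slice to the agreed size.
writer_view = producer.buf[:size].cast(ITEM_FORMAT)
for i in range(COUNT):
    writer_view[i] = i * 1.5

# A second client attaches to the same segment by name. No data is copied;
# it only needs the name and the agreed-upon type description.
consumer = shared_memory.SharedMemory(name=producer.name)
reader_view = consumer.buf[:size].cast(ITEM_FORMAT)
seen_by_consumer = reader_view.tolist()

# A change made by one client is visible to the other immediately.
reader_view[0] = 99.0
seen_by_producer = writer_view[0]

# Views must be released before the segment can be closed and removed.
reader_view.release()
writer_view.release()
consumer.close()
producer.close()
producer.unlink()
```

What this sketch leaves out is exactly the hard part described above: a description of type and shape that every language agrees on, so that the second client does not have to be told the layout out of band.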
Copying data back and forth doesn’t matter if you are talking about kilobytes and even megabytes, but when you start to get gigabytes, terabytes, and petabytes, the speed of light is the limit. You are not going to copy bits faster than a certain amount. And as the data sizes grow, that becomes an impediment so the premature typed encapsulation ends up becoming the siloization root cause. And it is broken down by essentially doing what we did with the buffer protocol in Python—the fix it twice notion in NumPy and the fix it twice in Python, and taking that general principle and applying it across the board. So that is what I’m excited about next.
Sounds like you’ll be solving some headaches.
Yeah. And the value to people would be that you don’t have to redo everything. The client you use to interact with this system then becomes a personal choice. You love Scala and you want to stay with Scala, OK, that’s fine. You can still use a library of functionality from another system. It is easy to take libraries in R or Python or wherever they come up. You are not paying an integration penalty, which you do today. Now people can make individual choices. Someone can say, “I like R better, I’m used to it.” OK, great, stay with R. But you can still benefit from all the innovation happening in whatever language it is happening in.
I suspect there will be lots of opinions around it all too.
Oh, yeah, as you can imagine. I see how it works and I have the history to know how it can work, but I also understand that communicating it, to the point where people interpret it correctly, is hard, because everyone comes to things with their own lens. When I have explained this to some people, they don’t get it because I’m undoing some confirmation bias they already have. It takes me a while to figure out they have that bias. They end up assigning incorrect meaning, meaning that I’m not talking about, because they haven’t had the same lived experience.
It is a foundational idea but it requires a cathedral—some real effort up front that is fundamentally hard to pay for, frankly. Nobody wants to pay for the foundation, but they will love the end result. Once you get it, then it will snowball. And it becomes universal pretty quickly once you hit a tipping point.
That will take five years, to get to that point where you look back and say, oh, isn’t that obvious? NumPy was written in 2005, but it wasn’t until 2009 that everybody thought, oh, it’s obvious. So if I can pull this off by the end of 2017 that would be great, but it will take until 2020 before everyone sees, oh, of course, why didn’t we do this earlier. If you look at what we are doing with data-shape, blaze, numba, odo and pluribus, all of the efforts we have done here are dancing around this fundamental core thing. Data fabric is a recent one with a simple prototype out there. But the real story is just beginning.
Editor’s note: The above has been edited for length and clarity.