Season 2, Episode 9
Data Driven Software
We branch, version, and test our code, but what if we treated data like code? Tim Hunter joins us to discuss the open-source Data-Driven Software (DDS) package and how it leads to immense gains in collaboration and decreased runtime for data scientists at any organization.
Tim Hunter is a senior AI specialist at ABN AMRO Bank. He was an early software engineer at Databricks, has contributed to the Apache Spark MLlib project, and co-created the Koalas, GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed machine learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project. He holds a Ph.D. in Computer Science from UC Berkeley.
Welcome to Data Brew by Databricks with Denny and Brooke. This series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we will interview subject matter experts to dive deeper into these topics. And while we’re at it, we’ll be enjoying our morning brew. My name is Denny Lee, and I’m a developer advocate at Databricks and one of the co-hosts of Data Brew.
And hello everyone. My name is Brooke Wenig, the other co-host of Data Brew and the machine learning practice lead at Databricks. Today I have the pleasure of introducing one of Denny's and my former colleagues, Tim Hunter, to join us on Data Brew. Tim has been working with Apache Spark since version 0.0.2, before it was an Apache project. He's also created and contributed to many open source packages, including Koalas, spark-sklearn, GraphFrames, TensorFrames, Deep Learning Pipelines and, most recently, data-driven software, and I'm sure there are a few others that I've forgotten along the way. But great to have you today, Tim.
Yeah. Nice to see you again, Brooke.
All right. So let’s go ahead and kick it off with, how did you get into the field of machine learning?
Let’s see. So I think the story starts for me in 2008. I was doing a master’s at Stanford, and there was this class, CS229, which now actually is the basis for all the MOOCs in machine learning. The advisor who was teaching this class was Andrew Ng, and back then there was no deep learning; people were barely starting to talk about GPUs. And I would say the king of the day, what everybody was talking about for doing recognition, was things like SIFT, all these handcrafted features for doing machine learning.
And Andrew started his class by explaining all the possibilities of what you could do with AI, and he started with a video of his helicopter project, which was doing incredible aerobatics, purely remote controlled by computer. And I think, when I saw that, I thought, this is it, this is going to be the future of autonomous vehicles. This is going to be the future of everything that we see moving around. And I think this is really what drove me to get started. After that, I was fortunate enough to actually work on this helicopter project under the supervision of Andrew Ng and a few of his students. So yeah, this is really where it got started.
So I know after your master’s, you followed up with a PhD at Berkeley, and you’ve been working in industry since then. What excites you most about the field of machine learning that keeps you in this field?
So to me, it is the, I would say, infinite possibilities. With machine learning, you can apply it to everything; for every topic that you can think about, you can think about the way that ML or AI can be plugged into it, to improve this field and to improve the knowledge of it. And these days, what really excites me is that there is also a feedback loop: it’s not just ML being applied to solve a problem, but often the ML can also bring some new insights about how this problem is being solved. For example, these days, if you want to do some complex physics simulations, it turns out that ML is a pretty good solution, sometimes even beating the state of the art that has been handcrafted by people doing physics-based modeling.
So it really tells us that when you use data, it captures some insight that we, as researchers, have not necessarily thought about in how the world works, and that ML essentially exposes that in a way that we then need to understand ourselves. And having this, I would say, conversation with ML, seeing what it gives us as new insights, not only when it predicts something, but how it does it, this is something that I find really exciting, because I’m very curious, and I love to see not only how we understand the world through equations, but also how, crunching the data that comes out of it, we can discover some new insights about the world.
Yeah, that’s really cool, Tim. So let’s roll back into how you’ve taken that desire to understand the world with ML to all the packages you’ve created. So you’ve created, just like Brooke was calling out in the beginning, open source packages from spark-sklearn to Koalas, but most recently you’ve developed one called data-driven software. Can you share what the project is, and what is it currently trying to solve that’s not being solved? I’m curious.
Yeah. So when you look into the whole ecosystem for doing data processing, one problem that I see coming up over and over again is, when you process your data, how you link and chain all these pieces in a way that is fast and coherent. So for example, when you download some data from the internet, usually you’re not going to just stop at downloading it; after that, you are going to process it and build a whole pipeline on top of it. For example, you need to combine it with other data sources, or you are going to run a machine learning model that you will apply to it. And whenever one of your data sources changes, or whenever you change your code, you need to be able to rerun everything that depends on that.
So DDS, this package, helps developers who write their code in Python by capturing, in a smart way, all the dependencies that you have in your code when you write a model. So let’s say you decide to change a hyperparameter in a model; it will know which pieces of your whole pipeline to rerun and which results to regenerate at the end. So you never have to think about, am I using the latest version? Do I need to rerun this notebook? Because if it needs to be rerun, it will run it for you.
So that’s really cool. So then let’s go into, say, almost the philosophy around data-driven software. In our current software engineering practices, are you observing that we as a community, or as an industry, are focusing on the steps, each single step in the process of processing, versus actually programming statements that describe the data to be matched and the processing required? In other words, are we focusing more on the steps themselves versus understanding and making sense of the data?
Yeah, this is a very good question. I think that when you look into how we do software right now, and especially software around the field of AI and data science, we’re really at the beginning of the journey; we’re barely understanding how to combine code and how to process data in general. And this is not something that comes from me; a lot of people much more senior than I have been saying that about the field. Andrej Karpathy from Tesla has been talking about rebuilding a whole Software 2.0 stack that would be able to incorporate machine learning models inside regular software. And the question I’ve often asked myself is, how can we get started? Where do we go?
And one problem that I see often is that we data scientists and ML researchers spend a lot of time reading and writing data pieces when we write code. So these sound very much like two different paradigms that need to be combined together. Why do we need to think about where things should be stored? We don’t do that when we think about computer code. So the first step that I can think of to help us go on this journey is, how can we think in a coherent fashion about all the data pieces that we extract, statistics, ML models, larger datasets, and how we combine that with the code that does all the transformations.
So I love that data is the artifact with data-driven software. Can you talk about how you get some of the scalability behind this? Like, what are you doing to cache certain steps so you don’t have to redo these computations? How are you actually storing the data that you’re keeping as an artifact?
Yeah. So the key idea of data-driven software is that whenever you write some code, the output of the code should always be the same when you rerun it and the code has not changed; if your code changes, then it’s going to create a different output. And in that sense, it solves the problem of thinking about where the data is coming from and how fresh it should be, because the data can either be raw, the original dataset that you acquired from somewhere, or it can be the result of a transformation. And when you think about it this way, it means that you just need to think about the software that created your data; you don’t need to think about the datasets themselves. And this removes all the questions you may have about: is this stored? Is it the right version? Or is it stored in a different system that is not going to be in sync with something else?
Then you just need to think about your code, and you don’t have any of the questions that you would have in, for example, a data warehouse system, where you need to think about whether you plugged the right versions together, and whether you need to rerun your data pipeline. So essentially it is saying: solving problems at the data level is too hard, the data is too big, so why don’t we solve them at the code level? And whenever you change your code, this is when you know that you should change the data that goes along with the code. You can think of the data as being just a piece that goes along with the code, and that accelerates some of the calculations.
Yeah, I can’t tell you the number of times I’ve seen data scientists think, oh, just this one small tweak to the feature engineering won’t impact anything, let me just check it in without rerunning everything, and boom, everything exploded. It was like they changed imputation from mean to median, or something else that they think is pretty harmless. Or even worse, somebody thinks they’re just adding a comment, but oops, a keystroke slipped. So I definitely see a need for data-driven software for data scientists, to be able to understand the dependencies of each of the steps in their machine learning pipelines. If I’m building a simple scikit-learn model, like a one-hot encoded linear regression, maybe I don’t need data-driven software. But if I’m building much larger, much more complex pipelines, I definitely need it. So I just want to hear from you: what are some of the biggest challenges you see people face when they’re trying to design these large-scale, real-world machine learning pipelines?
So one of the main challenges I see is in the world of data science in a large company. Right now I work at a fairly large company with tens of thousands of employees, and I would say the data science workforce is about 2,000 people. And in such a complex environment, especially when the company, like mine, has been around for more than 200 years, you can imagine that you have a lot of diversity of ideas and code structures, and essentially silos of various data providers. And because of that, when you want to build a model of any complexity, you already need to rely on dozens of subsystems that need to be in sync to give you data that you can assume is of high enough quality.
For example, in this case, since I’m working at a bank: if you have a dataset coming from transactions and you want to link it with customers, you already need to be able to say that the customers you are going to see are the ones that correspond to these transactions, not the customers from a month ago, otherwise you miss a few in the process and have all these sorts of data alignment problems that tend to be pretty difficult.
So being able to break down these barriers between all the silos, not only at the level of the process, but also at the level of the engineering, is something that is still, I would say, a very hard challenge for large companies. And this is why you now see more and more this concept of a feature store coming in, which essentially boils down to the idea that you can build a set of attributes for the various data points that you have. And underneath, what this really corresponds to is building a data warehouse with some pipelines to automate the chaining of the dependencies between all these features.
Now, do we need a completely separate system for doing that? I think this problem is more general than that, and I don’t think we need to focus just on the idea of building features for machine learning, for the simple reason that when you build features, they’re going to be input into a model, and this model is typically going to be trained from other datasets that you may want to input as features themselves. So there are already all these loops and dependencies, and how to capture these dependencies between teams and between processes is really where I see a lot of groups struggling.
Got it. And I know you’re using data-driven software at ABN AMRO. Can you talk a little bit more about its branching capabilities, and how this allows data scientists to experiment without reinventing the wheel, while ensuring any code they write can be checked in safely?
Yeah, so the way we use it is, thanks to the capability I mentioned, that with it we can treat data as if it was code. In particular, that means you can start from a code base that generates some datasets, and then you can fork this code base, change, for example, a mean into a median, change some hyperparameters, and this will create other datasets, other views of the same dataset. And because this is all in its own namespace, like in regular software, if you run it, you’re not going to have an impact on what is happening in the main branch, because it is your code, so it is going to be your data. And when you merge this branch back into, for example, let’s say the production branch, then your code has already been executed.
You have already calculated all the statistics that correspond to this branch. So when you merge it back into the production branch, the stable branch, you have already calculated everything that was on your branch, and these are going to be the same results in the stable branch, so everybody will already be able to reuse all the work that you have done precalculating all these elements, because the code that you are going to merge is the same as the one that you just executed. So because of that, with a system like data-driven software, you do not need to think about how far you are deviating from what everybody else is doing; you can simply fork your code, make all the changes you want, and when you merge them back, people will have the confidence that whatever you have been working on is ready.
It is going to be correct, because it is going to evaluate all the dependencies it has, and when you merge it back, this is going to propagate instantly all the data, or the new artifacts, that you have created to everybody else who sees this new code. So this is why, with a system like DDS, you can really think of data as being simply an add-on to the code that you’re writing. Just like in Git you track the history in a repository, this is the same concept: you assign unique signatures to every piece of data that you create. And as you merge, branch, and fork, the code is going to be the reference, and then the data will automatically be attached to all these pieces.
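The branch-and-merge behavior above can be illustrated with a small sketch. The names here are hypothetical, not the real dds API: artifacts are stored under a signature derived from the code that produced them, so a feature branch and the main branch that end up with identical code point at the same stored artifact, and merging never redoes the branch's work.

```python
import hashlib

store = {}          # signature -> materialized artifact
recomputes = []     # which signatures were actually computed


def materialize(code_text, compute):
    """Store the artifact for this code, reusing it if the signature already exists."""
    sig = hashlib.sha256(code_text.encode()).hexdigest()[:12]
    if sig not in store:
        recomputes.append(sig)
        store[sig] = compute()
    return store[sig]


# On the feature branch: the mean is changed to a median.
branch_code = "stats = median(transactions)"
branch_result = materialize(branch_code, lambda: {"stat": "median"})

# After the merge, main carries exactly the same code, so the artifact
# computed on the branch is reused instantly, with no rerun.
main_result = materialize(branch_code, lambda: {"stat": "median"})
```

Because the signature is a function of the code alone, the branch's experiment and the merged production run resolve to the same artifact, which is the "data as an add-on to the code" idea.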
Okay, that sounds pretty cool. But then, naturally coming from my background, I’m going to switch back to big data, which is... well, is there a massively large impact here? As in, you’ve got all of this code, sure, you’ve got all these artifacts, but considering we’re working with terabytes to petabytes worth of data, what’s the impact here? I’m just curious.
Yes. So the current version, in that sense, comes with a very naive assumption, which is, whenever you create a new dataset, for example, you take your petabyte-sized dataset in your store and you decide to do some filtering: if you change one variable, it is simply going to see a new dataset, which is potentially pretty much the same as what was there before, but it will have no notion of how much change there was compared to other versions. It is not checking what differences you made; it treats every dataset as a new dataset. So the impact is that it may end up rewriting a lot of your tables. If you make a change, it may decide to say, we’re going to reprocess these petabytes, and we’re going to simply allocate new petabytes in your data store for doing that.
And this is why it comes with a graphical feature: if you make a change, before it runs, it allows you to see the graph of all the dependencies and all the datasets that will need to be regenerated. So before you run it, it will already be able to tell you pretty much how much data you’re going to process and write after that. Typically, I would say it is not so much of a problem, because just like in a software project, if you modify a very fundamental component at the bottom, that is going to have an impact on everything. For example, if you modify the way something talks to the database, it’s going to impact all your modules all over the place, because everybody wants to talk to the database.
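The "preview before you run" idea boils down to walking a dependency graph. A minimal sketch, with a hypothetical graph and helper (not the real dds API): given which datasets depend on which, changing one node lets you list everything downstream that would be regenerated, before any data is actually rewritten.

```python
# Each dataset maps to the list of datasets it is built from.
deps = {
    "raw_transactions": [],
    "customers": [],
    "clean_transactions": ["raw_transactions"],
    "customer_join": ["clean_transactions", "customers"],
    "churn_model": ["customer_join"],
}


def affected(changed, graph):
    """Return every dataset that transitively depends on `changed`."""
    out = set()
    frontier = {changed}
    while frontier:
        node = frontier.pop()
        for name, parents in graph.items():
            if node in parents and name not in out:
                out.add(name)
                frontier.add(name)
    return out


# Changing the cleaning step invalidates everything built on top of it.
to_rerun = affected("clean_transactions", deps)
```

This is why a change to a fundamental component at the bottom of the graph can fan out to almost everything, while a change near the top reruns very little.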
This is the same problem; there’s nothing new here. Software engineers have the same problem: when they make changes that fundamentally rearchitect everything, they need to think about how to take another path, creating an alternative version of it, then migrating everybody onto it, and so on. And with a system like DDS, you can very simply do that, in the sense that you can create a new, alternative dataset, and then you can get people slowly migrating to this new one while keeping the other one in existence. And then, when the very large old one is no longer being used and is not a dependency for anybody else, you can safely remove it, and everybody will have migrated to the new one. So that’s how, just like in the software world, you can do migrations of these datasets.
No, that actually makes a ton of sense. Actually, I’m going to pull back one of your call-outs about data warehousing, because what you’re talking about here is very similar to how we would typically migrate a data warehouse anyway, except that from an operational perspective, DDS seems well designed for this purpose, as opposed to the traditional data warehousing world, where it usually required months upon months and often didn’t work anyway, because we never actually understood what all of our dependencies were. So this naturally segues into my next question, which is: how does a Lakehouse architecture fit into all this? Because it seems like you want some of that almost warehousing-like capability, but considering the vast amount of data that you’re talking about, that’s back to the good old-fashioned data lake. So I just wanted your perspective on this.
So a system like the Lakehouse allows you to have a much more atomic view of the changes, and to have a globally consistent view for everybody. So when it is coupled with a system like DDS, essentially what you are doing is committing changes of work; this is really what a data Lakehouse offers on top of a regular warehouse. And this prevents anybody from suffering from corruption because of data in flight. So that makes a really big difference.
No, that makes a ton of sense, Tim. So then, what naturally is happening here is that, because you’ve got all this data in flight, does that now also imply that you’re basically storing all of the data all the time? Or is it more likely that, in reality, you’re just trying different experiments with your data, just like with your machine learning, seeing if they work, and then only keeping the artifacts that are necessary? I’m just curious what the productionization flow would look like for something like this.
Yeah. So I think it goes a bit to the natural tension there already is in the software world between your notebook and your IDE, where you write some code. The notebook is a great place for experimenting, trying a lot of parameters and everything, but usually you don’t want to keep all the experiments that you do. It is more a big scratch pad where you can experiment a lot, and you are usually alone in doing all these experiments. You are making multiple runs, you’re retrying with different parameters, and so on.
Next to that, you have the paradigm, I would say, closer to the ML engineer or the software engineer, which is all the work that you do on code, which is versioned, usually reviewed, and merged into a much larger system where you have more guarantees of stability. And DDS is meant to help you mostly on this part, while making sure that every piece that you depend on when you run inside your notebook, you can assume is going to be the latest, most consistent part, and that you’re not going to use some stale version of your data.
So in that sense, it is going to ease your mind with the confidence that this is the latest version of the data and of the code at the same time. You do not need to think, do I have the latest library to access my dataset? You also know that you’re accessing the latest version of the dataset itself; you don’t need to think about these two as being different. After that, you can use a system like MLflow, for example, to experiment and find the most optimal set of hyperparameters for what you are doing, and once you want to go to production, usually you do not want to rerun all the experiments; you just want one final model, and you want to plug this model into the rest of your system.
And this is where DDS comes back: you tell it what your model is going to be, and then you put it inside the pipeline that everybody depends on. The changes can propagate, and you can consider this experiment done for the moment.
And I guess with the productionization of your DDS pipelines, I’m curious: you’ve already mentioned the potential integration with MLflow, which makes a ton of sense. I’m also curious about the potential integration with Delta, for example. Whether it’s debugging of your data, or time travel, or schema evolution, where does that fit in from the standpoint of productionizing your workflows with DDS?
Yeah. So this is, I would say, a complementary approach to it, because as we discussed before, DDS has no notion of what makes one dataset a derivative of another. Whenever you make a change, from its perspective, the dataset is going to be completely new. Now, if you’re using, say, a Spark pipeline on Delta, then there are a lot of very smart things you can do to not have to rewrite the whole dataset. This part, I would say, is much more of a work in progress, and much more of a research topic, because being able to take a pipeline, find the derivative of another pipeline, and determine exactly what needs to change between two versions of it is not trivial.
You can approach it mostly at the commit level: when the time comes to actually write the data, you can say, okay, this new table has the same columns plus a few others, and then you try to see which ones actually need to be written. Or you can approach it at a much higher level, directly at the plan itself, and then see what the differences really are when you’re going to write, what you really need to add. And I want to say that for this topic, the jury is still a bit out on which one works best, and I think there are different trade-offs depending on the circumstances.
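The commit-level approach can be sketched very simply. This is a toy illustration under assumed table schemas (the column names and the comparison logic here are hypothetical, not part of the dds package): compare the columns of the new table definition with the old one, and only the added or changed columns need to be written.

```python
# Old and new table definitions: column name -> column type.
old_columns = {"id": "bigint", "amount": "double", "date": "date"}
new_columns = {"id": "bigint", "amount": "double", "date": "date",
               "amount_eur": "double"}

# Columns that are new, or whose type changed, must be (re)written;
# everything else can be left untouched on disk.
to_write = {name: dtype for name, dtype in new_columns.items()
            if old_columns.get(name) != dtype}
```

The plan-level approach would instead diff the query plans that produce the two tables, which can catch cases where a column keeps its name and type but is computed differently.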
I would say that, in general, when you work in an enterprise, you tend not to have so many very large tables. You tend to start with a few extremely large tables, and then you have a lot of refined versions of them, which tend to be smaller and smaller and smaller. And when you go into the smaller views of your original dataset, it’s usually not very big; it can be in the gigabytes, in the hundreds of megabytes, and even if you go up to 10 gigabytes or 100 gigabytes, it’s usually cheap enough to simply regenerate everything all the time, given the other constraints you have about being correct, being reliable, and building a whole pipeline around it. So I would say that these two usages are complementary in that respect.
Well, all of this is super fascinating and interesting. One thing that I’m really impressed by, Tim, is your strong background in data engineering and big data. And I’m curious, what skills would you recommend data scientists pick up so they can go off and develop their own packages like this, or think at scale about how to productionize their machine learning pipelines?
So, yeah, this is a very broad question, Brooke, so let me see how I can address it. I would say that even for the most hardcore data scientists, about 80% of your time will be spent on processing data, not on writing models, not on finding the best hyperparameters, not even on inventing some new ways of doing some cool machine learning. It is going to be simply: how do I take my data and put it into this form here? How do I take my round data, make it fit into the square peg of a data pipeline, and what do I need to bend to make a fit between the two? And this is where I think a lot of the basic skills of software engineering can really help.
And especially when it teaches you how to be simple in that respect: using simple data structures, using basic functions, using some static code analysis, all the tools that check your code and check that you’re not reusing some variables, and especially learning not to use the fancy features of your favorite programming languages. For example, in Python or Scala you have properties, decorators, you have five ways to do mixins and traits, and all of that. It is very tempting to use them, but in practice it just makes your life, I would say, as a data scientist much harder when the time comes to understand what you have been doing. That, I would say, is the direct skill. Another one that I think is really helpful for data scientists to pick up from software engineers is that software engineers always have to think about trade-offs: am I spending more time on making something correct?
Am I spending more time on making it faster, or am I spending more time on inventing new ideas? It is always going to be a trade-off between these three points. And software engineers are trained by default to deal with being given impossible tasks, to complete them immediately, and to do them perfectly. Data scientists, I would say, can sometimes learn from them how to trade off all these different aspects, and that is going to make them, I think, very effective in the long run.
That’s excellent advice, especially keeping things simple. I think that’s one of the areas where data scientists really struggle: they want to build these complex, fancy models, they want to be using the latest TensorFlow add-on feature, but simple can actually be better. So I think that’s really good advice. And I just wanted to finish off today by saying thank you so much for joining us on Data Brew, Tim. I always learn a lot from these discussions with you, especially about treating data as an artifact. And if there’s one thing I want everybody in the audience to take away from this session, it is: treat data the same way you treat your code. If you don’t version your code, you’re not going to version your data, but if you version your code, you should definitely start versioning your data, so you don’t reinvent the wheel. So once again, thank you for joining us here on Data Brew, Tim.
Thank you, Brooke. Thank you, Denny. It was a pleasure.