Season 2, Episode 1
Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks in addition to other aspects of the platform. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
Welcome to season two of Data Brew by Databricks with Denny and Brooke. This season, we’re focusing on machine learning. The series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we will interview subject matter experts to dive deeper into these topics. And while we’re at it, we’ll be enjoying our morning brew. My name is Denny Lee. I’m a developer advocate at Databricks, and one of the co-hosts of Data Brew.
And hello everyone. My name is Brooke Wenig, the other co-host of Data Brew and Machine Learning Practice Lead at Databricks. Today, I have the pleasure to introduce Matei Zaharia. He is the Chief Technologist at Databricks, assistant professor at Stanford and the original creator of Apache Spark and MLflow. Welcome to Data Brew, Matei.
Thanks a lot. I’m really excited to be here.
All right. We have a packed agenda. We have tons of questions for you that I’m sure our audience would love to know the answers to. The first one we’d love to kick off with is, how did you get into the field of machine learning?
Great question. I think I started becoming interested in machine learning when I went to grad school at UC Berkeley. I had been working mostly in computer systems and distributed systems before that, but Berkeley had this unique lab that brought together computer systems and machine learning people. They saw a lot of machine learning begin to happen at scale. They also saw machine learning being used to manage large data centers. And so they had this idea of putting together one of the largest machine learning groups, started by Michael Jordan, with a number of systems faculty and people interested in the boundary between them. So in my cube at Berkeley, when I came in as a grad student, I actually sat next to Percy Liang, who is now a machine learning professor at Stanford, and Lester Mackey, who also got a professor job at Stanford, and all kinds of other people who did machine learning. That’s where I started to learn about it.
And so I know that your focus has been on scale. And so how do you do machine learning at scale, leveraging Spark? And I know in the past few years you’ve transitioned to leading the MLflow team at Databricks. Could you talk a little bit more about what MLflow is and what are some of the key problems or key challenges that MLflow addresses?
MLflow is basically a machine learning platform. So it’s a software platform that lets you manage how you develop and then deploy the machine learning applications. And it’s all about making the development and maintenance process smoother for machine learning. So making it easier to build production applications and then to operate them after. So we found that for most machine learning users out there, the hard part about machine learning is really getting the applications to be production grade and keeping them that way. And often even after teams build the first application, they would find out they have to spend half their time or more just maintaining it and making sure it keeps working.
That’s what we wanted to simplify. So it’s designed so you can use it with any machine learning library and algorithm you want. It doesn’t actually provide the algorithms, but it will do things for you like tracking metrics about your application for experiments, or once it’s running in production, packaging up your model in a way that can be deployed reproducibly in a bunch of places, and also letting you collaborate and share models. So having an environment similar to GitHub where you can review models, you can see changes, you can see all the data that they depend on and how they’ve been doing and you can collaborate on sharing those with the team. So these are the kinds of problems that we’re tackling.
Matei, that sounds super interesting, but then I think this naturally leads us to the next question is, well then how does the cloud basically exacerbate these problems? You’ve mentioned all of these problems that you’ve been seeing and that you’re addressing for whether it’s the customers or the practitioners in general. How does the cloud basically befuddle this or make it a scale problem or so forth?
I think there are a number of different ways. So the first one is actually running the machine learning itself on large scale data. Machine learning is algorithms or systems that learn from data. So obviously if you put in more data, it’s likely that they’ll do better. So there’s that question of scaling it up. And that’s what we tackle with Apache Spark and with all the distributed machine learning systems we support as well, including TensorFlow and PyTorch and so on. But another interesting thing with the cloud is it does make it easy in theory for teams to build and then deploy a lot of different applications because they don’t have to worry about infrastructure.
So what we see is that every company that starts using machine learning first has one or two use cases that have clear value and are really important for them to do. They put together a team. They get all the data pipelines and infrastructure, but once they put out those first one or two use cases, they have a backlog of tens, maybe hundreds more that they want to do. And they say, “Look, it’s cloud infrastructure. I can just click to launch more machines. And we have all the data and we have the team that knows how to do machine learning.
So how can we actually get lots of these and allow more of our company to use machine learning?” So to be able to do that and to have those teams be able to scale and to really focus on designing new applications, it needs to be very easy to productionize and maintain the existing ones. And ideally, it’s almost automatic where you launch it and then you just hear if something is going wrong. That’s where this infrastructure becomes especially important. And we see a lot of companies using it for that reason.
So speaking of infrastructure, people typically have different environments. They’ll have their dev, their staging, their test. What recommendation do you have for people trying to build models and promote them across these stages? Do you typically see people retraining their models and staging them in prod, or are they just reusing the same artifact that they developed in the development workspace? What do you typically see and what advice do you have for customers that are actually trying to productionize their ML models?
It’s a great question. And this can be a little tricky to do for machine learning, and it also depends on how companies organize their workflow in general. So, one thing we definitely see is, as you’re launching a new model, you do want to keep track of the state of development that it’s in and maybe to have either automated or manual checks that go on to push it into other states. So actually in MLflow, one of the central pieces of it is the model registry, which is an environment where you can define some models that you want to have. Let’s say for example, recommendation model or churn prediction, and then people can post different versions of them in there. And you can tag which versions are just development, which versions are staging, which ones are production, when they each came and you can comment on them, or you can actually connect automated systems through webhooks that will run an automated test.
So just tracking those is one important aspect. In terms of what data you train on and so on, we sometimes see companies that separate, let’s say the dev or staging data from the production one, but for machine learning, it’s a bit of a problem because you want to make sure that it works on the real data. So I think through some form or another they’re going to want to do even development on production data. And to make that work really well you probably want a way to create clones of this data that maybe are read only in the dev and staging environments, and also to keep track very carefully of training sets, test sets and how you separate these so that they’re consistent. So you don’t accidentally get leakage by training on the same data that you test on.
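To illustrate the point about keeping training and test sets consistently separated, here is a minimal Python sketch. It is not something described in the conversation: it assumes each record has a stable string ID, and the function names (`assign_split`, `split_records`, `assert_no_leakage`) are hypothetical. The idea is that hashing the ID gives every record a fixed assignment, so re-running the split on a newer snapshot of the data never moves a test record into training.

```python
import hashlib


def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a record ID to 'train' or 'test'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_records(record_ids, test_fraction: float = 0.2):
    """Split IDs into (train, test) lists using the hash-based assignment."""
    train, test = [], []
    for record_id in record_ids:
        if assign_split(record_id, test_fraction) == "test":
            test.append(record_id)
        else:
            train.append(record_id)
    return train, test


def assert_no_leakage(train_ids, test_ids) -> None:
    """Raise if any record appears in both the training and the test set."""
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} record(s) appear in both train and test")


ids = [f"user-{i}" for i in range(1000)]
train_ids, test_ids = split_records(ids)
assert_no_leakage(train_ids, test_ids)
```

Because the assignment depends only on the ID, the same check can be repeated against every new snapshot of the data.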
So what we see in Databricks, a lot of this happens through Delta Lake, which is this structured data management layer over S3 and other cloud object storage that makes it very easy to keep track of multiple versions of data sets, create clones of them, modify them. And this is also one of the things that MLflow integrates with to tell you exactly what version of data was used. So I think the combination of data versioning and management through something like Delta Lake, which makes it very easy for people to create snapshots of the production data sets and remember what was used for what, plus explicitly tracking your models and their stages, sets you up in a good spot to do this.
Thanks Matei. I think you’ve really covered the concept of the underlying issues when you’re working with machine learning and this concept of data reliability. I think that naturally segues to my next question which is then, how do lakehouses in general make machine learning more robust when it comes to your production environments?
So the lakehouse is this new technology trend we’re seeing where you take data lake systems. These are systems that let you do very low cost storage, such as Amazon S3 or Azure Data Lake Storage or Google Cloud Storage. And these systems have historically had a low level interface. They’re basically just file systems or key-value stores, but you can actually implement powerful data management features on them, similar to what you’d have in a data warehouse. So things like transactions, cloning of data, views, data versioning, and so on. And so by adding that, you can suddenly have this very low cost storage. You can easily ingest new data into it, and it becomes really easy to manage.
So, Delta Lake is an example of a system that can let you build a lakehouse. It adds transactions, schema enforcement, and different types of indexes to speed up access as well, on top of a collection of files. And it makes it a lot easier to maintain that. And we see a lot of organizations using these to manage their machine learning data sets. We also see them using these as feature stores where you manage the computed features you get, because that’s also an area where you want to keep track of multiple versions, go back in time when you want to compare models or compare algorithms and so on. So it’s an important thing. And a lakehouse also looks nice. You can see one right here.
Yes. That is a beautiful lakehouse. Where is that photo taken, Matei?
It’s actually somewhere in the Netherlands, apparently. We have an engineering office there too. It’s really nice out there.
We definitely love visiting that office out there whenever we get a chance to travel. But I do want to ask you a follow-up question to that. So you were just talking about the need for Delta or a lakehouse to be able to version your data, and the need for MLflow to help you track your model, any hyperparameters, artifacts, et cetera. How do you combine this to be able to detect model or data drift?
That’s a great question. I think it is a little bit application dependent today. We’d love to have a general solution that works for everything, but I think people do have to do something custom there in most cases for it to work well. There are actually a bunch of interesting libraries out there emerging that help with this. So in general, you want to know what’s the difference between let’s say a data set I got today versus what I had yesterday. And you might want to see something like this column that used to have a lot of values is now null all the time, or maybe this column that had countries in it, maybe there were 10 countries that the model was trained on, but now there are 11 of them.
That’s probably bad for machine learning because it means you’ve never trained on the new one. Or maybe something about the range of a value changed. So you want to detect these changes. And you can imagine applying the same thing to the predictions that come out as well, which are also data. So you might say, “Look. My model used to predict whatever, 50/50, let’s say male and female or something on the images coming in, but now it’s not doing that anymore.” So similar things apply. There are a number of libraries that are starting to help with that, and they either let you manually write rules that they enforce, or they automatically try to tell you what looks the most different between two data sets.
One example would be libraries like Great Expectations in Python, or the expectations or constraint checks feature in Delta Lake, which lets you easily write some checks you want to run on the data. So you can see, for example, if a value is out of the expected range or if there are more nulls than you would expect or something like that. That’s a nice building block as long as you’re willing to write your own rules. Other examples are libraries that will also compute a difference for you. So for example, TFX Data Validation, part of the TFX project from Google, has a way of comparing two data sets and telling you which columns seem most different.
And it also has a schema concept that’s similar to these expectations where you can say what you expect and you can be told when data falls outside of a schema. And finally there are all kinds of anomaly detection and anomaly explanation systems. Some of them are pretty researchy at this point. So I collaborated on a system called MacroBase at Stanford that tries to find anomalous groups in data, basically just combinations of attributes where the frequency changed a lot. So these can help as well.
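The hand-written checks mentioned here, such as a value outside its expected range or more nulls than expected, can be sketched in plain Python. This is only an illustration and does not use the Great Expectations, Delta Lake, or TFX Data Validation APIs. The helper names, the `age` column, and the thresholds are all assumptions made up for the example.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_max_null_rate(rows, column, max_rate):
    """Return a readable message if too many values are null, else None."""
    rate = null_rate(rows, column)
    if rate > max_rate:
        return f"{column}: null rate {rate:.0%} exceeds allowed {max_rate:.0%}"
    return None


def check_range(rows, column, low, high):
    """Return a readable message if any non-null value is outside [low, high]."""
    bad = [
        row[column]
        for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    if bad:
        return f"{column}: {len(bad)} value(s) outside [{low}, {high}], e.g. {bad[0]}"
    return None


def run_checks(rows):
    """Run the hypothetical rules for this data set and collect the failures."""
    results = [
        check_max_null_rate(rows, "age", max_rate=0.25),
        check_range(rows, "age", low=0, high=120),
    ]
    return [message for message in results if message is not None]


todays_rows = [{"age": 34}, {"age": None}, {"age": None}, {"age": 250}]
for message in run_checks(todays_rows):
    print(message)
```

Each failure names the column and the rule that was broken, so whoever receives the alert can either fix the data or adjust the rule.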
I think in practice, teams usually want to think about the rules that they’ll want and implement at least some of them manually because the problem with automated alerts is that if there are a lot of false alarms, then people will just ignore what’s coming out of them. For example, this is one of the big lessons from the TFX team at Google. When they talked about it, they said they tried all these methods to look for the difference between two distributions. And they would report things like the KL divergence is more than five today or something like that.
And the teams doing operations and even the machine learning teams didn’t know what to make of it. And they ignored all the alerts. But if the alert was this model was only trained on these 10 countries and we’re now seeing data for an 11th country, then that’s something very interpretable. And they could either go in and say actually the model doesn’t care about that country. Don’t alert when it’s these 10, or they could go and try to fix it. So that’s what they discovered. Every alert that comes up needs to be something where you know why it came up and you can either change the expectations so you don’t get that anymore, or act on it and figure out how to change your data.
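To make the contrast concrete, the sketch below computes both kinds of signal on a toy country column: a KL divergence between today’s distribution and the training distribution, and a plain list of countries the model never saw. It is an illustration only, not the TFX implementation. The smoothing constant, the function names, and the sample data are assumptions.

```python
import math
from collections import Counter


def distribution(values, support, smoothing=1e-6):
    """Smoothed relative frequency of each category in `support`."""
    counts = Counter(values)
    total = len(values) + smoothing * len(support)
    return {v: (counts[v] + smoothing) / total for v in support}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same categories."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def unseen_categories(training_values, incoming_values, ignore=()):
    """Categories present in incoming data but absent from the training data."""
    return sorted(set(incoming_values) - set(training_values) - set(ignore))


training = ["US"] * 50 + ["NL"] * 50
incoming = ["US"] * 40 + ["NL"] * 40 + ["BR"] * 20
support = sorted(set(training) | set(incoming))

# Opaque alert: a single number that is hard to act on.
score = kl_divergence(distribution(incoming, support), distribution(training, support))
print(f"KL divergence vs. training data: {score:.2f}")

# Interpretable alert: names exactly what changed.
print("Countries not seen in training:", unseen_categories(training, incoming))

# If the team decides the model does not care about a country, they can
# suppress that alert explicitly instead of ignoring all alerts.
print("After suppression:", unseen_categories(training, incoming, ignore=("BR",)))
```

The `ignore` argument mirrors the choice described above: either change the expectation so the alert stops firing, or act on it and fix the data.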
Matei, that was super interesting. So this actually really segues really nicely into, what is your research group then focusing on these days? How has it evolved?
My research group at Stanford actually works a lot on systems for machine learning and it’s changed quite a bit from my earlier focus. Before, I used to work mostly on distributed systems and data management, but now I’ve become very interested in all aspects of actually using machine learning in production. And so we’re doing a few things that are pretty relevant, especially to this MLOps area, and some things that are more general purpose for machine learning as a whole. So I’m just going to highlight maybe two interesting projects that we’re doing. So one project is about quality assurance for machine learning. It’s related a little bit to what I talked about with expectations about your data.
It’s called model assertions. And it’s basically a way for you to write assertions or expectations about what your model is predicting. For example, you might say if I have a model that looks at cars driving around in a video or something, I expect the location of each car on nearby frames to be close together. And if it’s far away, then it’s probably misidentifying which car it is or something like that. So what we found here is you can write very simple rules for what you want to check for. It could just be a Python function that looks at the output of the model, and you can actually capture a lot of the common misbehaviors of models.
And then we also have various ways to use these in the training process as a supervision signal, where you can train a model that avoids those failure modes. So we used this, for example, on some autonomous vehicle data to help correct issues that those models have with perception. We used it on some medical ECG data. And basically whenever a model is going wrong, usually when you show it to a person they’ll say, “The model tends to be bad in terms of tracking the car across frames.” They have some way of explaining what’s bad. And if they can write code that detects some of those, you can actually start to fix it.
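The car-tracking assertion described here can be written as a short Python function over the model’s output. The sketch below is an illustration, not the research group’s released library. The input format (a dict of track IDs to `(frame, x, y)` detections), the function name, and the 50-pixel threshold are assumptions.

```python
import math


def flag_location_jumps(tracks, max_jump=50.0):
    """Flag tracked objects whose position jumps too far between adjacent frames.

    `tracks` maps a track ID to a list of (frame, x, y) detections sorted by
    frame. Returns a list of (track_id, frame_a, frame_b, distance) tuples.
    """
    violations = []
    for track_id, detections in tracks.items():
        for (f0, x0, y0), (f1, x1, y1) in zip(detections, detections[1:]):
            if f1 - f0 != 1:
                continue  # only compare directly adjacent frames
            distance = math.hypot(x1 - x0, y1 - y0)
            if distance > max_jump:
                violations.append((track_id, f0, f1, distance))
    return violations


model_output = {
    "car-1": [(0, 100, 100), (1, 104, 101), (2, 400, 300)],
    "car-2": [(0, 10, 10), (1, 12, 11)],
}
for track_id, frame_a, frame_b, distance in flag_location_jumps(model_output):
    print(f"{track_id}: moved {distance:.0f}px between frames {frame_a} and {frame_b}")
```

Flagged frames could then be reviewed, or, as described above, fed back into training as a supervision signal.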
That’s one area that’s pretty cool. We have an open source release but it’s very early on, and we’d love to expand this idea even more. The one other project I’ll mention, because it’s really interesting and it actually ties back to distributed systems, is a project we have on natural language processing and applications which is called ColBERT. So Col and then BERT. And the basic idea here is there’s so much interest now in these huge language models like GPT-3, where basically you run it over a giant collection of texts and you memorize a lot of stuff, and now you have these billions, maybe trillions of parameters in there, and then to run a task you have to feed in that and multiply by all those parameters, do some algebra to get a result.
So what we’re trying to do instead with ColBERT is maybe closer to how people think about stuff, which is to have something that can actually do lookups into some memory instead of just doing a ton of computation on every input. So it’s something where you have a collection of documents and some query that comes in. Based on embeddings of the query, or of the words in it, you decide which documents you want to look up in the embedding space, and you do some lookups into a table. Then you take those results, you look at your question again along with the relevant information from the documents, and you answer your question.
It turns out that this method can work really well. And it’s a lot more computationally efficient because you’re not scanning through essentially every document every time. You’re just doing lookups. So we have really nice results on search, information retrieval, question answering and other applications using this. So it’s just an area that I think is super interesting. And the insight is on the systems side: we can make this search for matching documents really efficient. It’s a lot faster than running everything through GPT-3.
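As a rough illustration of matching a query against documents in embedding space, the sketch below scores documents with the "late interaction" rule from the published ColBERT work: each query-token embedding takes its best match among the document’s token embeddings, and those maxima are summed. That scoring rule is not spelled out in this conversation, and the two-dimensional vectors are made up for the example. This toy also scores every document by brute force, whereas the point made above is that the real system uses efficient lookups instead.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs):
    """Return document names ordered from best to worst score."""
    scored = [(late_interaction_score(query_vecs, vecs), name) for name, vecs in docs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up token embeddings: the query has two tokens pointing in different directions.
query = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
docs = {
    "doc_a": [normalize([1.0, 0.1]), normalize([0.1, 1.0])],  # covers both query tokens
    "doc_b": [normalize([1.0, 0.0]), normalize([1.0, 0.2])],  # covers only the first
}
print(rank(query, docs))
```

A document only scores well if every query token finds a close match somewhere in it, which is why `doc_a` ranks above `doc_b` here.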
Well, this is some pretty amazing stuff. And so this actually leads right to my next question. How do you balance all of this research that you’re doing and being a professor at Stanford and being the CTO of Databricks all at the same time?
It’s definitely a lot of stuff to do. Fortunately I have great teams in both places. My grad students are doing really great research and I just need to give them some advice and so on, but they’re doing a lot of the work, and obviously same thing with the teams at Databricks. So, in terms of how I spend my time, I just started to have some days or whatever that are dedicated to each of them so that everyone knows where I am and when I’m available and so on. And it works pretty well. And of course at each time I’ll be focusing on different things in each place, the things that I spend a lot of time hands-on with, writing or talking or designing things. It’s been interesting. It took me a while to learn how to do it, but it’s definitely interesting because I learn a lot in both places about what people care about in practice and what’s possible and not possible on the research side.
I’m curious, which way does it tend to flow? Is it things that you’ve encountered with Databricks customers that tend to influence your research, or is it your research that tends to influence some of the direction of the Databricks product?
That’s a good question. Usually I think so. For research, I’m usually trying to do things that are quite far away and are higher risk. So not necessarily something that will solve a problem tomorrow that a customer is running into, but it’s nice to know what the actual challenges are in practice, because in the research community, they’re often a little bit far away from what users care about, or even if they work on a problem, let’s say data quality, they might frame it in their own way, which doesn’t match what users actually have. For example, they might say, “I got a data set once, and I want to know what’s wrong in it,” but in fact, most people have a pipeline and they can compare stuff over time.
So I think it’s really important to understand some of those problems and some of these, for example, the model assertions thing, came out of us actually at Stanford trying to use machine learning for something. All the ML people said that computer vision was solved and it was really great. So we were trying to do video analytics, and then we noticed it kept making the same mistake all the time, which is missing the object in some adjacent frames or thinking that one object is two different ones and giving us two bounding boxes. So stuff like that. And we said, “Wait. We can even write a little rule that finds these. Is there any way we can use that to train it to avoid these problems?”
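One of the mistakes mentioned here, a single object reported as two bounding boxes, is the kind of thing a small rule can catch. The Python sketch below flags pairs of boxes in one frame that overlap almost completely, using intersection over union (IoU). It is an illustration under assumptions: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the 0.8 threshold are not from the conversation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def flag_duplicate_boxes(boxes, threshold=0.8):
    """Return index pairs of boxes that overlap enough to be the same object."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= threshold:
                pairs.append((i, j))
    return pairs


frame_boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
print(flag_duplicate_boxes(frame_boxes))
```

Frames where the rule fires are candidates for the kind of corrective training signal discussed in this answer.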
So usually, problems that I see in industry and in using things can often inspire new research. I think the way stuff might flow the other way is just knowing what’s possible and what isn’t. So for example, I’m working on a bunch of things to do with data governance and security, and it’s useful just knowing what people are working on in cryptography and differential privacy and these fields that you could use as a tool, although maybe we won’t. Maybe the first problem is not solved by one of those. It’s just knowing where things might go after.
Well, I definitely want to know when you get your model assertion library out there and ready for us to use, because I know plenty of customers that would love to use it. I’ve seen cases where someone’s trying to predict the price of something and somehow that combination of features, it could be negative. And that just should never happen. You should never be paying someone to take that product. But I have a serious question for you. Since you straddle both industry and academia, do you prefer PyTorch or TensorFlow?
That’s a good question. I think they’re both good frameworks. I would say in my research group, it’s almost entirely PyTorch. There are some folks using TensorFlow, but for prototyping, PyTorch is very fast, and also a lot of research is published with it. In industry, we see much more usage of TensorFlow, but I think it’s getting pretty close to 50/50 actually.
Is there any reason why you think industry tends to prefer TensorFlow? Do you think it’s ease of deployment, the Keras API? Why do you think industry prefers TensorFlow?
I don’t know for sure. I think a lot of it has to do with which models they want to use because people start with an existing code base. And if they think there’s a mature code base out there for something and they can just pick it up and use it, they’ll use that. So I think TensorFlow did a great job of packaging a lot of models with the project, and a lot of users started out that way and then they would start and modify the model based on their needs.
Whereas with PyTorch, it’s faster to experiment with and you can find these research papers that all open source their code, but getting someone’s research code to run on your data and to be easy to modify is very hard. I’ve definitely seen that. In our group, we spend a ton of time just trying to figure out how to even launch some of the packages we see there that are a few months old. So I think the fact that there’s an engineering team building those helps, but I think over time PyTorch has also added a lot of these built-in models, and maybe people don’t really care anymore now, as long as they get their model.
Thanks very much Matei. Our second to last question here is going to be very much less controversial per se. I’m just curious, since our timing is very close to, if not right with the upcoming Data + AI summit, are you planning to present? I’m guessing so, right?
I am planning to present. And I think at this time we’re still figuring out which talks I’ll give, because they want me to give a lot of the talks. But we’ll have some very exciting announcements on Delta Lake, including something that I’m working on that will greatly increase what you can do with that project and how you can manage access to that data inside a company. That’s one thing that I’ll likely be talking about. And then we also have some interesting updates about Apache Spark, MLflow, Koalas, which is the open source, pandas-like API layer for Spark, and some of the things we’re doing in our product around machine learning as well. So I’ll be giving some subset of those talks. Don’t miss the event. We have really great content and some really exciting keynotes lined up this year too. External keynotes.
I’m definitely looking forward to the event. And then the very last question, also not controversial. I know that you have had many years of experience in machine learning, and I know for some folks getting into the field, it can be a little bit intimidating. And so I want to ask you, what advice do you have for people that want to get into the field of machine learning, but perhaps don’t have any training in the field of data science or machine learning?
It’s a good question. So I think there are two important things. So one of them, I think, is to try to do some things hands-on. There are a lot of examples out there you can try to use and maybe change in a way even before you understand all of the theory behind them. So make sure you learn how to work with these hands-on on your laptop, or something like a notebook or some cloud environment or something like that. And then the other thing I think is good, or at least if you really want to understand it in depth, I do think it’s good to go and either get a book or look at an online course about just the foundations, the theory of what’s happening.
And even starting with simple, not deep learning, just classical machine learning. And once you understand a little bit of that, it will be much easier to see how new things fit in and also what can go wrong and why. And a lot of the terms in the field are… Unfortunately, if you think of it as similar to human learning, and you try to think of the algorithms as people in some way, you can easily get misled. So I think it’s good to know what’s really happening from a mathematical point of view under the hood.
I definitely remember having that aha moment when I realized deep learning is just a series of matrix multiplications and some nonlinear transformations. So, I definitely echo your point of understanding what’s happening under the hood. Matei, I just want to say thank you again for joining us. I know you’re very busy and you took some time out of your Databricks dedicated day to come and join us on Data Brew. And so we really appreciate it here. And just want to say thank you again for joining us.
All right. Thanks again everyone for watching.