Season 2, Episode 3
Infrastructure for ML
Adam Oliner discusses how to design your infrastructure to support ML, from integration tests to glue code, the importance of iteration, and centralized vs decentralized data science teams. He provides valuable advice for companies investing in ML and crucial lessons he’s learned from founding two companies.
Adam Oliner recently left his role as Head of Machine Learning at Slack to found a company that will help every business get value from their data. Before that, he was Director of Engineering at Splunk, leading a team doing data science and machine learning. Adam was a postdoctoral scholar in the EECS Department at UC Berkeley, working in the AMPLab, which specializes in cloud computing and Big Data. He earned a PhD in computer science from Stanford University and an MEng in EECS from MIT, where he also earned degrees in computer science and mathematics.
Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we will interview subject matter experts to dive deeper into these topics. And while we’re at it, we’ll be enjoying our morning brew. My name is Denny Lee, and I’m a developer advocate at Databricks.
Hello everyone, my name is Brooke Wenig. I’m the other co-host along with Denny Lee. I lead the machine learning practice team here at Databricks and today I have the pleasure to introduce my longtime mentor, Adam Oliner, to join us on Data Brew. Adam, how about you do a quick round of introductions and then we’ll get into how you got into the field of machine learning.
Thanks Brooke and Denny, it’s great to be here. Yeah, until the beginning of January I was head of machine learning at Slack, and I’ve since left to start my own venture. Prior to that, just giving a quick bit of background, I was on a sort of academic trajectory. So undergrad and master’s at MIT, PhD at Stanford, postdoc at Berkeley in the AMPLab, along with all the Databricks folks who you two know very well. I started a company based on my postdoc project there that did mobile battery diagnosis. And then I ended up joining Splunk where I built the machine learning team over the next four and a half years. And that brings us to today. Great to be here.
Thanks Adam. So while Adam had spent his time as director of engineering at Splunk, he was actually my mentor for a summer internship project. And so that’s how we originally connected and so glad to have you today. So let’s go ahead and kick it off with what are you most excited about with machine learning?
I think we’re very much at the beginning of seeing what people will do with the technology. I think if you look at the data space writ large, most of the progress that you’ve seen over the last 10 years has happened in the kind of structured or semi-structured space. And only pretty recently in the more sort of what I would call natural data space, so images and video and natural language text. It still feels like we are very much at the first step of the process of making use of all of that data. And that means that there’s a whole lot ahead of us, that I’m excited about. Whole industries that have yet to make use of it. 80% to 90% of the world’s data hasn’t really been learned from, I would say.
Very cool. And so I know that a lot of your background in machine learning has been more on the machine learning infrastructure side. What problems do people commonly encounter when they set up machine learning projects with respect to their infrastructure?
Yeah, I think one mistake I see a lot is that when a feature gets into the hands of customers, they don’t think about it as the beginning of the process, they think about it as the end. And I see teams make this mistake all the time and they build some data-driven feature, they release it, they find the performance to be mediocre and then they either roll it back or abandon it. And sometimes these organizations even come away with the mistaken impression that ML is therefore not actually very effective for that problem or on their data, when in fact they kind of quit at the beginning.
Whereas the right way is to build and plan for iteration. You should assume that you’re going to start out with something that works okay and that you’ll need to measure performance, collect more data and iterate until it’s good enough. And that’s partly organizational, like having the political will to invest that time and iteration, but it’s also about how you develop. You need to put in place the instrumentation that will inform the iteration process and, for example, make the cost of training each iteration of the model as low as possible. And the infrastructure to support that rapid experimentation and iteration is crucial. And so when you see organizations sort of floundering with machine learning, it’s often because they haven’t invested in the infrastructure to make iteration easy, cheap, and effective.
I really liked that you had picked the term iteration because machine learning models just go through iteration after iteration for their training itself, so we should plan for that in our infrastructure. So going back to designing of the infrastructure, what are your thoughts on using open-source tools and building more on an open-source stack versus using commercial and somewhat proprietary tools?
Yeah, I’m going to sort of dodge the question. I think much is made of the components of a system. People will talk about, oh, what are they using to train models or host models? Where are the data stored? And I think that’s fine. And depending on your organization and your needs, I think you can make reasonable arguments about why or why not to use open or closed source implementations of those components. But I don’t see nearly as much discussion about the so-called glue code and that glue code matters.
Even commercial solutions can sometimes be very hard to integrate, especially if your existing infrastructure is in some way different from the norm, which it almost always is. I don’t actually even know what the norm is. But almost every piece of infrastructure writ large is special. It’s a snowflake in some kind of way. And if you don’t think about how things are going to get wired up, how data is going to flow through your infrastructure, how you’re going to handle things like versioning and data governance and compliance concerns, it kind of doesn’t matter whether any individual component is open or closed source, you spend all of your time working on the glue code.
This is really interesting because there seems to be an implication then that whether it’s the infrastructure itself to support everything or the glue that gets it all together, an exorbitant amount of your time is really spent on getting that set up. Would that be a fair assessment?
I think glue code generally takes a lot of upfront development time and resources, but it also has an ongoing cost because it’s the code that needs to change whenever the surface area of an individual component changes. So if an API changes or if you swap out one component’s implementation for another one, often you then have to go back and change that glue code. I think another way to think about it is how you draw the boundaries of the components in your system. So if you have a robust engineering team, then maybe those pieces are relatively small. The sort of microservices type of architecture where every component has a very small finite purpose and then everything else just gets kind of wired together.
But you can also take a more solutions focused mindset where you say, actually, I only really care about putting the data in on this end and getting a particular artifact out the other side and everything that’s in between, that’s not important to my business and I would just like that abstracted away. In which case the glue code then is internal to that component and you kind of don’t have to worry about it.
Right. But then isn’t there an implication, as you’re trying to manage all of this glue code, that different teams in the same organization may end up actually having very different types of glue, per se?
Yeah, that can certainly happen. And you see sometimes the components or the glue code have to make concessions to what resources are available. And so it often won’t be as generalized as maybe you would like and so that has longer term costs down the road and things like that. So, it just tends to be a little more invisible when people talk about how to implement or integrate a piece of infrastructure.
Got it. Well then I guess this naturally segues to my next question, which is, which you sort of alluded to, but any advice on how you manage all of this? How you manage this glue, how you manage the building up of the necessary infrastructure?
I think you manage it the same way you manage other software, which is to say that you try to build good abstractions around it, think about reusable components, document it. Just don’t treat it as an invisible thing that you’ll figure out later, but really think about what that glue code needs to do and think about that as a first-class sort of entity.
So I like that you had said, don’t treat the glue code as something invisible. You’re going to see different colors of glue when you put it together. But what about other aspects of combining these different components like testing? How do you incorporate unit testing, integration testing when you’re working with machine learning? Because typical software engineering uses all of these concepts, but are these required for machine learning or are they just nice to have for machine learning?
I think for the glue code in particular, this is where you think about integration tests as opposed to just unit tests. So the glue code has a responsibility for end to end properties of these components wired together and so that would be sort of where integration tests would shine. But I think sort of stepping out a little, a level up and thinking about machine learning more generally and how would you go about testing it, again, it kind of depends. I think any organization with finite resources has to make trade-offs. And the fact is, some features are just not mission critical, and maybe you can get away with a model that’s mediocre because it’s better than nothing. And a mediocre model is way cheaper than building a model that’s great.
And maybe the feature isn’t performing a function where issues like ethics or fairness come into play. Maybe the data driving the feature changes so slowly that ongoing monitoring for a model or data drift simply aren’t necessary. But if the software is driving a truck or making medical decisions or something similarly important, you need lots of tests and guardrails because the cost of bad behavior on the part of the software, far outweighs the cost of implementing those tests. So it’s a balance between rigor and budget as so much in software engineering is.
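To make the unit-versus-integration distinction concrete, here is a minimal sketch of an integration test for glue code. Everything in it is a hypothetical stand-in rather than anything described in the conversation: the three toy components, the `pipeline` glue function, and the sample data are assumptions chosen only to show a test that checks an end-to-end property of components wired together.

```python
# Hypothetical components; in practice these would be real services or libraries.
def load_events(raw_rows):
    """Component A: parse raw 'user,item' strings into (user, item) tuples."""
    return [tuple(row.split(",")) for row in raw_rows]


def train_popularity_model(events):
    """Component B: 'train' a trivial model that counts views per item."""
    counts = {}
    for _user, item in events:
        counts[item] = counts.get(item, 0) + 1
    return counts


def recommend(model, k):
    """Component C: return the k most-viewed items (ties broken by name)."""
    return sorted(model, key=lambda item: (-model[item], item))[:k]


def pipeline(raw_rows, k=2):
    """Glue code: wires the three components together end to end."""
    return recommend(train_popularity_model(load_events(raw_rows)), k)


def test_pipeline_end_to_end():
    """Integration test: checks the wired-up system, not any single component."""
    raw = ["u1,apple", "u2,apple", "u3,pear", "u1,kiwi", "u2,pear", "u3,apple"]
    assert pipeline(raw, k=2) == ["apple", "pear"]


test_pipeline_end_to_end()
```

A unit test would exercise `load_events` or `recommend` alone. The integration test is the one that would catch a change in the format passed between components, which is the kind of breakage glue code is exposed to.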
Oh no, that is so true. In one of my past lives we ended up only getting I think like a 1% improvement using machine learning model and that’s all it was, but it still did translate into the tens of millions of dollars. So we’re like, you know what, okay. Versus exactly to your point. If it was something medical, yeah, maybe we want a little bit better than 1% improvement here.
Yeah. I think much is made out of the last 2% problems. Google is spending however many hundreds of millions squeezing a little bit more out of their advertising systems because that is billions of dollars for them and that’s the sort of space in which they operate. But I would say that almost every other business is working in the first 80%, and getting from zero to 80% is huge and doesn’t require the same kind of infrastructure and rigor that you would need if, again, you’ve really got to squeeze out every last bit by throwing more and more data at it, which means it needs to be more scalable.
Or you’re more and more sensitive to the boundary conditions and the sort of, Oh well, we don’t want to hit pedestrians on Halloween. And so we have to think about this relatively minor case and make sure we get our data labeled correctly and learn how to not run them over. Just because we haven’t seen a banana walk out of a bar before, it doesn’t mean that we can hit them because it’s okay, they’re rare. No, you got to plan for that. But most people are working on that first 80% and as long as you generally make the button the right color or recommend the right person to talk to, no one’s going to really know that you didn’t do the last 20%.
So related to that then, since we’re going to really focus on the first 80% in this case, how has this all changed from an infrastructure and glue perspective, with big data? In a lot of ways, everything you’re discussing is relatively straightforward, and I use the term relatively here when it comes to smaller amounts of data. But when you’re talking in terms of vast amounts of data where the data changes over time, from your perspective, how do you deal with all of this?
Yeah. I think the distinction between data and code, that’s a conversation that can devolve into the philosophical pretty quickly. But I feel comfortable saying that data is now part of software in a way that it wasn’t always. It drives the behavior of software in particular. So you may need to manage, test and monitor the data the same way you would monitor your code, but that can present its own unique challenges. For example, a user of your product is usually not empowered to change your code, but they can absolutely change the data that’s driving it. So if you imagine a recommendation system that considers what products a user has viewed, a group of your users get together and systematically view or don’t view certain products or sets of products, they can affect the behavior of that recommendation system for everyone. And in a sense, that’s intentional. That’s what the recommendation system is doing explicitly, but it doesn’t always yield the results that you as the creator of the product might want.
And probably the best known example of this, just to give you the sense that this is not new per se, it might just be new and more widespread, is Google bombing. So back in the mid-2000s, a bunch of users linked the phrase miserable failure with George W. Bush’s biography. And nobody hacked Google to make that happen, they would just take that phrase and link it to his biography. And that’s what PageRank does: it figures out that this phrase is associated with it, and of course it went to the top of the search engine rankings. And that was just a bunch of users effectively changing data. And that changing data changed the behavior of software, in that case the Google search engine.
And that sort of thing where data drives the behavior of the software is just becoming ubiquitous in a way that it wasn’t before. Now increasingly it’s hard to point at a piece of software, especially one that’s commonly used, especially in the cloud space, where it’s not at least deriving some of its behavior from something it learned from data. And so now the users are kind of a part of your system and can change its behavior in a way that was not always so prevalent.
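A toy illustration of that point, with the code held fixed and only the data changing. This is a sketch under stated assumptions, not how any real recommender works: the co-view counting scheme, the function names, and the product names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count how often each ordered pair of products appears in the same session."""
    counts = Counter()
    for viewed in sessions:
        for a, b in combinations(sorted(set(viewed)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend_for(product, sessions):
    """Recommend the product most often co-viewed with `product` (ties broken by name)."""
    counts = co_view_counts(sessions)
    candidates = {b: n for (a, b), n in counts.items() if a == product}
    if not candidates:
        return None
    return min(candidates, key=lambda b: (-candidates[b], b))


# Ordinary browsing sessions (hypothetical data).
organic = [["tent", "stove"], ["tent", "stove"], ["tent", "lantern"]]
# A small group of users who systematically view the same unusual pair.
coordinated = [["tent", "rubber duck"]] * 3

before = recommend_for("tent", organic)
after = recommend_for("tent", organic + coordinated)
print(before, "->", after)
```

No line of `recommend_for` changed between the two calls; three coordinated sessions were enough to change what every user who views the tent is shown.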
That’s a super interesting example that you provided where users have… You could empower users to change the system, even if they don’t have access to the underlying code. But I do want to go back to something that you had said earlier about that first 80%, and that’s where most people are. How do you know when good is good enough with machine learning? Because you could always keep going to get those incremental improvements. How do you decide as an engineering manager when you’re going to say good is good enough?
That’s a great question. I think with any testing strategy, you’re balancing two things. One, making sure that it’s doing the thing it’s supposed to do, and two, making sure it’s not doing the things that it’s not supposed to do. And one of those is usually metrics driven. So if I have, say, a ranking system or a recommendation system, there’s presumably some business metric that it’s trying to drive. And so maybe that’s CTR or conversion rate or something. And if the model is driving those metrics in the direction that you want and it’s driving them far enough that it sort of justifies the cost of the project, then that’s great. And on the positive side, you’ve done enough. And then you can go back to the business and say, “Hey, I drove CTR up by 5%. Do you want to keep investing in this and see if we can get it another 5%?” And the business can decide if it’s worth the investment.
On the flip side, there’s also the, do no harm. So make sure it’s not doing a bad thing. And so there you might have to ask questions about, yes, on average maybe the CTR or whatever the metric was went up, but maybe there are certain subclasses of users, like certain demographics, for which it actually went down, and as a business you might care a lot about that. And so you might want to put testing in place to make sure that it’s not just going up on average, but that for all of the sort of cohorts that you care about, it’s still going up. And that’s again, just sort of a very business and feature specific kind of question, but I think that’s the structure that I usually think about: are you doing the right thing and not doing the wrong thing, and enough so that it justifies the feature?
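The cohort check described here can be sketched in a few lines. This is an illustration under assumptions, not a description of any team's actual tooling: the `(cohort, clicked)` log format, the function names, and the numbers are all hypothetical.

```python
from collections import defaultdict


def ctr_by_cohort(impressions):
    """impressions: iterable of (cohort, clicked) pairs. Returns {cohort: CTR}."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for cohort, clicked in impressions:
        shown[cohort] += 1
        clicks[cohort] += int(clicked)
    return {cohort: clicks[cohort] / shown[cohort] for cohort in shown}


def overall_ctr(impressions):
    """CTR across all impressions, ignoring cohorts."""
    impressions = list(impressions)
    return sum(int(clicked) for _, clicked in impressions) / len(impressions)


def regressed_cohorts(control, treatment, tolerance=0.0):
    """Cohorts whose CTR dropped by more than `tolerance` under the new model."""
    before = ctr_by_cohort(control)
    after = ctr_by_cohort(treatment)
    return sorted(c for c in before if c in after and after[c] < before[c] - tolerance)


def make(cohort, shown, clicked):
    """Build synthetic impressions: `clicked` clicks out of `shown` impressions."""
    return [(cohort, i < clicked) for i in range(shown)]


# Hypothetical experiment: the average improves while cohort "b" gets worse.
control = make("a", 100, 10) + make("b", 100, 10)
treatment = make("a", 100, 20) + make("b", 100, 5)

print(overall_ctr(control), overall_ctr(treatment))  # 0.1 0.125
print(regressed_cohorts(control, treatment))         # ['b']
```

In this made-up data the overall CTR rises from 10% to 12.5%, yet cohort "b" falls from 10% to 5%, which is exactly the situation an average-only metric would hide.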
Right. And then I think exactly using that example that you’ve provided here in terms of, if there’s a particular cohort that’s going down, that that is a very good example of the fact that you do need all that infrastructure in order to be able to support recording and tracking and analyzing so you can see what’s going on. I presume that you’ve seen this pretty much through all the iterations of you running the various machine learning companies that you have?
Yeah. I’d say that the infrastructure is important in a couple of ways. I mean, in many ways, but not just for iteration in the sense that, oh, you’ve released the first one and now you’re going to iterate. Obviously you want that to be as inexpensive as possible and to give you as much control as possible, for sure. But there’s also this startup cost that I think doesn’t get talked about a lot because for organizations that do a lot of machine learning, you pay it kind of once as an organization, and now you have it. And now you can ask questions like, “Hey, if we ranked this better, would it improve this business metric that we care about?” And you can just try it and you have the infrastructure already.
But if you’re a business that doesn’t already do a lot of machine learning or big data related projects, in order to just ask that simple question of, would we benefit from using data to improve recommendations, for example. You can’t even ask that without investing in a bunch of upfront infrastructure. And so that sort of initial cost also can be a barrier to even entering the machine learning space. You never get to the point of iterating because it’s too expensive to even try a thing. You don’t know how far the 80% is away from where you are.
So if you don’t have any current infrastructure, would you advise that people just go buy a pre-built solution for their problem or are those generally over promised and too much hype?
It seems like a leading question, Brooke. I would say that a lot of the offerings in the auto ML and solutions space, there’s definitely a ton of hype there, for sure. I think if you have a very specific use case in mind, it can often be fine to just start with someone selling a solution to that use case. I think though, if you step back and ask yourself, what is your data strategy as a business? What is your machine learning strategy as a business? You’ll usually find that you have more than one use case. And if you can afford to invest in some of that infrastructure, just to get you that narrow path to trying out that first use case, that often sets you up well for the second use case and the third use case.
And if this is part of a strategy that’s relevant to you, then that’s worth making as an investment because otherwise, if there are 20 use cases that you in the long-term sense want as a business, that’s 20 vendors. And so wouldn’t it have been better to maybe have one more infrastructural vendor and build a little bit of lightweight solutioning on top of that? Certainly as part of just vetting out which of these projects are going to be fruitful or not.
So then I guess, invariably, this sort of leads to my next question in that case which is, would then instead of actually going with a specific vendor, would it not just work out where I just hire a bunch of interns or recent college graduates and have them just Python pandas their way out of the problem and you’re good to go?
Well, I think maintainability needs to be part of the conversation. I love having interns by the way. I’ve had some amazing ones in the past so it’s set me up to be primed to really like having interns, Brooke. If you throw interns at the problem, I think you will quickly find that their tenure with the company tends to be short. It’s usually a few months. And if they have built some critical piece of infrastructure and then they leave, you’re left with the question of who is now maintaining that.
And I think infrastructure in particular, it’s more like having a child than shipping a feature. If you change the color of a button, you ship it, maybe someone will have to go and change it in a year, but basically it’s gone off to college, so to speak. But infrastructure lives at home like in your basement. It is a thing that you are feeding, doing care and feeding for on an ongoing basis. And so I think to the extent that it’s crucial for your business, you should really invest in making it part of it. But I think maintainability is a big part of that question.
I know a lot of data scientists are very focused on building models. How can I get the model with the lowest RMSE or the highest R-squared? But in terms of actually deploying and maintaining that, whose responsibility is that? Should that be the data scientists that develop that or should that be a separate team of machine learning engineers that are responsible for all of the ML ops?
That’s a great question. I have a whole rant about certain models of ML team structuring that I like to share. And in particular it argues that the consulting model for machine learning engineering teams often does not work well, partly for this maintainability reason. So if you imagine this model is you have a central ML team and anytime any team in the company wants to ship an ML feature, they pluck from that pool and that person goes and helps them build and ship the feature and then that ML engineer is returned to the pool. And there are a lot of problems with this model, at least a dozen that I have written out in excruciating detail, but one of them is that that engineer has a monotonically increasing list of features that they are responsible for maintaining. So when they shipped their 10th feature and returned to that pool, now there are 10 things that could potentially break where they’ll have to get called in for.
And that kind of fan out for an engineer, not only is it a recipe for them very quickly becoming overloaded if multiple things break at the same time, but it’s also not a very satisfying career development story. If you want to as an IC progress, especially to staff or principal levels, you usually want to have larger, more impactful projects rather than a laundry list of 15 different things that you built for 15 different teams. And so there are a lot of reasons. Maintainability is another one. These other feature teams would have to keep coming back to the pool anytime they want to do something additional with that feature. And I mentioned at the beginning that you have to plan for iteration. And this model is particularly poorly suited for iteration because you’ve divested yourself of your machine learning resource having just shipped the feature.
So that means they either stay on the team for long enough that they can iterate to make it great, in which case, at some point, they’re just part of the team. Just make them part of the team and move on. The places where I’ve seen centralized machine learning teams work really well is when they’re infrastructure teams because then they are building infrastructure and developing for the long term. Building for iteration, building for experimentation, all the things we’ve talked about being so important. And you want to have a team that owns that on an ongoing long-term basis, because it’s really crucial for your ML and data strategy. But having ML engineers inside of your feature teams is great, but they should stay there.
That makes a ton of sense. In fact, it’s actually reminding me of the diatribes I would go into in the past precisely for BI teams back in the day. It’s the exact same problem. You’d have central BI teams and exactly to your point, maintainability became a problem. Giving people career paths. Them working on small projects versus one very impactful project. So it’s interesting how when it comes to these types of problems, we’re just repeating the same thing over and over again.
But now back to the question here, how do you find that balance though between what you centralize versus what you decentralize? Because in the end, you’ve already sort of called that out. You want some of those folks to be able to be decentralized so they can actually have some growth, but then you still want an infrastructure team that actually can focus on the idea of tooling, iterations, infrastructure, glue, things of that nature. I’m just curious, how would you break that down?
Yeah. Broadly speaking, I think about it in the following way that there’s going to be a central machine learning data infrastructure team, which may or may not be one team, but that is a function that needs to exist. And they’re responsible for data and machine learning infrastructure. The question of how much are they responsible for, will depend on to what extent the infrastructure that they’re building is reusable across different feature teams. So if you find that there are 10 different feature teams that are all using the same tooling for training models or doing integration tests once they’ve deployed those models, that sort of thing, then that seems like infrastructure that ought to be shuffled over to the infrastructure team so you’re not building it 10 different times.
But if there’s just one team that’s using a piece of software as a one-off, there’s no need to necessarily move that to the infrastructure team, the team that’s using it can use that. And so that requires that there is somebody who is paying attention to these efforts across the company, looking at all the different teams that are learning from data and putting that into production in various forms and thinking about, are there repeatable patterns here? Are there components that are being rebuilt or reused that we can then pull off onto centralized infrastructure? And that’s usually the role of a head of machine learning or head of data or something similar.
Also a role that Adam has played in his previous career, head of machine learning at Slack. Cool. So I would like to transition and just ask you, what advice do you have for companies that are planning to invest in machine learning? Do you have different advice for people that don’t have their infrastructure and machine learning is brand new to the company, versus companies that have already set up the infrastructure? What advice do you have to give to those different scenarios of companies?
I think if you don’t have infrastructure for doing this stuff already, it needs to be a part of the conversation. You need to decide how much of this are you going to put in place upfront and to what extent you can get away with sort of a spit and baling wire implementation of something and actually test out whether it satisfies your business needs. I think that is a conversation that needs to be had. For organizations that already have a lot of that infrastructure it’s a different kind of conversation. Then the challenge is knowing which of the thousand different features that I could be investing in are going to be most fruitful for my business? And this is where having that infrastructure is handy because you want to do rapid experimentation. So, it shouldn’t take you two sprints just to try out whether or not a particular feature is going to work or not. It should be something that you can test very rapidly.
Oh no, actually I wanted you to finish off because it’s sort of funny how basically everything’s about iterations. Everything is about going fast and iterating just like ala machine learning. Whether you’re building the infrastructure or developing whatever product or features, it’s still about iterating through. And so it seems to be a recurring theme here, that’s all.
There’s so many features you’re trying to ship that make use of data and machine learning, where you have no way, a priori, of knowing how effective it is until you put it into production. There’s just no offline dataset that is going to tell me, for example, which things Slack users are going to click on in their search results. And picking on Slack for a second, since I don’t work there anymore, we couldn’t look at any of our users’ data. Under no circumstances did I get to see what people were searching for at Slack or what people were clicking on when they got their search results. All we had were sort of the overlaying metrics of someone ran a search, they clicked on a thing and they didn’t come back. I guess they were successful. And that’s the sort of granularity that you get.
And so if you wanted to just sort of sit and think really hard about what’s the best way to do search ranking at Slack or somewhere else, you’re sort of working from a place of ignorance. The only way to really find out is that you make a hypothesis and test it by putting it in front of customers. Run an A/B test and see whether or not it improves CTR, search success rate, or whatever it is that you care about. And there is iteration built into that sentence. You are implicitly, as part of the development process, getting something in front of customers so you can collect data and feedback so that you can then ship the one that is maybe going to actually solve the problem better. And there are so many problems like that where no one is going to come and give you some golden dataset that has exactly the labels that you want for your specific problem. You’re going to have proxy labels, you’re going to have guesses and heuristics and all sorts of other stuff. And at some point you’ve just got to see whether or not it does the job.
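For readers who want to see what "run an A/B test" on CTR can look like numerically, here is a minimal sketch using a two-proportion z-test. The choice of test is an assumption made for illustration and is not something specified in the conversation; the function name and the click counts are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Return (z, two-sided p-value) for the difference in CTR between B and A.

    Uses the pooled-proportion standard error, which is the usual
    large-sample approximation for comparing two click-through rates.
    """
    p_a = clicks_a / shown_a
    p_b = clicks_b / shown_b
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    if pooled in (0.0, 1.0):
        raise ValueError("test is undefined when nobody or everybody clicked")
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: control ranking (A) versus new ranking (B).
z, p = two_proportion_z_test(clicks_a=1000, shown_a=10_000,
                             clicks_b=1100, shown_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value only says the difference is unlikely to be noise; whether a one-point lift in CTR justifies further investment is still the business decision discussed above.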
Excellent. Excellent advice. Well then in terms of advice, I did want to ask this question, if you can, of course. Can you share a little bit about your current startup and, since you may be trying to avoid that one a little bit, what are some of the lessons that you’ve learned from founding what I believe is now your second company? So yeah, I’d love to learn a little bit more from both of those perspectives.
Yeah. Thank you. I mean, it’s still very much in stealth and I’m still very much at the beginning of the journey, but as a teaser I’ll say this. It’s true that data is at the heart of software more than ever before, but the vast majority of data remains cumbersome or challenging to work with if you think about images or video or natural language text or the like, and that constitutes 80% to 90% of all data and every business has lots of it. But you effectively require deep learning in order to work productively with that data and the expertise and infrastructure to do so is just still out of reach for most businesses. So I aim to change that and that’s my teaser.
And as for lessons, I’ll say a couple of things. First, I think this is a great time to start a company if you have the luxury of being able to do so. And it is a luxury. There’s a lot of available capital if you’re fundraising and at this economic moment, I’d much rather be building something than trying to sell it. The second lesson, and this is sort of my secret to success so don’t tell anyone, just between us three, I surround myself with extraordinary people and then I strive to be worthy of their company. That’s my secret. And while starting this venture, I’ve been reaching out extensively to a network of amazing people that I’ve built up over the years for help with fundraising and recruiting and customer introductions and so on. I cannot imagine how hard this would be without that network. So now it’s up to me to demonstrate that I’m worthy of that company.
Well, this has been a great session with you Adam. Thank you so much for taking time out of your Stealth company right now to join us on Data Brew and for all of the sage advice you’ve provided on glue code, infrastructure, iterating on machine learning projects. So thank you again for joining us today, Adam.
It was my pleasure, Brooke and Denny. It was great to be here. Always good to talk to you.