Season 2, Episode 6
Erin LeDell shares valuable insight on AutoML, what problems are best solved by it, its current limitations, and her thoughts on the future of AutoML. We also discuss founding and growing the Women in Machine Learning and Data Science (WiMLDS) non-profit.
Erin LeDell is the Chief Machine Learning Scientist at H2O.ai, the company that produces the open source, distributed machine learning platform, H2O. At H2O.ai, she leads the H2O AutoML project and her current research focus is automated machine learning. She has a Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley and a B.S/M.A. in Mathematics. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security, and the founder of DataScientific, Inc. She is also founder of the Women in Machine Learning and Data Science (WiMLDS) organization (wimlds.org) and co-founder of R-Ladies Global (rladies.org).
Welcome to Data Brew by Databricks with Denny and Brooke, the series that allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we will interview subject matter experts to dive deeper into these topics. And while we’re at it, we’ll be enjoying our morning brew. My name is Denny Lee, and I’m a developer advocate here at Databricks and one of the co-hosts of Data Brew.
Hello everyone. My name is Brooke Wenig. I’m the other co-host of Data Brew, and I’m the machine learning practice lead at Databricks. And today I have the pleasure of introducing Erin LeDell, Chief Machine Learning Scientist at H2O.ai and also the original founder of Women in Machine Learning and Data Science (WiMLDS), as well as co-organizer of the WiMLDS Bay Area group. Erin, welcome.
Thanks. Thanks for having me.
All right. So I would love to kick it off today with how did you get into the field of machine learning?
So, I’ve been here for a while. I got in during the big data era, which was way back in, I don’t know, 2010, let’s say. I guess I started machine learning, like the mid 2000s. I was originally like a mathematician. So I went to undergrad and did a master’s in math, and then I became a software engineer. And after doing that for about six or seven years, I started to be exposed to machine learning. And at the time, I didn’t really see a lot of paths to get into machine learning other than to do a PhD. Because there weren’t, like, bootcamps; there weren’t master’s of data science programs. There wasn’t anything like that. So I just figured that that was the most logical, if longest, way to get into the field. So, that’s what I chose to do.
And I did a PhD at Berkeley in California and that was in statistics. Well, specifically biostatistics, and I did that for four years. And then during that time I had a number of startup gigs on the side. I was interning. I was part of a couple startups during that time as well. So that’s where I would say I got most of my early industry experience. And it was a good complement to what I was learning in my PhD program, because it was a lot more applied. And then, during my PhD, I started to work with H2O, so I was working with the open source library and I started to build algorithms on top of H2O as part of my dissertation. And then basically, it just made sense after I graduated to go work there. And so that’s how I ended up at H2O. I’ve been there since April 2015.
That’s an incredible journey. And at Databricks we also love open source software. We love partnering with H2O. But now you’re a chief machine learning scientist. And I think that’s a pretty unique title in industry. You see a few chief data scientists or chief data officers. Can you explain what a chief machine learning scientist is?
Sure. I can’t say I’ve met another, so I don’t know if my description would be corroborated by other folks. But what it is in my role, is it’s a little bit different from a data scientist. I’ve had data scientist roles at the startups that I worked at before. I was a principal data scientist, and that was a little bit more accurate to what I was doing at that time, which was more applied: working with data, solving problems with data. Now, what I really do is I design algorithms. I’m more of a scientist of the machine learning itself rather than the data science. So I guess that’s a description for maybe somebody who is actually developing algorithms or platforms for machine learning. And that’s kind of a little bit of a description of what I do.
I mean, I do a lot of things at H2O. You could say, maybe I’m sort of like an internal consultant to all the different teams as well, just about machine learning algorithms. Because I just have a lot of experience. And it’s sounding weird, but I know a lot about a lot of different machine learning algorithms rather than maybe just really good at one particular thing. And I have this statistics background as well. So if there’s other types of issues that are related to that, I can help out. But my team that I work on at H2O is the H2O auto ML team. And that’s the team that I lead and I started that team. And so now my focus is generally helping out with H2O, the library in general, but more just developing the auto ML algorithm itself. And that’s what I’ve been doing for the past three or four years.
Cool. Well, actually that segues really nicely to the next question, which is what is auto ML actually? Like, can you describe a little bit who it’s targeted for? Is it for data scientists, people who can code, people who are new to ML? Yeah. Can you provide a little context around that please?
Yeah, so that’s another pretty big question. So I mean, it probably means different things to different people. But for me auto ML, I like to think about it in terms of, not one definition, but like what are some of the goals or features of auto ML versus traditional or just normal machine learning? I would say one goal is to just train the best model in the least amount of time. That’s one take on it. So we’re not always going to look for the best model based on model performance. There could be lots of other things you could prioritize like interpretability prediction, speed, that type of thing. But whatever best means for you, some combination of these different attributes. So train the best model, in sort of the least amount of time with the least amount of effort.
And so that would mean, we want to minimize the amount of computation time once you’re running the software or the algorithm. And we also want to minimize the amount of time to just set that up. So typically, when you see an auto ML platform or library or solution, it should pretty much just be like, here’s my data, here’s what I’m trying to predict. And maybe here’s what I’m trying to optimize. And then how much effort? Shall it go for an hour? Do you want to run it for five hours or … In H2O auto ML, you can also specify number of models. So I want to train 50 models, or something like that. So in terms of software, I would say any software where you don’t have to specify anything, except for those things that I’ve just mentioned, would be auto ML software.
And I think there’s lots of tools that make it easier to do, let’s say hyperparameter search, but you still maybe have to set up the space and like decide which parameters you want to tune. I would say, I wouldn’t consider that really auto ML. I would say that’s really good tooling to help you quickly do what you’re trying to do. But for me, auto ML is very much like you don’t have to say anything, you don’t have to do anything. And you could customize it if you like, but you shouldn’t have to. So when I’m deciding whether a library is an auto ML library, I would look at: do you have to know anything really about the algorithms beforehand? Or could you just sort of press a button or write one line of code?
So, that’s kind of how I would define it. And yeah, in terms of goals, I think another goal, it would just be let’s see if we can get some better performance than just a regular algorithm. I mean, if you can press a button, but it doesn’t do anything more than like a default random forest, then that’s not very useful. So hopefully you’re getting better results. Hopefully you’re possibly searching over multiple algorithms. I would consider a function that tuned like a single algorithm. If it did it all automatically, I would still consider that to be auto ML. But it would be good to have a tool that searched multiple algorithms. Because you never know what’s going to work well. And, I’ve mostly worked in tabular data.
So a lot of times the best algorithm is some kind of tree based method, like a GBM or an XGBoost or something like that. But, every once in a while it’s a GLM and you don’t know why; it’s just something with the data. Or maybe it’s a deep neural network, which generally doesn’t work as well on tabular data as tree based methods. But you never know what you’re going to get. And data can be quite different. And there’s lots of issues hidden in the data that maybe some algorithms can’t handle. So you want to have a sort of multi algorithm approach in my opinion.
Oh, this is great. So actually I’m going to scroll back near the beginning of your answer because you actually have a great quote, which is train the best model in the least amount of time and effort. And so I guess I want to segue into this question, which is, well then how do you assess what is the best model? Because you’ve called out that maybe sometimes you want to try multiple algorithms, or want to try out different things like this. But then if you’re trying to use the least amount of effort, the least amount of compute resources, how do you balance trying out lots of different models against the fact that you’re also trying to use the least amount of compute cycles, resources, effort, things of that nature?
Well, that’s the tricky part because I mean, anybody could write some kind of wrapper function that did a gigantic grid search and you could call that auto ML, but that’s not very efficient. So I think one of the difficult parts about auto ML is trying to prioritize what you think might be good at the beginning so that you don’t waste a lot of time. That’s the hard part. And this is why we see lots of different types of auto ML software that take very different approaches. Because nobody really knows; it’s like the million dollar question: how do you just know exactly what the best thing is in advance and just search that and just be done quickly? That’s a really hard question. And I do think that the auto ML field might go in the direction of maybe we can predict that using machine learning.
That would be some kind of meta learning thing. There is research. There’s a whole workshop at NeurIPS about meta learning. And I do think that the auto ML research community will go in that direction. Because auto ML tools generally are quite computationally expensive. Because you are searching a big space generally and the good ones will figure out how to do that as efficiently as possible. And so sometimes that could mean maybe doing a few experiments at the beginning to kind of get an idea of what direction might be good to head in. Or sometimes things are just a sort of predefined set of steps. I’ve seen tools like that. So, it just depends.
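To make the multi algorithm idea concrete, here is a minimal sketch using only the Python standard library. It is a toy, not how H2O AutoML works: the function name toy_automl, the three candidate learners, the max_models argument and the synthetic data are all assumptions made for illustration. It tries each candidate under a model budget and returns a leaderboard sorted by validation error.

```python
import random
import statistics

def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = statistics.fmean(y for _, y in train)
    return lambda x: mean_y

def fit_line(train):
    """Ordinary least squares on a single feature."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(train):
    """1-nearest-neighbour regression on a single feature."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

# Hypothetical candidate pool; a real tool would search far more models.
CANDIDATES = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}

def rmse(model, rows):
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5

def toy_automl(rows, max_models=3, seed=0):
    """Fit up to max_models candidates; return (name, validation RMSE) sorted best first."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(0.8 * len(rows))
    train, valid = rows[:cut], rows[cut:]
    board = []
    for name, fit in list(CANDIDATES.items())[:max_models]:
        board.append((name, rmse(fit(train), valid)))
    return sorted(board, key=lambda entry: entry[1])

# Synthetic data (made up for this sketch): y = 2x + 1 plus a little noise.
noise = random.Random(42)
data = [(i / 10, 2 * (i / 10) + 1 + noise.gauss(0, 0.1)) for i in range(100)]
leaderboard = toy_automl(data)
```

The caller supplies only the data and a budget; which learner wins is read off the leaderboard rather than decided in advance.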
So you had mentioned trying to train the best model in the least amount of time, but whenever I build a model, I need to do a lot of feature engineering, a lot of data analysis to figure out what types of models that I should do. What are the assumptions of these models? And I just want to get your thoughts on how involved should the human be in this loop? Should they be involved at each step, for example, verifying the feature engineering, if there’s any missing values, they tell it how it should be imputed, or should we just let auto ML handle everything end to end?
I really think there’s a lot of room for both approaches. So what you’re talking about is typically referred to as human in the loop auto ML. That’s kind of its own thing. And I think that that can be very useful. There’s also a use case where you just don’t want to actually have to intervene. And there’s pros and cons of both, right? You might have a situation where you’re trying to train, let’s say you have 100,000 customers and you want to have 100,000 models, one for each customer. And you want some kind of pretty much automated way to do that. That might be more suited towards the press a button type of auto ML. I will note that that’s a little bit of a dangerous thing to do. You don’t really want to just create 100,000 models and put them into production without anybody checking to see what’s going on there.
So you would also want to build some kind of layer on, let’s say in between when you finished the auto ML models, and when you deploy. Something to check that those models are not biased or are not failing on certain subsets of the data, things like that. I mean, that’s just generally good practice for deploying machine learning models. I do know that some people don’t do that. They would press a button and just deploy, but I don’t think that’s a great idea, personally. And then for human in the loop auto ML, I would say that’s more useful if you have one particular data set that you’re working on. You’re really kind of committed to that dataset. And maybe you’re willing to invest a little bit more time because if you have some new, let’s say use case, at your job or in your research and you have this one dataset, you’re probably going to want to spend more than two hours, trying to get a good model.
And so rather than just pressing a button, handing it to auto ML and being like, okay, next problem, moving on, you could spend a lot more time and get a little bit more invested and still use an auto ML function or software to kind of speed up your process. But like you said, the feature engineering is still an area that a lot of auto ML tools kind of either don’t touch at all or do very minimal things with. Certainly in the open source, that’s true. There are tools that specifically focus on feature engineering. We have one at H2O. It’s not open source, it’s called Driverless AI. So I work on the H2O auto ML, which is open source, and we do less of the feature engineering. Although, we are sort of systematically adding things in. Not anything truly new, just stuff to make sure that we don’t blow up the algorithm.
Like if there’s a 100,000 levels in a column, we don’t want to just shove that into the function and have it explode. So we’re doing some basic stuff to make sure that that doesn’t happen. Target encoding is a thing that we do. And we do other types of stuff with all the H2O algorithms, we impute missing values. We handle categorical data sort of natively. So you don’t have to do any one hot encoding yourself.
On certain algorithms like GLM or deep learning, we do a standardization on the features, things like that. So I think feature engineering is still really difficult. And there’s not that many people that know how to do it well. I think there’s a lot of people on Kaggle, the data competition platform, if you’re not familiar with that term, that are very good at feature engineering. But there’s not a lot of open source libraries where you can press a button and have it generate a whole bunch of features. So I think that’s where you’re going to see most of the effort with data science when you’re using auto ML tools: it’s still on the data side. And then probably on the other side, where you’re kind of checking things out, making sure things are what you want them to be, that there’s no sort of hidden issues, and then deploying the model.
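As an illustration of the target encoding mentioned above, here is a minimal sketch in plain Python. It is not H2O’s implementation: the function name target_encode, the smoothing scheme (shrinking each level’s mean toward the global mean) and the toy data are assumptions chosen for clarity. The idea is to replace a high-cardinality categorical column with one number per level, derived from the target.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Map each category level to a smoothed mean of the target.

    Levels with few rows are pulled toward the global mean, so a level
    seen once does not get an extreme encoding.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1
    return {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }

# Toy data, made up for this sketch.
cities = ["sf", "sf", "sf", "nyc", "nyc", "durham"]
churned = [1, 1, 0, 0, 0, 1]
encoding = target_encode(cities, churned, smoothing=2.0)
```

Because the encoding is computed from the target, in practice it should be fit on training folds only and then applied to held-out rows; otherwise it leaks label information into validation.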
Well, I know we certainly want to get into Driverless AI and at Databricks, we understand that balance. You have to play with open source versus proprietary tools. So we definitely get that. I did want to ask you, a little bit earlier, you had mentioned with auto ML, you don’t want it to just build a default random forest and you want to do something smarter than grid search. What things does it do that are smarter than just a basic grid search?
So I think, when I say a basic grid search that, I think one of the things that you can do to add value on top of that is you can look at what are you grid searching to begin with? Or maybe not, I would say, you probably never want to do grid search. You probably always want to do random search. But you still have to choose which of the hyperparameters do you want to use? What ranges do you want to search over? How much time do you want to devote to certain areas of the search space? And I think that, if we’re just going to talk about that type of thing, so, that’s …
Grid search and random search are both valuable in the sense that you can easily parallelize these techniques. Other techniques include Bayesian hyperparameter optimization. That’s a very useful technique, but you can’t really parallelize it. So it’s a different trade-off there. There’s things like Hyperband. There’s a combination of Hyperband and Bayesian optimization called BOHB, Bayesian Optimization and Hyperband. So there’s lots of different techniques for tuning, but yeah. It depends what your goals are and what your system is. If you’re willing to wait, or if you want to sort of get something quickly by using parallelization, these are all options on the table.
Oh, this is excellent. Actually, by the way, glad you called out Hyperband, because in fact, as part of Data Brew season two, we had the opportunity to actually interview Liam Li about Hyperband. So that was great. I just want to do a little shout out for him on that one.
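The point about parallelism can be sketched as follows. This is a toy under stated assumptions: validation_loss is a hypothetical stand-in for "train a model and score it", and the parameter names and ranges are invented for illustration. What it shows is that every random search trial is sampled up front and is independent of the others, so the trials can be handed to a pool of workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def validation_loss(params):
    """Hypothetical objective; lowest near learning_rate=0.1, max_depth=6."""
    return (params["learning_rate"] - 0.1) ** 2 + 0.01 * (params["max_depth"] - 6) ** 2

def sample_params(rng):
    """Draw one configuration from the (assumed) search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform over roughly [0.001, 1]
        "max_depth": rng.randint(2, 12),
    }

def random_search(n_trials=50, seed=0, workers=4):
    """Sample all trials first, then evaluate them concurrently."""
    rng = random.Random(seed)
    trials = [sample_params(rng) for _ in range(n_trials)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(validation_loss, trials))
    best = min(range(n_trials), key=losses.__getitem__)
    return trials[best], losses[best]

best_params, best_loss = random_search()
```

A Bayesian optimizer, by contrast, chooses each new configuration using the results of earlier ones, which is why it is harder to spread across workers in the same way.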
Yeah, hyperband is a great method.
Perfect. Nevertheless, I did actually want to roll it back up a little bit, and let’s talk about auto ML. We’ve been talking more on the geeky side, don’t get me wrong, always love doing that, but what do you think are some of the problems that are best solved by auto ML and maybe some of the problems that auto ML’s not suited for? I’m just curious from that standpoint.
Yeah. Well, I would say for tabular data, if your data can be structured in a tabular format, auto ML tools do a pretty good job. I think it’s a little bit harder if you’re dealing with image data or text data. I would say you can kind of divide auto ML methods into two main groups. One, everything for tabular data, and then one, everything for non tabular data. And that puts you into the more deep learning world. And auto ML in the deep learning world is more or less some kind of neural architecture search, or NAS is what people call it. There’s other things that people do as well. There has been a lot of research to make that more efficient. There’s something called ENAS, efficient NAS, and there’s been a lot of development trying to get that to kind of be a good solution, but it’s quite computationally expensive and exhaustive.
And so there’s a lot of development in that world and it’s very different from what you see in the tabular data world. So I would say, the auto ML tools that are out there right now are quite suited for tabular data, not so much for the other types of data. Although there are tools in the open source, like AutoKeras is one that will do that type of thing. And then I guess I would say you need to be aware of anything else sort of tricky with the data, such as data leakage. So that can happen, I think, if you’re working with clustered data. So an example is this data set that I was just working on with a colleague of mine in a Kaggle competition. It was called the Women in Data Science datathon (Women in Data Science is a conference), and it just ended.
So there were some issues like that that came up that you would have to kind of think about before just shoving it into an auto ML tool. So it was medical data, so we had hospital IDs and then we had ICU IDs. And so there were some issues that had to be sort of addressed thoughtfully beforehand about how do you partition the data? How do you do validation to make sure that you’re not getting biased estimates of your performance? So if you have a cluster of people, usually you keep them together in a single fold if you’re doing cross validation, so that you don’t spread out the cluster over folds. So things like that, so you have to kind of be a little bit careful. I think, yeah. It’s just good to be aware that there are tricky validation issues that can creep in, in any situation. So you might have to have some data science expertise even when you’re using these tools.
That’s actually a great segue into our next question. Since you’re talking about some of the issues that you have to keep track of, like, how do you split your data into your train, validation and test sets? How do you deal with bias in auto ML? And do you think there should be any type of regulation around combating bias in machine learning solutions?
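Keeping a cluster together in a single fold can be sketched in a few lines of plain Python. The function name group_k_fold, the greedy balancing rule and the toy hospital IDs are assumptions for illustration; scikit-learn’s GroupKFold offers the same kind of split as a library call.

```python
from collections import defaultdict

def group_k_fold(groups, n_folds=3):
    """Yield (train_idx, test_idx) pairs such that no group is split across folds."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

# Toy cluster labels, made up for this sketch (one label per row).
hospital_ids = ["A", "A", "B", "B", "B", "C", "D", "D", "E"]
splits = list(group_k_fold(hospital_ids, n_folds=3))
```

With a plain random split, rows from hospital B could land in both train and test, and the validation score would partly measure how well the model memorized that hospital rather than how well it generalizes to new ones.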
I would say, I don’t think a lot of the auto ML tools are dealing with this at all, or machine learning tools in general. Typically, we have all of those libraries that look at sort of explainability, which is sort of a prerequisite for evaluating bias or fairness in a model. Generally, those are separate libraries that you have to bring in. They’re not so much included with the … Like in scikit-learn, it’s not like there’s all this fairness stuff sort of built in. You have to bring in these other libraries and kind of do a post-hoc analysis of your models. And so that’s probably, again, like something that might change over time that like, rather than what it looks like right now is like, you just build models like you normally do, and then you take those models and then you go and use a separate set of tools to evaluate what’s going on there.
Are there certain subsets of the data where the error is worse than others? And you have to, as a human, think about, okay, this column is sort of splitting the data set into certain demographics. And maybe I need to be aware of that, and then test out using disparate impact analysis on these different subgroups and understand. So, I would also mention that it’s much easier to detect the bias than to fix it right now. We don’t have an auto unbiased tool, and if somebody says that they do, they’re probably not correct. I would not be surprised if we see lots of that. I mean, there are things that you could maybe do somewhat automatically, but I think it would be unrealistic right now, in March 2021, to say that there’s some kind of auto unbiased, like de-biasing tool.
I hope that changes. That would be extremely useful. And there is a lot of research in this field, especially in the last few years. So we’ll probably see something like that, but yeah. There are just a number of methods, but a lot of them are really experimental. And I think, everything has to be sort of evaluated on a case by case basis. You can kind of divide the approaches into, do you kind of try to fix the data? Do you try to get more samples of the subgroups that are not represented as much? Sometimes that can help fix the bias. Maybe it’s just a sampling issue? But then there’s just all these other issues that might not be good. And so this is something that I’ve been looking at, at H2O.
I gave a talk about this at the useR conference over the summer, which is like the big R conference for those of you who are not R people. And yeah. I mean, I was trying to build this demo that was like, how could we build fairness tools into auto ML? So in H2O auto ML, one of the things that you get back is this what we call leaderboard. And you see, it’s sort of a data frame that has a list of all the models and it’s sorted by default, by, like, model performance. But then you can get some other metrics, like how long was the training time? How long is the prediction, speed, things like that. But I thought, “What if we add like more columns that have to do with fairness?” And we can kind of have a flag for is there some kind of issue let’s say that we detected through disparate impact analysis with these models?
And so basically what ended up happening on this data set that I was working with, it’s called HMDA. Home Mortgage … I can’t remember what the acronym stands for. But basically it’s data collected by regulators about home loans. That’s a regulated industry. So banks have to report, these are the loans that we approved. These are the ones that we denied. And they have to report the demographic information to make sure there’s no issues. So I thought, “Oh, this will be great.” Because we can just calculate which models are unfair and like filter them out. And what happened was all the models were unfair and that was a problem. Because now we know that, and now we have all these models, which are showing that they’re discriminating against … I mean, discriminating is a loaded word, but there was a disparate impact on certain subgroups versus others for all of the models that were trained.
So if you get lucky, maybe some of your models will end up just naturally being fair, just by randomness or something like that. But I think you need to really … I think in the future, what we will be building is something that’s sort of fair from the get-go rather than training a bunch of models and hoping that something turns out to be fair, and then you just pick that one. I think we’re going to have to like really redesign machine learning algorithms from the ground up basically. And it’s going to be a big challenge, and it’s going to be a lot of work over the next few years. But I think that would be a better approach than just kind of hoping that we can sort of filter out things that are unfair.
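The idea of adding fairness columns to a leaderboard can be sketched like this. Everything here is hypothetical: the function names, the model IDs, the AUC values and the predictions are invented, and the 0.8 cutoff is the commonly cited four-fifths rule of thumb rather than anything stated in the episode. The sketch computes an adverse impact ratio per model (approval rate of one group divided by that of a reference group) and flags models that fall below the threshold.

```python
def approval_rate(predictions, groups, group):
    """Share of rows in `group` that the model approves (prediction == 1)."""
    picked = [p for p, g in zip(predictions, groups) if g == group]
    return sum(picked) / len(picked)

def adverse_impact_ratio(predictions, groups, protected, reference):
    """Approval rate of the protected group divided by that of the reference group."""
    return (approval_rate(predictions, groups, protected)
            / approval_rate(predictions, groups, reference))

def add_fairness_columns(leaderboard, predictions_by_model, groups,
                         protected, reference, threshold=0.8):
    """Return leaderboard rows extended with an impact ratio and a flag."""
    rows = []
    for row in leaderboard:
        ratio = adverse_impact_ratio(
            predictions_by_model[row["model_id"]], groups, protected, reference)
        rows.append({**row, "impact_ratio": ratio, "flagged": ratio < threshold})
    return rows

# Made-up example: five applicants in each of two groups.
groups = ["ref"] * 5 + ["prot"] * 5
predictions_by_model = {
    "model_1": [1, 1, 1, 1, 0, 1, 1, 0, 0, 0],  # approves 4/5 of ref, 2/5 of prot
    "model_2": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0],  # approves 3/5 of each group
}
leaderboard = [{"model_id": "model_1", "auc": 0.91},
               {"model_id": "model_2", "auc": 0.88}]
rows = add_fairness_columns(leaderboard, predictions_by_model, groups,
                            protected="prot", reference="ref")
```

As the discussion above notes, a column like this only detects a disparity; if every model on the leaderboard is flagged, filtering leaves nothing to pick, and the fix has to come from the data or the training method itself.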
You have a really good point about how these fairness tools are decoupled from the core library. It’s like if you’re building a scikit-learn model and now I want to evaluate how it did, get my RMSE. Oops. I need another package for that. And so I can definitely tell that you’re very passionate about the area of fairness. And I’m curious, kind of switching gears a little bit, instead of focusing on open-source tech, focusing on the non-profit that you created, Women in Machine Learning and Data Science. And I’m curious what inspired you to create this nonprofit and what excites you most about this organization?
Yeah, thanks for bringing that up. So yeah, the organization is called Women in Machine Learning and Data Science. WIMLDS.org, W I M L D S.org. And I started it in 2013. That was when I was a PhD student. And I had been to the Women in Machine Learning workshop at NeurIPS; I think I went in 2012 and then 2013. And after the first one, I was like, “This is amazing.” Like, all these women. I go to all these meetups in San Francisco all the time, and there’s like two women, and it’s kind of like, I wish that wasn’t the case. So could we maybe try to make that better? So I got inspired by the Women in Machine Learning workshop at NeurIPS. But that’s only sort of one day a year, and you have to go to NeurIPS. It’s very focused on more like people in academia.
There are a lot of grad students that go and some professors. So I felt like, could we take this idea and apply it to the traditional Bay Area machine learning meetup thing? And I had been part of a meetup before that, in I think it was 2009. I used to go to this hackerspace in San Francisco called Noisebridge and we started … Or I didn’t start it, but I ended up co-organizing the machine learning meetup there. I think it’s actually one of the first meetups in the world on machine learning. I’ve tried to figure out if that’s the case, but it was quite early; 2008 is when it started. And I started organizing it in 2009. So I had some meetup experience of organizing and … But that was a pretty small group.
So I thought, let’s just take this idea and make it into a normal Bay Area tech meetup thing. And so that’s what I did. And then, if you sign up on Meetup, they kind of do your promotion for you. They send out … they use some recommendation engine to find people that might be interested. And so, basically, I just put it out there and people started to come. And then about a year later, somebody contacted me in New York and said, “Hey, I noticed you had this meetup. It sounds really cool. Like I would like to do that here in New York.” And I was like, “That’s a great idea.” So then they started the New York chapter, and then about a year later, somebody in North Carolina, kind of in the Research Triangle area, was like, “Hey, I saw your meetup. Like, could we create a chapter here?”
And then like, we got like one a year, for like three or four years. And then I don’t know when it started, I would say the inflection point let’s say would be like 2018. I think we like doubled the number of meetups. And so we now have around a hundred chapters all over the world. And so it’s all been very organic. I don’t go try to find people to start meetups. We’re pretty vocal on Twitter. We’re at W I M L D S on Twitter. I think that’s where people hear about us. I’m not sure. But anyway, somehow people find our website and we have a little sort of starter kit that helps people get their meetups started. We set up their meetups and provide the infrastructure and we’re non-profit as well.
So yeah, so that’s what we do. We try to keep our expenses quite low. Meetup is our main expense. It’s quite expensive, and not a very good value for your money, but what are you going to do? We’re all beholden to Meetup until somebody creates a better sort of open source version of that. So yeah, basically we just beg people for money to pay our Meetup fees and then we survive. That’s kind of how it is. And I have a whole group of people helping me with that. And all of the people that run the chapters pretty much do all the work for their own chapter. So yeah, it’s a really nice community.
Well, I actually found the WIMLDS group through a meetup, so I am glad that you paid those fees. Otherwise I wouldn’t have found the meetup and I wouldn’t have met you. Erin and I actually go back a while, to when Databricks co-hosted the Women in Machine Learning and Data Science meetup in San Francisco, which was a great event. Luckily Databricks sponsored that one, so we didn’t need any extra money for food or any of the events.
Which works well for us, to have these companies. We don’t have to get into very long complicated partnerships. It’s just like, “Hey, do you have female or non-binary data scientists at your company that want to talk about what they do, or maybe even just talk about something technical?” And then they host us. And I mean, now we’re all online, so it’s different, but we’ll get back to that, eventually. I think.
I do want to do a shout out in case anybody’s interested in hosting their group. It is a great recruiting tool, and it doesn’t cost you anything more than the cost of food. So we definitely had quite a few women apply to Databricks after that. So I just want to say thank you again for letting us host your group.
Yeah. And I think that’s the value that companies get out of it: they get exposure to a bunch of women who are data scientists, and lots of them are quite experienced. There are a lot of people that are sort of new to data science as well that come to our meetups as a way to get into the community and meet people. But yeah, it’s really a win-win situation, I think.
All right. I’m going to go ahead and close out our session since I realize we’re at the top of the hour right now, but I wanted to say thank you again for joining us, Erin for sharing your expertise on auto ML and discussing how you got into the field of machine learning and the women in machine learning and data science meetup.
Thanks a lot. Thanks for having me.