This session is part of the “Managing the Machine Learning (ML) Lifecycle Using MLflow” series with Jules Damji.
This workshop covers how to use MLflow Tracking to record and query experiments: code, data, config, and results.
Suggested Prep Work:
To get the most out of this series, please review and complete the prep work here: https://github.com/dmatrix/mlflow-workshop-part-1#prerequisites
– All right. So thanks everyone for joining us for this workshop series. This is a three-part workshop series about managing the complete machine learning lifecycle with MLflow. So this is the first of three workshops. This is gonna be your introduction to MLflow and how to use MLflow Tracking.
Okay, so there’s a call out there, all videos will be available after we’ve broadcast live. That’s the YouTube link there; when I’m done with my few slides, I’ll drop that in the chat. And then also, if you’re on YouTube, I update the video descriptions to include the links to the other workshops. So I’ll get that up as soon as we have the content ready. So this is part one. Part two is gonna be next week, on next Wednesday; that’s gonna be understanding MLflow projects and models. And then part three is gonna be model registry workflows explained; that’s gonna be in two weeks, on the Wednesday as well. And the links to RSVP to both those sessions are in the Data + AI meetup group.
And then also, we put notifications out on YouTube. So I encourage you to subscribe and turn on notifications in YouTube if you’d like to join from there. So, here are the links again as well; I’ll drop those in the chat when I’m done. A few things I wanna call out for the session. If you’re joining us on Zoom, we have the chat feature and we also have the Q&A function. Chat works really nicely if you have any audio issues — I can try to help you troubleshoot — or if there are any general comments you’d like to make. Please drop all of your questions in Q&A. We have a few TAs joining us today who will help you troubleshoot and answer any questions, and it’s a really nice feature for them to moderate and help you out there. So I encourage you to drop your questions in there. And then, like I mentioned, all these workshops are gonna be recorded and they’ll be available on YouTube afterward. And everyone who joins in Zoom will get a follow-up email within 24 hours with all the links to the resources, links to upcoming workshops, and all that good stuff. So if you have to jump off early, or you’d like to revisit the session and work through things at a later date, no problem — we’ll have all the content and all the resources available for you. So with that, I’d love to take just a quick second and have our TAs and instructor introduce themselves really quickly.
So why don’t we start with our TAs? So Lan, would you mind introducing yourself? – Hi, my name is Lan Jiang. I’m a regional lead for the Resident Solutions Architects, living in Chicago. Happy to be here to support this webinar. – Awesome, thanks for joining us. Sean?
– Hi there, this is Sean Owen based in Austin, Texas as you can see. I’m a solutions architect here and I help lead the machine learning and data science for our solutions architects. – Thanks for joining us, Sean. And Amir.
– Hi, this is Amir. I’m a Solutions Architect based in Ottawa, Canada and I support our MR facts. – Great, thanks for joining us, happy to have you. So with that, I’d like to pass it over to our instructor today. Jules, take it away. – Okay, I’m gonna start sharing my screen. – Perfect.
– Can you see it?
Can you guys see my slides?
– Yeah, we’re seeing it. – Alright, you can hear it, brilliant. Well, welcome everyone. We have a global audience — good morning, good afternoon, good evening, wherever you are. My name is Jules Damji and I am a developer advocate here at Databricks, and I’m really delighted to share with you what MLflow is all about: a platform for the complete machine learning lifecycle. So with that in mind, I’m just gonna go straight into the material. We have quite a few things to cover today. So this is the outline of what I’m gonna cover today.
Today’s session is gonna be focused on MLflow Tracking. But before I do that, I wanna set the context: what are some of the machine learning development challenges that most of you, as data scientists and machine learning developers, face today? This is not something new — it’s sort of historical — but a lot of things have evolved since then. And I’m gonna share how MLflow tackles these particular problems, and what were the motivations, the concepts, and the philosophy behind tackling those issues. We’ll briefly talk about the four MLflow components that comprise the platform and how they handle each of the challenges. But in particular today, we’re gonna focus on MLflow Tracking and how you can use the fluent APIs available to you, so you can infuse your existing machine learning code, or your machine learning algorithms, with this tracking. Now, for today’s session, we’re gonna use the free Databricks Community Edition, and within it we’ll share how you can use the MLflow UI to explore things. There are a lot of tutorials I’ve compiled for this particular session, and there’s no way, in the little time we actually have, that we’ll be able to go through all of them. So I’ll walk through probably one, but the material is there for you, and you can use it later on tonight — if you wanna use it as a sleeping aid, it might help you go to sleep. And then if there’s time left, we’ll do some Q&A. Now, the URL that you see on your screen is the GitHub repo. You do want to clone this repo into your directory, or go to it to find the instructions for how to sign up for Community Edition. So I’m gonna leave that up for five seconds. By the way, if you go to that URL, there’s a directory called slides, so if you wanna follow along with the slides, you’ll be able to do that. So I’ll leave that for maybe five seconds — cut and paste it into your browser, go to the directory, and clone the repo. And then we’ll move on.
– [Karen] So I just dropped the link in the chat. – Okay, brilliant, thanks a lot. – [Karen] So, hopefully yeah. – All right. So, let me start by making an assertion that machine learning development is really complex. And the complexity doesn’t necessarily stem from the fact that the theory behind machine learning is difficult, or the math is difficult, or the algorithms are difficult. Although there’s some element of complexity there, it doesn’t really stem from the languages, the APIs, and the frameworks that a lot of you use — those are quite simple. In fact, today’s frameworks have abstracted away all the algorithms, so you just create an instance of a particular class and provide simple default parameters that give you a pretty good baseline model to work with. The complexity really stems from a couple of things. One is the evolution of the model from its conception all the way down to the final stages of its deployment. And the second problem is that the paradigm for traditional software development is different from how we actually do machine learning development. So let’s look at each of those two and see where the problems actually stem from, and where the complexity strikes. So let’s look at the first one.
If you look at traditional software development versus the machine learning development cycle, the goal is very different, right? If you come from a purely software engineering background, the goal is that you’re given a functional specification and you write code that meets that specification. That’s slightly different if you contrast it with machine learning. The machine learning goal is that you have a metric that you wanna optimize, and you wanna get that metric to the point where you feel reasonably confident that the accuracy is good enough for the model’s predictions, or whatever classification it’s doing. And that’s not achieved just one time; it’s achieved through constantly experimenting and improving. Second, the quality of the code also depends on how well your software is deployed out there, and what the SLAs are if you’re running it as a service. Whereas quality in machine learning doesn’t depend only on the code itself — you could write very good code — but also on the input data and the tuning parameters. The more data you have, the better the chances that your model will generalize, so you can do good predictions on unseen data. A lot depends on the tuning parameters, and this step is very iterative: this whole idea of experimenting over and over again to move the needle so that you get closer to the accuracy you want. The third difference is that typically, when you’re dealing with a software development cycle, your stack is kind of limited. You have a few libraries that you deal with, which are standard; they don’t change much, except at major revisions, and the tools you use are also very limited. You might have maybe one or two databases and a few languages that your infrastructure supports, and it’s limited in terms of how many environments you’re gonna deploy to. That’s the traditional software side. Whereas with machine learning, you actually wanna compare and combine many libraries. In other words, you want the best libraries available today, the most recent ones, that address the business problem you’re trying to solve — which goes back to number one, the metric you’re optimizing. And there are diverse environments in which you want to do that. So that’s problem number one: the paradigm is different. Traditional software development is different from machine learning development. The second is that if you look at the proverbial, quintessential diagram of the machine learning cycle, these are the different stages.
There might be other stages in between, but these are sort of the paramount stages. Now, if you look at each and every stage, there’s a different set of tools that you might wanna use. If you’re, for example, doing data preparation, you might be using ETL, you might be using Spark SQL. You might be using scikit-learn or pandas to do your EDA. You might use Python, you might use Java, or you might use Scala. So you have various tools that you’re using for data preparation. If you’re training, you probably wanna have, as I said in the previous slide, more than one library, and compare and contrast them to create a model that gives you the best metrics. So you might use, say, scikit-learn for the baseline model, and then use XGBoost as an expanded model, and then compare and contrast the two. So that’s problem number one: you have all these different tools available at different stages, and each has its own requirements. The second thing is tuning. Each of these different tools will have its own tuning parameters that you wanna capture during that stage of the model’s development. Data preparation will have all these libraries and configurations that you wanna tune. But more imperative is the tuning that comes in the training of the model. As I said, it’s an iterative process, and you’re gonna have loads of hyperparameter tuning available to you. So you do have to do that experimentally, and you have to worry about how you capture it, to make sure that the eventual model that comes out is the best model — the one that gives you the best metric — and has captured the essence of all the tuning parameters that you’ve explored over the course of your experimentation. The third problem is that you have to worry about scale.
In today’s data and ML era, we are dealing with voluminous amounts of data, so you have to deal with scale, and each and every stage has its own level of complexity. So that’s the third issue in this particular cycle. And the fourth one is: how do you ensure that the model exchanged between training and deployment is the same model — that the model that gave you the best metric is the same model you hand off for deployment, and the same model you use when you’re doing model monitoring? How do you ensure that reproducibility between training and deployment, to make sure that what you train is what you actually deploy, and the tuning parameters you used are the same ones? The weights and coefficients your algorithm computed in training should be no different once something is actually deployed. So that’s an important part of it. And the fifth one: obviously, today we are living in a world of data and privacy, and we need a good amount of governance and provenance. We want the model to be interpretable; we want the model to be able to tell us how it evolved, who used it, when it was used, and so on and so forth. So all these kinds of things pose an interesting challenge. Now, you’re probably asking the question: Jules, what’s the problem? If this is such a predicament, what are the big data companies doing today — especially those who deploy models on a daily basis, on an hourly basis, on a weekly basis, and produce models at the scale of hundreds?
How do they do that? It turns out they have done a fairly good job standardizing some of the data preparation aspects, and they’ve done a fairly good job on the training and deployment loop. And it’s great if you work for one of these companies and can use those platforms. The problem is that some of the frameworks and algorithms you might wanna use can be limited — in other words, they’ll be catered to the business problems those big companies are solving. So they might only support TensorFlow or PyTorch, and maybe scikit-learn, and not any of the other tools. And they’re very much tied to the company’s infrastructure. So if you move out and go somewhere else, if you leave the company, you pretty much lose that intellectual property. And so the team at Databricks, when we were thinking about this, thought about: what is it that we can do? What can we learn and borrow from the standards that these companies have used, but do it in a very open manner? We wanted to do it in a manner where you can use any libraries that you want, or the common libraries that you want, but do it openly, so that you can go from one company to another. You can take that intellectual property with you — the knowledge that you have — and it becomes functional and transferable.
And so this is what we came up with. We said we’re gonna create this particular open source product, MLflow, and we’re gonna allow popular machine learning libraries and languages to be part of this particular platform.
We wanted people to have the ability to run things the same way they created the model locally and in the cloud. In other words, the models they produce are easily transferable and exchangeable. In particular, we wanted to make sure that it was useful for one user or N+1 users, for a small organization or for a large organization. But more important, I think the imperative was that it had to be simple. It had to be modular — in other words, we didn’t want to create this monolithic thing. It had to be easy to use, and the focus was on the best developer experience. And when you provide those ingredients in any particular platform, you see the efficacy of it. There’s an inverse relationship between fruition and friction: the more friction you have, the less fruition you’ll get. So we wanted to make sure the friction to get started was minimal. So how did we accomplish that?
It was basically based on two principles in software engineering, and you can go back historically to see how some of the open source platforms have evolved, and how even some of the old platforms became very successful because of their adoption. First is a ground-up, API-first design. You first identify what the functions are — what are the things people wanna do with the platform — and expose them as a set of APIs. So we asked, what do people wanna do? They wanna submit runs. So we’re gonna provide APIs to log runs and to log the metrics, and those should work across all the popular libraries and languages. Then we asked: if you want to deploy these models in diverse environments, what can we do? What if we abstract the model and think of it as a lambda function, so that you can take it, put it in a diverse environment, and invoke a predict function on it? That gives you the ability to take those models and put them into diverse environments. And the important part was to make that interface open. So if there’s a new framework with algorithms that we wanna use, it just has to write to that particular interface, and then it becomes part of the MLflow API. So API first, and then layers built around it: low-level and high-level programming APIs, with REST APIs and a CLI. And you don’t have to look very far from today. If you go far back to how the Unix and C interface was first introduced, it was a brilliant idea. The C interface was so well documented, so clean and declarative, that it allowed you to build application programs on top of it, and system programs like load managers, window managers, and so forth. Java is another example: the JVM and Java APIs first gave you the ability to write all these very complicated network-based algorithms and network-based platforms. And so that was the whole idea behind API first. The second thing was that we wanted to make this modular. We didn’t want it to be monolithic, where you use everything or nothing. We wanted you to have the ability to choose one aspect of the MLflow platform and not necessarily use the others. For example, you can use MLflow Tracking if you just want to track things, and not worry about using it for the Model Registry, and not worry about using it for Projects. So we wanted it to be distinct and selective. And the way we did that was to keep the components modular and specific in function, but you can use them together or you can use them independently. And I think those were the principles and the philosophy behind MLflow. And so what resulted was these four components that do four distinct things, but still work together and can still work independently.
So I can use MLflow Tracking to do my tracking. I can use Projects to package my data science code in a manner that I can exchange with people and reproduce, so that however I develop on my machine, or in the cloud, it’s gonna run the same way anywhere else. That was the whole idea behind Projects. And Models is a way to package the model in a standard way so that we can deploy it in diverse environments. And then finally, the newest component the engineers built is the Model Registry, and the registry is nothing but a central repository where you can store, annotate, and describe models in a central place. Think of the Model Registry like a GitHub within an organization, where people check stuff in and people check stuff out, and you can have different versions of the model in different stages of development. And so, collectively, that gives you the idea of how you can use this across your entire development cycle. So let’s talk about the first component, which is the focus of the presentation today: MLflow Tracking. What is MLflow Tracking? As I said, it allows you to record and query experiments in code. Now, tracking is not a panacea — we’ve been tracking and logging things since the medieval ages and the barter system; we wrote down what people owed, what accounts, how much money we owe. So tracking is not something new. So how does tracking work in MLflow? Before I get into it, I wanted to introduce you to, and set the context of, some of the concepts behind tracking.
So what are some of the entities that we actually track? In the parlance of machine learning or data science, parameters are an important part of tracking. These are the values that you supply as input to create a particular model. Parameters could be your learning rate; in a neural network, it could be the number of filters you’re using, or what loss function you’re using. How deep is your tree? How many random forest estimators are you using? These are the parameters that you use in order to build your model, and we wanna capture them because it helps us figure out what parameters we used to create the metrics, which is the second entity. So the input parameters that we track give you the metrics, and the metrics are the values you’re going to use to judge and ascertain whether your model is good or not. Tags and notes are the other things that we capture as entity data. Tags are good for annotations: if you want to use the search API, you can search through the entire tracking data to see which models are associated with a particular tag. You can think of a tag as something very similar to how you tag your software releases. The idea is the same.
And artifacts are the things that you might produce, or that you might have as input. These are the files that you might produce in your training. For example, you might create some plots that you want to capture and associate with the outcome of your run. Data could be the path to your input data, validation data, and so on and so forth. And models are normally the outcome of what you have trained — serialized versions of the flavor of the model that you used. And source is obviously what ran: was it a shell script, was it Java code, was it Python code, and which Git version of it. And the important part is the run. A run, within the context of MLflow, is an instance whereby you take your parameters, log the metrics, create a particular model, do the prediction, and then get the output. And an experiment is a collection, or a set, of different runs — you can create an experiment and have several runs in it. So that’s the gist behind the concepts in MLflow tracking. So how does it actually do that?
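To make a couple of those entities concrete before moving on, here’s a small, hedged sketch using the fluent API — the run name, tag, and file name are made up for illustration, not taken from the workshop code:

```python
import mlflow
import matplotlib.pyplot as plt

with mlflow.start_run(run_name="entity-demo"):      # a run groups all the entities below
    mlflow.set_tag("release", "candidate-1")        # tag: a searchable annotation on the run

    # Produce a plot and attach it to the run as a file artifact.
    plt.plot([1, 2, 3], [10, 7, 5])
    plt.savefig("residuals.png")
    mlflow.log_artifact("residuals.png")
```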
But before I do that, let’s look at what the world would be like without MLflow tracking. Now, here’s pseudo code. Think of it as, say, scikit-learn code, and you don’t have MLflow tracking available, so you resort back to the old traditional way of tracking — which is what? Using a log file, or putting things in a database, or creating a JSON file and then putting the JSON file away. So it’s not that MLflow is a panacea for that, but MLflow makes things a lot easier. If you didn’t have MLflow, you would do something very similar to this: you load a particular file, you extract the n-grams, you train your model, you compute the accuracy, and then you printf to a particular log file. And then you pickle the model, and you keep doing that over and over again. Now the problem comes in when your input data changes. Now you have larger data, so you have to start all over again, and now you create a different version of the log file. What about when your tuning parameters change? Your n-grams change from two to four, to six, to eight, and now you have to rerun those experiments. Or your learning rate changes. Or your library changes. All of these mean that you have to keep track of things. And then finally, what about the code version? If the code version changes, now you’ve got to worry about which version of the code was running. So if you take all these problems to their logical conclusion, what you end up with is a rather intractable solution. And so the idea was: what if we give data scientists an MLflow tracking API that they can just infuse into their machine learning code? This is not something new — it’s just something you use to log those metrics as you would anyway, but in a very simple and very intuitive manner. So what if we take the same particular program and give you MLflow tracking with the fluent Python API, which does exactly the same thing you were doing, but in a more manageable and more scalable way?
And so it’s the same code that you were using, but the only thing you’ve added is the import of MLflow. You start that particular run with the tracking server and say: okay, these are the parameters, these are the entities that I care about for my experiment — go ahead and log them. So you can run this hundreds of times, you can run it thousands of times, and every time you create a run — every time you start a run within this Python context manager — you have a session with the tracking server, and it logs all of that on the tracking server. And the benefit of doing the tracking is that the MLflow UI, which you launch from the CLI, gives you the ability to look at each and every run and go deeper into exactly what the run constitutes: what the parameters were, what some of the artifacts you created were, and how you can compare several runs side by side to see what the metrics look like. This gives you a more systematic, more methodical way of tracking your experiments, of tracking your runs, and then visualizing them to see which is the best metric — which run produced the best metric for me — so that I can choose that model to go to the next stage. So that’s what tracking is all about. Now let’s dive a little deeper into what tracking looks like.
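As a hedged, minimal sketch of what that fluent-API version can look like — the dataset, model, and hyperparameters here are illustrative stand-ins, not the exact code on the slide:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset, just to make the sketch runnable end to end.
X = np.random.rand(200, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():                          # one tracked run per experiment iteration
    params = {"C": 1.0, "max_iter": 100}          # illustrative hyperparameters
    model = LogisticRegression(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                     # instead of printf-ing to a log file
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")      # instead of pickling the model by hand
```

Every time this block runs with different parameters, a new run shows up on the tracking server instead of yet another hand-rolled log file.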
You can think of everything on the left-hand side over here as the producers. Your notebooks could be the ones running your ML algorithm, and within that they do tracking using the Python, Java, R, or REST API. Well, we don’t have Java notebooks, but you can use local applications to do the tracking, or you can run them from the cloud. And then you send everything to the tracking server. The way you specify where the tracking server is is either through an environment variable, if you’re running a local job, or you set it programmatically by saying: here’s my tracking URI, running at this particular location and this particular port — go ahead and start logging stuff over there. And the entities on the right-hand side you can think of as the consumers of the tracking data. There’s a UI available for you, which you saw in that animation, that extracts the information and presents it in a digestible manner — in a manner that you can look at visually and then compare and contrast. You have APIs available — the same APIs that you use to write — that you can use to consume the data, and you can use those APIs to convert your metrics into data sources like a DataFrame and then store them in your Spark data sources. So that’s the very high level of what tracking is all about. Going a little further, as I said, tracking is a client that talks to the server over HTTP, and we use protocol buffers to send all the data back and forth. The tracking server is completely stateless, which means you can put it behind a load balancer and have multiple instances, as long as they write to the same SQLAlchemy-compatible database in the backend. So all of the entities are stored in the database — all the runs, the parameters, and all that is stored in the entity store. And the artifacts, such as models and files, can be stored on the local file system, or they can be stored in cloud storage. And the tracking server is backed by the entity store, which, like I said, if you’re running on the local file system, is really a directory called mlruns.
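To sketch both sides of that picture — a producer pointing at a tracking server and a consumer pulling runs back out — something like the following works; the server URI and experiment name are placeholders, not the workshop’s actual setup:

```python
import mlflow

# Producer side: point the client at a tracking server.
# Equivalently, export the MLFLOW_TRACKING_URI environment variable.
mlflow.set_tracking_uri("http://localhost:5000")   # placeholder host and port
mlflow.set_experiment("gas-consumption")           # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 4)
    mlflow.log_metric("rmse", 51.3)

# Consumer side: query the same server and get the runs back as a pandas DataFrame.
runs_df = mlflow.search_runs(order_by=["metrics.rmse ASC"])
print(runs_df[["run_id", "params.max_depth", "metrics.rmse"]].head())
```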
Underneath, you will have a directory structure that has experiments; under that you have run IDs, and then you have all the parameters that were stored. Starting with 1.7.2, we can store the local runs in an SQLAlchemy-compatible database, so you can use Postgres, you can use MySQL, and you can use SQLite. And if you don’t like any of these, there is the ability for you to use what we call MLflow plugins, which were a community contribution that allows you to customize your entity stores. If you don’t like any of these, you can write your own metastore and then use the plugin APIs as an interface to log everything to the customized metastore that you want. Now, on the managed Databricks MLflow, we actually use MySQL on AWS and Azure. And then the artifact store can be any one of these options. On the local file system, it just becomes a local mlruns folder where we put all the artifacts and all the entities, all the metadata. In the cloud, it could be backed by S3, on Azure by Blob storage, and on Google by GCS. And there’s DBFS for artifacts if you’re part of Databricks — so if you’re not using Databricks, you don’t need to worry about that; you can use any of these.
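As a hedged example of what that backend configuration can look like when you run the open source tracking server yourself — the database file, artifact directory, and port below are placeholders, not the workshop’s setup:

```python
# One way to stand up a tracking server with a SQLite entity store and a local
# directory as the artifact store (run this from a shell, not from Python):
#
#   mlflow server \
#       --backend-store-uri sqlite:///mlflow.db \
#       --default-artifact-root ./mlruns-artifacts \
#       --host 0.0.0.0 --port 5000
#
# Clients then point at that server before they start logging:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # placeholder URI for the server above
```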
The MLflow tracking API comprises three sets of APIs.
You’ve got, at the very high level, very pythonic fluent APIs, which give you high-level operations to run experiments and log to them. And the model flavor APIs allow you to log a particular model or load a particular model.
Those are sort of fluent APIs that are very pythonic in nature.
And what I mean by pythonic in nature
is that they follow the Python idioms, whereby they might have an iterator or be iterable, or you might be able to call len() on them, or they might have a hash function. So they’re very pythonic in nature. Those are the high-level Python operations that you’d use. And then you have the lower-level client APIs, which give you more control over the creation, removal, update, and deletion of those particular experiments and runs, and you have the MlflowClient in the tracking module that you use for that. And then finally, we have the REST APIs: if you don’t wanna use any of those, you can go directly to the REST APIs. But the progression is that the high-level API underneath calls the MLflow client, which then uses the REST API to talk to the tracking server. So that, in essence, for today is what tracking is all about. And here’s pretty much the summary of what I talked about.
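Before moving on, here’s a minimal, hedged sketch of that lower-level client API — the experiment name and values are hypothetical, just to show the shape of the calls:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()  # talks to the tracking server via the REST API underneath

# Lower-level, explicit control over experiments and runs.
exp_id = client.create_experiment("gas-consumption-experiments")   # hypothetical name
run = client.create_run(exp_id)

client.log_param(run.info.run_id, "n_estimators", 100)
client.log_metric(run.info.run_id, "rmse", 48.7)
client.set_terminated(run.info.run_id)

# Query back what was logged.
finished = client.get_run(run.info.run_id)
print(finished.data.params, finished.data.metrics)
```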
And I think as we go to the next sessions, you will get a better idea of how the other components work together. So it’s very modular, in a fashion that simplifies some of the challenges that we observed in the lifecycle. And importantly, it’s very easy to install — you can just use pip install — and it provides a good developer experience. We’ll see that when we go through one of the tutorials. The idea was that we wanted the ability to do things locally as well as remotely. An important part was giving you the Python fluent API, as well as Java and R — a Scala client is coming soon. And far more important is that once you run a thousand experiments, you want to be able to visualize them in a methodical way to compare the runs. And then finally, once you have the model that you care about, you can register it with the Model Registry, the central place in your organization, so that other people can discover the models, check out the models, test the models, inspect the models. And this comes in really handy when you’re working with, say, Jenkins, or with CI/CD tools, where the APIs can check a particular model out of the model registry, check its existence — whether it’s in staging or development — and run tests on it. And once they’re ready for production, once they’ve done the A/B testing and they’re happy with the metric, they can change the stage to production and then hand it over to the MLOps folks who can take it and deploy it. So that, in essence, is MLflow Tracking. Next week, we’re gonna talk about the two other components, and then you’ll see how they all tie together. And today we’re very proud: 200 contributors in the community. This is sort of apples and oranges, because Spark was different back in the day — it took Spark a few years to get to that level of contributors, whereas with MLflow it was quite easy to get there. So the community is very much participating and contributing, and we would love for you to start using it, start contributing, and help us put in the missing elements that you find in the open source. And it’s very easy to get started: pip install on your local machine.
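Circling back to that registry hand-off for a moment — part three covers it properly, but a hedged sketch of the workflow might look like this, with the model name, stage, and run URI all illustrative (the `<run_id>` is a placeholder you’d fill in):

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model logged by a run under a central name.
result = mlflow.register_model("runs:/<run_id>/random-forest-model", "GasConsumptionRF")

# A CI/CD job (Jenkins, etc.) could later look the model up, test it, and promote it.
client.transition_model_version_stage(
    name="GasConsumptionRF",
    version=result.version,
    stage="Production",   # e.g. None -> Staging -> Production
)
```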
The docs are available on mlflow.org. Perusing the code examples on GitHub is a good way to get started. We have a Slack channel that you can join. And there are tons and tons of tutorials we have written for you to get started, so there’s ample material. And now let’s go on to the fun stuff. Here are the tutorials. I want you to go to the particular GitHub repo and sign up for Community Edition. Can I get a vote — how many people have already signed up for Community Edition?
I can go straight into it.
Okay, let’s do it.
Can everybody see my screen?
– [Karen] Yes, we can see it. – Okay. Now I’m assuming that you actually have signed up for community edition and you have logged in. And I’m gonna give you maybe a few more seconds to make sure that you’re already logged in. And I’ll show you exactly what are the next steps.
So once you log in, you’ll see this particular screen — you’ll see this once you log into Community Edition. Everything on the left-hand side is a navigation bar that gives you a high-level view of what shows up on the right-hand side. So I can go to my home, which is my workspace. And when you go to the workspace, you’ll see your login credentials over here — that’s the login you used to sign up. Click on that and you’ll have your workspace, and the workspace is nothing but a folder of directories that you create. And so over here, what I want you to do now is look at the pull-down menu and click on import. When you do an import, you can browse and import the MLflow DBC file that you downloaded from the GitHub repo, or point it to the notebooks directory that has the DBC file. And when you click on it and open it, it’s gonna give you this particular dialog.
When you hit import, it’s gonna import that particular MLflow DBC file and create a particular folder in the workspace for you. So I’m gonna leave that up and see how you’re doing with it.
How are we doing on the chat?
Done it?
– [Karen] Let’s see. So I don’t have any replies yet in the chat. Yes, imported, all good.
You folks have it. – Okay, once you import it, what you will get is this particular directory, and now you’ll have your three notebooks in there. Now, we won’t get a chance to go through all of them — as I say, it takes a little while to go through each and every one. I have a lot of extras in here for you, so there’s gonna be a lot for you to do, and each notebook has a homework assignment at the end. So feel free, in any free time you have this evening or during the week, to go through those. You have this Community Edition available for you, so you can use it at any time. So when I click on this, I go to my particular notebook, and the notebook is now loaded. The next thing you’re gonna do is create a particular cluster, so you go to this particular tab. – [Karen] Sorry Jules, just to interrupt. Can you show back what the import URL is?
– So for the import, when you’re doing the import, I can either browse locally, or I can go to my GitHub.
I have my own GitHub repos here, I have my Git clones. And when you clone that, I’ll have the part-one repo where you cloned it. I go into my notebooks directory, I click on my DBC, and I open it.
Did you get it? – [Karen] Yep, I think so. That’s helpful, thank you. – All right. So once you’ve imported it, this particular folder will be created in your workspace. And the next thing you wanna do is create a cluster, and the way you create a cluster in Databricks is you go to this particular tab over here, you click on Clusters, and it gives you this particular tab. For the purpose of expediency, I already created my cluster. And you go in and create a particular cluster — type in your particular name, you know, my workshop.
And make sure that when you choose the Databricks Runtime version, you want to get the 6.5 ML runtime. The reason you want the 6.5 ML runtime is that it includes all the libraries — everything is installed for you. So you don’t have to install MLflow, you don’t have to install scikit-learn, you don’t need to install TensorFlow or XGBoost or whatever you need. So it’s much easier for you that way. So you select that particular runtime, and then you hit the create button. The reason my create button over here is grayed out is because I already have created the cluster. When you create the cluster, it will start spinning up, and you’re sharing with maybe 10 other people in the same container, so it might be a bit slow, but we’ll work through it — don’t worry about it. You have this cluster available for you anytime, so you can work on this later. When you create the cluster, it will start spinning, and once the cluster is created, the spinning will turn into a solid green ball, and that way you know that you have a cluster. So I’ll go back to my workspace, I’ll go back to my notebook, and once the cluster is created, you pull this down and just attach to the cluster. You will see this: I’ve created my cluster called MLflow workshop, and I’ll just attach to that. And now I’ll just run one plus one. Now, there are various ways you can navigate around notebooks. If you’re not used to notebooks, I’ll just give you a brief introduction to what you can do.
So this particular keyboard over here — the little keyboard symbol — actually has all the different commands you can use in edit mode and command mode. But for the purpose of navigating, it’s a lot easier to just click on a particular cell. And there are two ways to execute a cell — three ways, rather. One is you can go to this right-hand button, pull it down, and say run the cell, or run everything above, or run everything below. That’s one way to do it. Another way is that you can hit Shift+Return. When you hit Shift+Return, it’s gonna execute that particular cell — in other words, it sends the command to the driver on the cluster, executes that cell, returns you the result, and goes to the next cell. If you want to remain on the same cell, you can just do Ctrl+Return, and it will execute that particular cell in place. Now, one nice thing about notebooks, for those who are used to them, is that you can use annotations and markdown to put in images and share the narrative of what you’re doing with your colleagues. I’m just making sure I’m good on time. So our first model: we’re gonna create a regression model to predict gas consumption, in millions of gallons, from a small data set covering 48 states. Some of the key features over here: we’ve got petrol tax in cents, we’ve got per capita income in US dollars. Those are the features that we have, and we wanna predict what the consumption is in millions of gallons. So this seems to be a regression problem. Now, those of you who are familiar with scikit-learn know its random forest regressor — I won’t get into the theory of that — but normally, for any regression problem, you have four standard metrics that you wanna work with. Those are the mean absolute error, which is the mean of the absolute difference between the predicted value and the observed value; the mean squared error, which is the mean of the square of the difference between the two; the root mean squared error, which is just the root of the MSE; and then R-squared, which sort of gives you an idea of how good the fit is. And what we’re gonna do over here: we already have existing code, but we’re gonna infuse our code with MLflow tracking, so you can continuously run these experiments using the fluent APIs, and then look at the MLflow UI to compare and contrast experiments. And then the next thing to ask is: what can I do if I don’t like my metric? What are some of the parameters that I can change? What am I missing? What regularization techniques can I use? What hyperparameter tunings can I use? And this is where the whole idea of experiments comes in. So let’s go and do that. One of the things you can do within a notebook is run other existing notebooks prior to running this one. And the way I’ve structured this particular code is that I made this notebook the driver, the experimenter, for me to run all the different experiments. All the code that does the training, and all the code that logs the parameters, is abstracted away in classes that I use, and I bring those classes into scope over here by running this particular setup notebook for the classes. So what does this actually look like?
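(As a quick aside, here’s a hedged illustration of those four regression metrics, computed with scikit-learn on made-up arrays rather than the workshop data.)

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([541, 524, 561, 414, 410])   # illustrative observed values
y_pred = np.array([530, 540, 555, 420, 400])   # illustrative predictions

mae  = mean_absolute_error(y_true, y_pred)     # mean |observed - predicted|
mse  = mean_squared_error(y_true, y_pred)      # mean (observed - predicted)^2
rmse = np.sqrt(mse)                            # root of MSE, same units as the target
r2   = r2_score(y_true, y_pred)                # how much of the variance is explained

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```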
Well, let’s go back to my workspace and I’ll click on this directory called setup. And if I look at my class setup, I have nothing but the same thing: go ahead and %run. Now, %run is a magic command that tells the notebook to go ahead and execute this other notebook. So I have this class called lab utilities, I have this class called regression, I have this other class called classification, and these are the classes whose instances I use in my driver to run the experiments. And then I have a base class. So let’s look at what this particular utility class looks like. I’ll go back over here and I look at lab utils. For those of you who are Python programmers, these are nothing but really simple high-level classes. And I made these static — or a singleton class — whereby I can use a single instance for all these particular functions. So I have a method that allows me to load the data. I have the ability to use matplotlib to plot the graphs that I want, so I don’t have to pepper or repeat this particular code in my notebook. I have another method that allows me to plot the residuals — I can use this method over and over again for different datasets. Some utility functions that I love to create: I have a plot function that allows me to print my confusion matrix and create a plot, so I can store that as an artifact. So that, essentially, as you can see, is a utility library that I can use over and over again across my other functions. But more important is this particular class — this is the one that matters. This is my random forest model class that I’m gonna use in my driver. As you can see, I’m getting short on time. I have two sort of class-wide variables called rmse and estimators, which I’m gonna use to keep track of how my RMSE, my root mean square error, compares with the number of estimators — those are the parameters I’m choosing to see if my model is doing any better. And then I have a factory method called new_instance, so I can use the same class to create a new version of the random forest with a different set of parameters, so I can experiment with those. That gives me the ability not to rewrite code; I can just delegate that code over here. And then the params property is just a getter for me to get the parameters that I need. And then here is the gist of it. I have this method called mlflow_run that takes a couple of arguments. And, as I said, I’m losing my mouse. This is where I use the MLflow API to do everything, right? I’ve just surrounded all my ML code — the standard code you use to do your training. When I start this particular run, it’s gonna start a run session with the tracking server, and I’ll get my run ID and I’ll get my experiment ID. And then the X parameter is the features I extract from the DataFrame that comes in, and the y is the label that I need. And this is very standard ML code that most of you are gonna be using on a daily basis. Right over here I’m gonna split my data using scikit-learn’s train_test_split, and I’m gonna scale the data. Now, this may not be necessary, because in random forests, scaling or normalizing the data is not imperative — in other words, it won’t make a difference. I’m just using it here because the way you compute the Gini impurity per node as you split is on one particular feature, and you don’t use the Euclidean distance. But for other algorithms, it is important to scale and normalize.
But one way to test that would be: what about if I ran an experiment where I use the StandardScaler versus the MinMaxScaler — would it actually make a difference? I would argue it probably won’t, but I just kept it there for argument’s sake. And then I do my training and prediction, and then I log my model. Now, I’m using scikit-learn, so I’m using mlflow.sklearn.log_model to log the particular model, and I’m gonna call it random-forest. Then I log my parameters. Now, there are two ways to log parameters using the MLflow API: you can either log individual parameters, or you can provide a dictionary, which logs everything in a batch. So my code is literally short — I just do mlflow.log_params, and whatever parameters I give to my algorithm, those are the things that I wanna track, so I can just go ahead and log them. Then I compute my regression metrics, as you would, and once I’ve done that, I log those particular metrics over here.
We saw that in the earlier simple example. Then I append to and update my class variables, which I keep across all the experiments, to see how my root mean square error changes with the number of estimators I used. And then I’m just printing some data out.
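Pulling those pieces together, here’s a hedged, simplified sketch of what an mlflow_run-style method can look like — the class, column, and model names are illustrative, not the exact workshop code:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


class RFRModel:
    """Thin wrapper so the driver notebook never has to rewrite training code."""

    def __init__(self, params):
        self.params = params
        self.rf = RandomForestRegressor(**params)

    @classmethod
    def new_instance(cls, params):
        # Factory method: a fresh model for every new set of parameters.
        return cls(params)

    def mlflow_run(self, df, run_name="random-forest-regression"):
        with mlflow.start_run(run_name=run_name) as run:
            X = df.drop(["consumption"], axis=1)      # illustrative label column name
            y = df["consumption"]
            X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

            self.rf.fit(X_train, y_train)
            predictions = self.rf.predict(X_test)

            # Log the model, the parameters (as one batch), and the metrics.
            mlflow.sklearn.log_model(self.rf, "random-forest-model")
            mlflow.log_params(self.params)
            mlflow.log_metrics({
                "mae": mean_absolute_error(y_test, predictions),
                "rmse": np.sqrt(mean_squared_error(y_test, predictions)),
                "r2": r2_score(y_test, predictions),
            })
            return run.info.experiment_id, run.info.run_id
```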
So that’s really the gist and the essence of my code. Oh my goodness, I’m running out of time. So let’s go back over here and let’s run the setup.
So this will just create my instances of the particular classes and bring all the classes into scope. And now, once I have all these classes, I can start running my experiments.
These are just the instances of the classes being created. So the next thing I’m gonna do is load the data set, and this particular load utility, as you can see, is a utility function that allows me to load it; it gives me a pandas DataFrame. And I can look at the data and see what it looks like. It turns out this is my data, and this is what I’m actually predicting for the 48 states. Now, the data set is small, so we’ll see how it fares. Let’s look at what the data looks like across the board. This is the mean of what we’re predicting, so we probably wanna have our RMSE fairly good, so that it predicts within a reasonable range of that mean. And here’s my driver — this is really the gist of my entire experiment. All I’ve done is set it up so I can go in and change any parameters that I want, and I can run this indefinitely, till the cows come home, and keep on tuning the parameters. I just create a new instance and create a new MLflow run, and off I go. So let’s run this with a range starting at 20 and going all the way up to 250 estimators, in increments of 50, and I keep changing the depth. So these are my parameters that I’m sending to a new instance, and then I’m gonna run the experiment with this particular data set, and that’s just gonna call the MLflow run to do all the logging for me. So let’s go ahead and do it.
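A hedged sketch of what that driver loop can look like, building on the illustrative RFRModel sketch above rather than the exact notebook code:

```python
# Sweep the number of estimators (and bump the depth), logging one MLflow run per setting.
max_depth = 4
for n_estimators in range(20, 251, 50):          # 20, 70, 120, 170, 220
    params = {"n_estimators": n_estimators, "max_depth": max_depth, "random_state": 42}
    model = RFRModel.new_instance(params)        # fresh model for each experiment
    exp_id, run_id = model.mlflow_run(df)        # df is the gas-consumption DataFrame
    print(f"run {run_id} in experiment {exp_id}: {params}")
    max_depth += 2                               # keep changing the depth, as in the talk
```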
So this is gonna run, and once it’s done, it starts printing all of those out for me, and it’ll return me the plot as well. And then we can look at the experiments within the MLflow UI to see what they look like.
So this is roughly what my root mean square error came out as. You can see that it started going down significantly as I increased my number of estimators, and then it started going up again. So at some point it’s not doing a good job: it’s either overfitting or it’s deviating from its lowest RMSE. So we know that at some point there’s something we can do to regularize it, so we can get a better metric. So how do I look at those particular parameters that were used? It turns out that within the Databricks Community Edition, the MLflow UI that you saw a glimpse of is already part of the notebook. So I can go to this particular little icon over here that says Runs, and if I click on Runs, what I get is a summary of all the runs that I ran. I can look, at a high level, at what my MAE was, what my R-squared was, what my RMSE was for each and every run, and it gives me the parameters as well. So this gives me a quick summary of my runs, of my experiment. But what if I wanna dig down into the details? I can click on this particular icon, as you see, and now it gives me the UI.
So this is the MLflow UI, which is very much integrated with the notebook. As you can see, I have metrics available for me over here — if I click on this, it gives me the RMSE. I can drag this a little bit over here, and I have all the runs that I ran. I had about five runs, and each has different parameters. I can look at all these different parameters at a glance in a very tabular way, where it’s easy for me to see what they are. I can click on any of these and compare them, and now I get a very nice tabular view: here are my parameters across all of them, to see which model gave me the best one. Here’s my RMSE across them — seems like this one was probably the best. And then I have a scatterplot that allows me to visually see the correlation between any of these parameters, and I can change those. So this is my parameter for depth: the depth was eight, and after eight you can see it didn’t make a difference — it actually went up. So the more estimators and the higher depth I have, it’s not gonna make a difference. There’s some fine point over here that you can use. I can add another one: I can plot my RMSE against the max depth to see how it fares. This gives me a good way to use the scatterplot to see if there’s a correlation. Alternatively, I can go to the individual run and look at how that individual run fared. So these are all my parameters for the individual run, and there are all the metrics that I logged. I can put my tag over here if I wanted to — I can create a tag called, say, workshop. Here are all the artifacts that I created and logged, and the ability to ask: what does my model look like? Next week, when we talk about models, we’ll get a better idea of what it actually means to be a model. If I click on this particular model, it gives me what the model looks like internally, and this is just a YAML file that gives me some key-value pairs. The model that I created has two flavors: one is a python_function, and the other one is scikit-learn with a serialized pickle version. It shows the version of Python that I used, and this is my run ID. You’ll get more of an idea in part two, when I go deeper into the model. At any point, you can download this particular file. The important part is the conda.yaml here. The conda YAML is really a way to capture the essence of how the model was actually built: what were the dependencies I used to train this particular model? If I wanted to reproduce it, I need these dependencies, and it captures them. I used Python 3.7.3 on my Community Edition, I used scikit-learn version 0.20.3, I used MLflow, and my pickle library was 0.8. That’s the important part of it. Here are some of the plots I logged — this one allows me to look visually at how my root mean square error progressed or declined as I increased the number of estimators. And you can plot different things: what was the correlation between the depth of the trees and the number of estimators against any of the particular metrics? So this whole idea of experimenting, of running experiments, of visualizing them — this is what tracking allows you to do.
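If you’d rather do that comparison programmatically instead of in the UI, a hedged sketch — using the metric and parameter names from the illustrative examples above — could be:

```python
import mlflow

# Pull every run in the current experiment back as a pandas DataFrame,
# sorted so the lowest RMSE comes first.
runs = mlflow.search_runs(order_by=["metrics.rmse ASC"])
best = runs.iloc[0]
print("best run:", best["run_id"],
      "rmse:", best["metrics.rmse"],
      "n_estimators:", best["params.n_estimators"])
```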
And then next week, when we talk about projects and models, we’ll take the extra step to see how they all fit together. So that, in essence, is what tracking is all about. And I think I’m almost out of time here. Unfortunately, we have these two other notebooks for you to run — you can do them the same way — and I have homework assignments for you, which will prove to be a cure for your insomnia if you’re restless at night. The homework is things like changing this to a linear model and seeing how it works. So I think that, in essence, is MLflow Tracking. And let me go to my slides again.
And if you have any questions, feel free to ping me. DM me at @2twitme or just follow me on LinkedIn.