PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of environment. This talk is a practical demo on how to use PyCaret in your existing workflows and supercharge your data science team’s productivity.
Moez Ali: Hello everyone. Thank you for joining me today in this talk, Machine Learning with PyCaret. We have a tight agenda today. This is a one-hour talk, so I’ll introduce myself quickly. Then we’ll talk about the machine learning life cycle in general, at a very high level. After that, I’ll introduce you to PyCaret, which is what this talk is all about. I’ll talk about the use cases of PyCaret. And then towards the end of this session, we have four demos, and I encourage you to follow along with me in that section. And then we’ll end our talk with questions and answers. Okay. A little bit about myself: my name is Moez Ali. I’m a data scientist. My background is in finance, economics, computer science and data science. I have worked in industries like healthcare, education and consulting, and right now I’m working in the FinTech space here in Toronto, Canada, where my team and I are solving some interesting time series problems.
We are actually inventing some novel methods to predict the rebound of the economy after COVID-19. So it involves a lot of economics and finance, as well as data science and machine learning. In the last five years, I’ve lived and worked in four different countries, and each of them happened to be on a different continent. These days I’m based in Toronto, Canada. I do a lot of open source community work, and the best-known work that I’ve done is PyCaret. You can see some links down on the slide: my LinkedIn, my Twitter profile, and Medium. I’m a very active blogger. I write at least three to four times a week. And I write mostly about big data levels, citizen data scientist tutorials, and something about data engineering too.
And there’s my email in case you want to contact me. Some important links: the official website of PyCaret, GitHub, LinkedIn, but the most important thing here is this last link, github.com/pycaret/pycaret-demo-dataai2021. This presentation, as well as all the tutorials, notebooks and data sets that we are going to use today, are uploaded to this GitHub location. Please feel free to clone it and follow along with me in the tutorial section if you want. Okay, so let’s get started. What you see in front of you is a really high-level view of the machine learning life cycle, which starts with the business problem. This is where the business would make a use case of what they’re trying to solve. Is it regression? Are you trying to predict [inaudible] value, or are you trying to predict an outcome? Or maybe you are not trying to predict anything, you just want to segment your customers, in which case that would be a clustering problem, as you may guess. This is the most important stage.
Unfortunately PyCaret cannot help you with the business problem or use case; you have to come up with that by yourself. But once you have your business problem sorted out, you have met your key stakeholders and you have agreed on the project timelines and deliverables, you get into the data sourcing and ETL stage. This is where you would usually source the data from organization and enterprise databases. This could be as simple as a local SQL server or something as complicated as a cloud-managed data warehouse service like Snowflake. And depending on the use case, sometimes you don’t have data readily available. So for example, if we talk about a customer churn prediction problem and you have to predict whether the customer would leave or not, usually what you want to do is predict it with some lag. You want to predict it three months in advance or six months in advance, in which case you don’t have that data sitting on the server the way you want it.
You have to create that lag, right? So data sourcing and ETL may also involve data generation processes sometimes, depending on the use case. After data sourcing and ETL, the next step is exploratory data analysis. This is where you would investigate or, I would say, evaluate your raw data. And you would visualize it to figure out things like whether the data has missing values, what the overall quality of the data is, how many numeric and categorical features it has, what the correlation is, what the distribution is, stuff like that. At this point modeling hasn’t started yet. You are just checking out your data. And so normally data scientists would do a very detailed analysis and form some hypotheses or assumptions that they would test or execute in the steps ahead.
The next step is data preparation. So this is different from ETL. Data preparation in the context of machine learning would include things like the train test split that you have to do, missing value imputation, scaling, transformations, feature engineering, feature extraction. So all of that sort of thing that would make sure the data is in a shape that is ready to be consumed by algorithms. The next step, which most people are excited about, is model training and selection. As the name suggests, what it involves is training multiple models, training multiple estimators, and then tuning their hyperparameters, maybe ensembling them together through a voting or stacking regressor or classifier, depending on the use case. And then eventually you finalize one best model that you would use in production to satisfy some business objective.
And naturally the last stage of this process is you would take that one best model, one best model maybe out of 50 that you trained, or maybe out of 500, or maybe out of 5,000. It could be anything; it depends on the use case and the level of accuracy or precision you are going after. And how do you know that? It totally depends on the business use case, right? Are you trying to detect a disease, where it’s a matter of life or death, or are you trying to know whether the customer will respond to the market movement, right? So this entire process is very iterative. It kind of runs in a loop. The business problem is outside the loop in this diagram, but that’s because once you agree on a business problem, you won’t change it, right? If you change it, in my view, it’s a new project. So this entire iteration happens in a loop, right?
And if I just drill down into these two aspects, because this is a very high-level workflow, if I drill down into data preparation and model training and selection, what you would see is a very granular diagram which looks like this. And if you read it from left to right, you would see it starts from data. The first step in a typical supervised machine learning experiment is the train test split. You can see you have your training data going right through the flow, and the test data, at the beginning of the experiment, we have just taken it and locked it. We are not using it for anything. And then you have this train data, and you start by doing things like missing value imputation, scaling and encoding. If you have categorical features, you cannot directly consume them, right? You have to encode them with something as simple as one hot encoding or maybe as complicated as weight-based encoding.
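As a rough illustration of the first two steps just described (lock away a test set, then encode categorical features using only the training data), here is a minimal standard-library sketch. The function names, the toy rows and the 30% test fraction are assumptions made for this example; it is not how PyCaret implements these steps internally.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=123):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_categories(values):
    """Learn the category levels from the training data only."""
    return sorted(set(values))

def one_hot(value, categories):
    """One 0/1 column per known category; unseen values become all zeros."""
    return [1 if value == category else 0 for category in categories]

# Toy data (hypothetical): ten rows with one numeric and one categorical feature.
regions = ["north", "south", "east", "west"]
rows = [{"age": 20 + i, "region": regions[i % 4]} for i in range(10)]

train, test = train_test_split(rows)

# The encoder is fitted on the train set, then applied to both sets,
# so the locked test set never influences the preparation steps.
categories = fit_categories(row["region"] for row in train)
encoded_train = [[row["age"]] + one_hot(row["region"], categories) for row in train]
encoded_test = [[row["age"]] + one_hot(row["region"], categories) for row in test]
```

Fitting the category list on the train set alone mirrors the point made in the talk: the test data is set aside at the start and is not used for anything until the final check.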
And then you have things like feature engineering. And then at one point, when you are done with preparing your data, you enter into another loop, which is here in the bottom right corner, cross validation. This is where you take your data, which is now ready for modeling, and you start fitting multiple models, multiple estimators, multiple environments, right? And you run into a mini loop where you train it, you tune it, you ensemble it, and then you select one. And then again, you would see arrows going in the left direction, which basically means once you have that one final model, you would apply it on the test set as your final check. Before you take that model and put it in production, what you want to do is evaluate it on the test or holdout set to see that it’s not overfitting and that you have not messed up your cross validation, right? And then you finalize your pipeline. The pipeline consists of all the transformations plus the estimator. You deploy it and keep monitoring, and it would keep running in a loop. Right?
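The cross validation loop described above can be sketched in plain Python as follows. This is only an illustration of the idea: the "model" is a trivial mean predictor, and the helper names and toy target values are assumptions for this example.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, validation
        start = stop

def cross_validate_mean_model(y, k=5):
    """Fit a mean predictor on each training fold and score MAE on the held-out fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        prediction = mean(y[i] for i in train_idx)           # "training"
        mae = mean(abs(y[i] - prediction) for i in val_idx)  # "validation"
        scores.append(mae)
    return scores

# Toy target values (hypothetical).
y = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 12.0, 10.0, 11.0, 14.0]
fold_scores = cross_validate_mean_model(y, k=5)
print("MAE per fold:", fold_scores)
print("Mean cross-validated MAE:", mean(fold_scores))
```

In a real experiment the mean predictor would be replaced by each candidate estimator, and the mean of the fold scores is what gets compared across models before the single final check on the holdout set.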
And the reason I say in a loop is because imagine your use case involves time series prediction at a SKU level. And what you want to do is retrain those models every month, every week, or every day, depending on the use case. So you have to set up this process in a way that you are able to retrain your models in an automated way in production. Okay. So what are the challenges then? Based on this process, if you look at it or think about it, machine learning is an iterative process. And because it’s iterative, it is very, very time consuming. The right tooling in the hands of the right people is very important here. Now what is changing in the last few years is that companies are establishing data science teams that are functional: data scientists in marketing, data scientists in finance, data scientists in HR. And what happens with that shift is that the typical persona of a data scientist, which is normally very computer science and statistics heavy, is something we are missing when we are scaling the culture of data science and machine learning.
The tools that are working for typical software engineers are not necessarily the same tools that will work for a data scientist. So getting the right tooling to the right people is very challenging here. Then there is creating a seamless pipeline. As we have seen on our last slide, it’s not only a model, it’s the entire pipeline that has to be in sequence and orchestrated, and that will actually satisfy your goal for the project. It’s not just the model, but the entire pipeline, right? And managing it in production is even harder than creating it. Focusing on the end goal and solving business problems is absolutely key. This is the whole idea. This is the reason we are doing this. And what happens in the smaller teams that don’t care about all this is that they take on technical debt very quickly.
And when that happens in small teams, the focus on the end goal and solving business problems can take the back seat at the expense of maintaining code and maintaining technical infrastructure. And the final point, which I’m sure everybody would agree with: scalability is not just desirable today, it is very much needed. We have to take our models outside of the notebook, and only then is it useful, right? Only then is it serving some purpose, generating some revenue or minimizing some costs. Models in notebooks eventually die.
Okay. So what is PyCaret? PyCaret is an open source, low-code machine learning library, and it’s an end-to-end model management tool, which basically means it takes you from data preparation to the last stage of the workflow that we’ve seen, which is deployment and monitoring. It is commonly used for rapid prototyping and deployment of pipelines on the cloud or locally. And this is kind of our key proposition. PyCaret is extremely easy to use. It’s a productivity tool because it saves a lot of time, and it’s targeted. It’s intentionally designed, consciously designed, for a business audience. Features of PyCaret: we have data preparation. So this is things like missing value imputation, scaling, transformation, PCA. Model training is very self-explanatory; it involves training models. Hyperparameter tuning is hyperparameter tuning of models; it’s self-explanatory. Analysis and interpretability involves things like checking the AUC plot of your model.
You’re checking the confusion matrix of your model. If it’s a regression, you’re checking the residuals plot of your model, the Q-Q plot of your model, or you’re even checking the SHAP values of your model. That’s part of interpretation. Model selection is something that happens naturally. As you are doing all of this, that’s the whole point, right? You iterate this entire cycle and select one final model. So model selection is the natural outcome of this entire process. Experiment logging is really important because as you iterate over that loop again and again, you generate a lot of metadata points, sometimes in the hundreds of thousands and sometimes in the millions too, right? Imagine you have a time series use case at a store and SKU level, and you have 500 time series and you are training, let’s say, 50 models and tuning the hyperparameters for each of the 500 time series. Imagine for every model you have at least 10 or 15 hyperparameters, and for every model you would maybe keep track of five or six performance metrics like accuracy, AUC, recall, precision.
As you do the math, this would be millions of data points. So it is very important that you keep track of that, right? And PyCaret has integration with MLflow, which is an open source project by Databricks. PyCaret automatically does that logging for you, and you will see that in the demo. Okay. So the use cases that are currently supported in PyCaret are: classification, which is where you predict a discrete outcome, binary or multi-class; and regression, which is predicting a continuous value. These are supervised experiments, which means that you have to identify the target column in your data set, which essentially means that you should have a labeled dataset. Unsupervised modules include things like clustering, anomaly detection, association rule mining and NLP. And we have a brand new module coming in the next couple of weeks, time series, which we are very excited about.
Okay, what you see here is one experiment that we have done on the impact of PyCaret. On your x-axis, what you see is the workflow: data preparation, model training, model selection, model evaluation. This is the same workflow broken down into four categories. On the y-axis, what you see is cumulative lines of code. So we have set up this experiment where we have designed a list of tasks which would involve things like preparing data, the train test split, training multiple models, analysis, and then obtaining predictions. And we have performed those tasks using base scikit-learn code, pandas code, all the base libraries. And then we have performed the same experiment, producing the same results, using PyCaret. This green line here represents scikit-learn and the red line is PyCaret. And you can see that by the time you finish the experiment, with scikit-learn, pandas, [inaudible] we had to write around 170 lines of code.
With PyCaret we got the results with 20 lines of code. So now if you think from the perspective of somebody not coming from a computer science background, somebody coming from a domain or functional expertise, for them this is great motivation to get into this, to use these kinds of technology to solve their problems. Okay, these are some numbers. PyCaret’s first stable release was publicly announced in April of 2020. So it’s been almost a year. And last year we had more than 500,000 downloads. We have over 3,000 GitHub stars, 1,700 commits, and the most important number here on this slide, the one I’m very proud of, is contributors. We have about 46 contributors contributing to this project now. And I would take this opportunity to tell all of them that this was not possible without them. So I’m very humbled, very thankful to them.
Here are our 46 contributors, shoutout to them. Okay, you can use PyCaret on CPU as well as on GPU. So if you have a compatible CUDA-enabled GPU, you can use PyCaret without any additional configuration. There’s just a parameter that you have to pass in the setup function; you will see that in the demo. And that’s it. How we do it is based on the RAPIDS AI project, and they have two libraries, cuML and cuDF I think. This is an amazing project, and we are using it to provide you that GPU training functionality. Similarly, if you have very big datasets and you are doing hyperparameter tuning, PyCaret has integration with Ray, which provides a distributed processing framework in Python. So again, these two great libraries have made it possible for us to do what we are doing.
There are a few more integrations, so you can see scikit-learn and RAPIDS. These are for model training. Ray is for distributed processing, MLflow is for logging and MLOps, Yellowbrick and Plotly are for plotting and charting functionality. Optuna is another hyperparameter tuning library. Gensim and spaCy are for our NLP module. So PyCaret would not be possible without all these awesome, amazing open source projects. So that’s the acknowledgement.
Okay. So this brings us to the final part of our presentation today. Just a reminder, if you would like to follow the demos along with me, here’s the GitHub link where you can download all these notebooks, github.com/pycaret/pycaret-demo-dataai2021. With that, let’s head over to the notebook. Okay, here I am on Demo 1, regression. This is the first notebook. If you do not have PyCaret installed, you can run pip install pycaret either in your notebook or on the command line. It would take a few minutes to install pycaret. For this demo, I’m using this regression dataset in which each row is a patient. And there are six attributes of the patient, which are age, sex, BMI, children, smoker and region, based on which we have to predict the charges column. It’s a really small data set, just for the purpose of demonstration.
Okay. So the first step in PyCaret in any experiment whether it’s regression, classification, clustering, it’s all the same. The API is unified. So the first step is to execute the setup function and what setup does, it essentially prepares your data for modeling. So all the data preparation steps such as train test split, scaling, transformation, one hot encoding, missing value imputation, feature engineering, et cetera, et cetera, all of them are done at this stage. The setup function takes two parameters. The data frame which is this variable here, and the name of target column which is charges. Session ID is basically a random number that helps you reproduce the experiment at a later time. Let’s run it. So when you execute this function, the first thing PyCaret would do is it would infer the data types for each column.
And if you’re okay with these data types, you can just press enter to continue. If you’re not okay with the inferred data types, there’s a way to override them; if you read the documentation, you would find out how. Anyway, this function, if successfully completed, would return this output, which basically shows you some important information. You can see our session ID is here, the name of the target column, the shape of the original dataset, whether there were any missing values, how many features were numeric, how many were categorical, and so on and so forth. And you can see the split was done here. The train set has 900 rows, the test set has 400 rows. So it’s a 70/30 split by default, but you can change that percentage.
There is a bunch of other information, such as the fold generator, shuffling, and how many cores of your CPU you can use, which by default is set to minus one, which basically means all the cores. You can use PyCaret on GPU. So if you have an Nvidia CUDA-enabled GPU, you can pass use_gpu is equal to True, like this, and all the workload would fall on the GPU if you have one. I don’t have a GPU on this computer, so I’m going to just get rid of this. Okay. Here’s how you can access the transformed train set, get_config('X_train'). So you can access a bunch of variables that are created in the background. This is the list of all variables. You can use the get_config function to access them. And if you notice, you would see that we have a bunch of new columns here. They are the result of one hot encoding on categorical features.
There you go. All these columns are one hot encoded categorical features. Okay. So now that the data is ready for modeling, the first function that will come in any supervised experiment in PyCaret is this compare_models function. Let me just run it. What is happening here now is that we are training all the available models in our model library one by one, using k-fold cross validation on the train set. And what you see here are the metrics using cross validation. And this table is ordered from highest to lowest performing model. So let’s see. Let this one finish first. Okay. So here we see all the trained models and their metrics using k-fold cross validation. By default, it uses 10 folds, but you can change the number of folds by passing the fold parameter in compare_models or even in setup.
Okay. So we have our best model, which is the gradient boosting regressor, doing $2,700 in mean absolute error. Let’s just see what it is. So this is the gradient boosting regressor, and these are all the default hyperparameters of this model. Let’s check the type. As you can see, this is a scikit-learn GBR. All right, while compare_models is a very good function to get a baseline or a starting point ready, create_model is a more granular function, and it basically trains one model at a time. So if I say create_model, dt, dt is the ID for decision tree. And if you check the documentation here, you would see all the models with their IDs, so lr is for linear regression, lasso is for lasso regression, and so on and so forth. We have used this one, dt. If we want to train a support vector machine, the ID for support vector machine is svm. And there you go. It’s the same thing. Now each row here represents a fold, right? There are 10 folds, so you can see zero to nine.
If you want to change the fold, you can just say fold is equal to three. Now you have three folds. And there’s a mean, which is basically the average across the folds, and its standard deviation. What you see in compare_models is basically the mean cross-validated score. Right? All right. Let’s switch back to dt. Let’s print dt. So this is the decision tree doing $3,148 in mean absolute error. And if you see the hyperparameters, these are the default hyperparameters of the decision tree. Now you can tune the decision tree with this tune_model function. Let me just run it. And let’s see: from $3,148, we have gone to $2,051 by tuning the hyperparameters of this model. How do we get the hyperparameters of this model? Behind the scenes, we have dynamically defined the hyperparameter search space for each of the estimators. And based on which estimator you are using, we are going to use that search space to iterate over the parameters randomly.
Now if you want to pass your own search space, you can pass custom_grid here in this function. And then instead of using our defined grids, we’ll use the grid that you pass. One thing to note is that by default, we are using random grid search, which basically means that we randomly iterate over combinations of hyperparameters to find the best combination, but there are other and probably more effective methods to tune the hyperparameters of your model. Those methods include things like Bayesian search or tree-based search. And there are a couple of open source libraries that provide that functionality. The problem is that the API for each of them is different. So in PyCaret, you can actually pass search_library and, let’s say, Optuna, one of those hyperparameter tuning libraries. Now if you run this, the search space is the same, but this time PyCaret is using the method defined by Optuna to iterate over the search space.
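To make the default strategy concrete, here is a small standard-library sketch of random search over a hyperparameter grid. The search space, the toy error function and the function names are assumptions made for this example; a real run would score each combination with cross validation rather than a formula, and this is not PyCaret’s internal code.

```python
import random

def random_search(search_space, score_fn, n_iter=10, seed=123):
    """Sample n_iter random combinations and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and at random.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for a decision tree.
search_space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}

def toy_error(params):
    """Stand-in for a cross-validated error; lowest at depth 6 and leaf size 5."""
    return abs(params["max_depth"] - 6) + abs(params["min_samples_leaf"] - 5)

best_params, best_score = random_search(search_space, toy_error, n_iter=10)
print(best_params, best_score)
```

Because only a handful of random combinations are tried, the result is not guaranteed to be the true optimum, which is why the talk points to Bayesian and tree-based search methods as potentially more effective alternatives over the same search space.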
Similarly we have scikit-optimize, and there are a few other options that you can check in your own time. And if you have a very large data set, PyCaret has integration with Ray, so you can actually tune the hyperparameters of your model on a cluster. Okay. Similarly, we have an ensemble_model function, which basically takes your estimator and wraps it in a bagging or boosting estimator. So in this case, we have a decision tree that we tuned, and we pass the tuned decision tree with these hyperparameters into ensemble_model. And now if I show you my ensemble model, this is the same decision tree wrapped in a bagging regressor. Okay. Now if you have heard about the bagging regressor, you would know the number of estimators is really important. By default it’s set to 10, but if you want to increase it, you can take this parameter and set it to, let’s say, 25.
So now this would build the tree 25 times instead of 10 times by default. And as you can see, the performance has improved a little bit as I increased my number of estimators, but you have to be careful about overfitting when you do things like this. Okay. There’s another way of ensembling models, which is that you can train individual models and then blend them together, which is to take their predictions and average them in some way, a weighted average or a normal average. So here I’m training three separate models, decision tree, lasso regression and knn, and passing them into the blend_models function as a list. And what you see here is the same ten-fold performance metrics. And if I show you blender, this is your voting regressor from scikit-learn, and all these are the estimators that you passed inside the list.
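The blending idea described above, averaging the predictions of several models with or without weights, can be sketched in a few lines of plain Python. The function name and the numbers are made up for illustration.

```python
def blend_predictions(predictions_per_model, weights=None):
    """Average per-row predictions across models, optionally weighted."""
    n_models = len(predictions_per_model)
    if weights is None:
        weights = [1.0] * n_models  # plain average
    total_weight = sum(weights)
    n_rows = len(predictions_per_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_per_model)) / total_weight
        for i in range(n_rows)
    ]

# Hypothetical predictions from three models for the same two rows.
tree_preds = [1.0, 2.0]
lasso_preds = [3.0, 4.0]
knn_preds = [5.0, 6.0]

simple = blend_predictions([tree_preds, lasso_preds, knn_preds])
weighted = blend_predictions([tree_preds, lasso_preds, knn_preds], weights=[2, 1, 1])
print(simple)    # plain average of the three models
print(weighted)  # the first model counts twice as much
```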
The type of this is the ensemble voting regressor from scikit-learn. Similarly, we have stacking, which is also an ensembling technique, but it works a little bit differently. Instead of voting, it basically takes the output of these estimators as an input for a meta model and then trains a final meta model based on these inputs. It’s like a neural network without back propagation. So you can see we have these models here, decision tree, KNeighbors and lasso, wrapped inside a stacking regressor. Okay. So now that we have trained a bunch of models and we have our best model, we can analyze the performance using a few plots. So now I’m passing my best model, which is the gradient boosting regressor, into evaluate_model. And you would see this nice little UI where the selection at the moment is hyperparameters.
So you see this, but you can also see the residuals plot. You can see feature importance if it’s available. You can see a bunch of other things: Cook’s distance, prediction error, et cetera. We have another function called interpret_model, which is pretty interesting. There’s a library in Python called SHAP, which is a way to interpret your complex tree-based models. It is not very easy to explain the results of this in such a short time, but the code is here just in case you want to use it in the future. Okay. Now that we have our model, let’s see the performance of this model on the test set, right? Because so far, whatever we have seen is the cross-validation performance.
Now, if we use this function here, predict_model, and just pass the best model without passing any dataset, PyCaret would know that you want to check the score on the holdout set. So in this case, my gradient boosting regressor is doing 2386 on the holdout set. And if I check my cross validation, it was 2702, so even better. If the difference between the cross validation metrics and the holdout is very large, it kind of indicates overfitting or underfitting. In this case, I’m not worried. I’m just going to go ahead. So here I have used predict_model and stored the result in the pred_holdout variable. Now if I show you the head of this, this is your test set, and you would have a Label column towards the end, which is your predictions. And these scores that you see here are basically metrics computed on these two. So this is your actual, this is your prediction, right?
Now more than likely you would be interested in using this model to generate predictions on a new data set where you don’t have the target variable or label, right? I don’t have any production data set for insurance, so I’m going to copy the same dataset, drop the charges column and create a new dataset which looks exactly like this, except there’s no target column. This line here is finalize_model(best). Remember, when we initialized the setup, the dataset was split into two parts, train and test, right? All the modeling that we have done so far is on the train set only. finalize_model would basically take your model with the same hyperparameters and refit it one final time including the test set. So it would fit it on the entire data set, because you don’t want to leave your test set on the table.
Now you can use best_final in the predict_model function and pass your data, which is data2, which is this table here. And it would return a pandas data frame with your original features and the Label column, which is your prediction. Notice that what you see is a human-readable output. Now, models are not consuming these variables like sex, smoker or region as they are; they are transformed and encoded. But when it returns the table, you see human-readable output. All the complexity is handled under the hood. save_model here is how you can save your entire pipeline. So save_model, best_final. So now you see it’s not just an estimator, but a pipeline which has a couple of steps, and the last step is our model here. This is the model, but everything else is transformers.
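Putting the regression demo steps together, a condensed sketch might look like the following. It uses only functions named in the talk (setup, compare_models, predict_model, finalize_model, save_model), but the wrapper function, the session_id value and the file name are assumptions for this example, and exact behavior (for instance the name of the prediction column, or whether setup asks you to confirm data types) differs between PyCaret versions.

```python
def run_regression_demo(data, target="charges"):
    """Sketch of the demo flow; `data` is a pandas DataFrame with a target column."""
    # Imported inside the function so this sketch can be read or imported
    # without PyCaret installed; calling the function does require PyCaret.
    from pycaret.regression import (
        setup,
        compare_models,
        predict_model,
        finalize_model,
        save_model,
    )

    # Prepare the data; session_id is an arbitrary seed for reproducibility.
    setup(data=data, target=target, session_id=123)

    # Train the available models with cross validation and keep the best one.
    best = compare_models()

    # With no data passed, predictions are scored on the holdout (test) set.
    holdout_predictions = predict_model(best)

    # Refit the same model and hyperparameters on train plus test data.
    best_final = finalize_model(best)

    # Simulate unseen data by dropping the target column, as done in the talk.
    new_data = data.drop(columns=[target])
    predictions = predict_model(best_final, data=new_data)

    # Persist the whole pipeline (transformers plus model), not just the model.
    # "best_final_pipeline" is a hypothetical file name.
    save_model(best_final, "best_final_pipeline")
    return holdout_predictions, predictions
```

Calling run_regression_demo on the insurance data from the talk would run the same sequence of steps as the notebook, minus the intermediate create_model, tune_model and ensembling experiments.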
How big or small the pipeline is depends on what you do in the setup function. This time we haven’t used any functionality for pre-processing, but we have a few simple things here like data type inference, a missing value imputer and a one hot encoder. Okay. With load_model, you can load your pipeline back. And if I just visualize this pipeline, you would see we have auto inference, we have the imputer, we have new categorical levels, which is how to deal with new categorical levels when you have new levels in your production data. Then you have a bunch of other transformers, and here’s your model at the end, the gradient boosting regressor. This function here, deploy_model, can basically take your entire pipeline and deploy it on AWS, Azure or GCP. In this case, I’m using AWS. I have my bucket name here, PyCaret-test. If I just run this, it would take the entire pipeline and push it to AWS S3.
And here you can use load_model to read this pipeline back from AWS. This function works in a similar way for all the cloud services. If you want to push it to Azure, you would just replace this term with Azure. Okay. Now we have this loaded pipeline. If I show you again, it’s the same pipeline. Here I have loaded this pipeline from my local drive. Here I have loaded it from AWS. It’s the exact same pipeline. Okay, with that, let’s head over to our second demo, which is time series. Okay. So I’m using this data set here, which is the US airline passengers dataset that has monthly airline passenger numbers from 1949 to 1960. I’ll just quickly plot this dataset as a line plot.
Okay. So this is how it looks. You can see it starts from 1949 and goes until 1960, and on this blue line you can see the peaks and lows. This is basically the summer and winter seasonality. The red line here is the moving average over the last 12 months that we created here, in this line. Now if you just look at the moving average, you would see there’s an increasing trend, and that’s the reason I computed the moving average.
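For readers following along without the notebook, a trailing moving average like the red line described above can be computed as in this small sketch. The function name and the toy numbers are assumptions; the talk uses a 12-month window on the monthly passenger counts.

```python
def moving_average(values, window=12):
    """Trailing mean over the last `window` points; None until enough history exists."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window : i + 1]) / window)
    return averages

# Toy series with a short window so the output is easy to check by hand.
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

Averaging over a full year of monthly data cancels out the summer and winter peaks, which is why the remaining line shows only the trend.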
So this time, even before you start your modeling with PyCaret, what you have to do is extract features from the date. With this data set, you cannot consume dates directly for modeling. So you have to extract features such as the month, the day if it’s daily data (in this case it’s not), the year, the quarter, whether it is a month end, and stuff like that, right? So I have basically converted my [inaudible] data set, which looks like this, into something which looks like this. In this case, passengers becomes my target and all these are my X features. I’m doing a train test split before the setup command. So I’m creating two different datasets out of these. One is train, one is test. And this time I’m explicitly passing the train and test sets to PyCaret.
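A minimal sketch of the date feature extraction and the manual, date-ordered train test split described above, using only the standard library. The feature names and the 1960 cutoff are assumptions for illustration; the notebook in the talk works on a pandas data frame instead.

```python
import calendar
from datetime import date

def date_features(d):
    """Turn a date into model-friendly numeric features."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "year": d.year,
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "is_month_end": d.day == last_day,
    }

# Monthly timestamps from January 1949 to December 1960 (144 months).
months = [date(1949 + i // 12, i % 12 + 1, 1) for i in range(144)]
rows = [date_features(d) for d in months]

# Split by time, not randomly: everything before the cutoff is train.
cutoff = date(1960, 1, 1)  # hypothetical cutoff
train = [row for d, row in zip(months, rows) if d < cutoff]
test = [row for d, row in zip(months, rows) if d >= cutoff]
```

Splitting on a date rather than at random keeps the test period strictly after the training period, which is the property the time series demo relies on.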
Normally, if you don’t pass a test set separately, PyCaret would do a random train test split, but here I’m explicitly passing it. Target is passengers. fold_strategy is time series, because by default PyCaret uses random cross validation. With time series, you cannot do that. You have to respect the order in your data, which is the date order in this case. Numeric features: I’m explicitly passing the data types into PyCaret, so I don’t want PyCaret to infer them. I’m just explicitly defining them. Fold is 3, so three-fold time series validation. transform_target is true, because the target here, US air passengers, is not stationary. The moving average is trending upward, right?
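To make "respect the order" concrete, here is a small sketch of an expanding-window split. It mirrors the default behaviour of scikit-learn's TimeSeriesSplit; that PyCaret's time series fold strategy uses exactly this scheme is an assumption.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for k in range(n_splits):
        start = first_test_start + k * test_size
        train_idx = list(range(0, start))                  # everything before the fold
        test_idx = list(range(start, start + test_size))   # the next block in time
        splits.append((train_idx, test_idx))
    return splits


for train_idx, test_idx in expanding_window_splits(132, n_splits=3):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
# train 0..32  test 33..65
# train 0..65  test 66..98
# train 0..98  test 99..131
```

Unlike random K-fold, no fold is ever trained on observations that come after the ones it is validated on.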
So what I’m doing, in order to model better with my linear algorithms, or regression algorithms, is a transformation on the target variable, not on the X features but on the target itself. By default, it would use the Box-Cox transformation. There are a bunch of other options such as Yeo-Johnson and quantile. But everything is going to happen under the hood, which basically means you can simply call predict on the model and it would predict the transformed target variable, but when it returns the output to you, it’s going to invert the transformation. So it would return the output in the actual scale, and all the complexity is happening underneath. Session ID is the same concept as before. It’s a random state. Silent, true. So when you pass silent, true, the setup function will not ask you to confirm the data types by pressing enter.
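The "transform, predict, invert" idea can be illustrated with a hand-written Box-Cox pair. This is only a sketch of the math: lambda is fixed here for illustration, whereas a real power transformer estimates it from the data, and the sample values are made up.

```python
import math


def boxcox(y, lam):
    """Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox needs strictly positive values")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def inv_boxcox(z, lam):
    """Inverse Box-Cox: map a transformed value back to the original scale."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)


lam = 0.5                      # fixed for illustration only
y = [100.0, 150.0, 225.0]      # illustrative target values
z = [boxcox(v, lam) for v in y]          # what the model would be trained on
back = [inv_boxcox(v, lam) for v in z]   # what the caller gets back
print(all(abs(a - b) < 1e-9 for a, b in zip(y, back)))
```

The model only ever sees the transformed values; the inverse step is what puts predictions back on the passenger-count scale.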
So imagine you want to run this as a script in your production. You don’t want to have PyCaret asking you to confirm data types, right? You would explicitly define the types and just say silent is equal to true. Then these three parameters here: log_experiment true, experiment_name, and log_plots true. Because PyCaret is integrated with MLflow, when you pass these parameters in setup, any metadata that you create in your modeling process, such as the hyperparameters of your model, the performance metrics on cross validation, or even the model artifact itself like the model pickle file, or even plots, PyCaret would automatically log everything. And at the end of the experiment, we can just start an MLflow server on our localhost. And you would see there’s a really nice looking and very useful UI. Let’s run this function.
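Putting the parameters discussed above together, the setup call might look roughly like this. It assumes the PyCaret 2.x regression API (the silent flag belongs to that version line); the target column name, session ID and experiment name are assumptions, and the list of numeric feature columns is left to the caller.

```python
def build_setup_kwargs(train, test, numeric_features):
    """Collect the setup() arguments walked through in the talk."""
    return dict(
        data=train,
        test_data=test,                 # explicit hold-out instead of a random split
        target="Passengers",            # assumed column name
        fold_strategy="timeseries",     # ordered folds instead of random K-fold
        numeric_features=numeric_features,
        fold=3,
        transform_target=True,          # Box-Cox on the target by default
        session_id=123,                 # random state; value is an assumption
        silent=True,                    # skip the data type confirmation prompt
        log_experiment=True,            # log runs to MLflow
        experiment_name="air_passengers",  # assumed experiment name
        log_plots=True,
    )


def run_experiment(train, test, numeric_features):
    """Run setup and compare models; requires PyCaret 2.x to be installed."""
    from pycaret.regression import compare_models, setup

    setup(**build_setup_kwargs(train, test, numeric_features))
    return compare_models(sort="MAE")
```

Keeping the arguments in one function makes it easy to reuse the same configuration from a notebook or a scheduled script.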
Okay. This time notice, it didn’t ask me for any confirmation because of silent true. If I just set it to false, you would see it. It usually asks this, but if you’re running this as a script, you really don’t need it. You can just say silent true. Okay. You can see the index in my X train set goes from zero to 131. I’m expecting the test set to start from 132. So all these last 12 points are for 1960. Okay. Same thing. We’ll start with comparing models. This time we are fitting three folds because we passed fold is equal to three explicitly, and this is not random cross validation. This is time series rolling window cross validation. Okay. So it’s almost done. Okay. So our best model based on cross validation is least angle regression with a mean absolute error of 22. Let’s remove MLflow now. So you see our best model here, it’s this one. It’s wrapped in a power transformer because of that transform_target parameter we passed in setup. Let’s check the holdout score. It’s 25 compared to 22, not bad. We can check evaluate_model: feature importance, prediction error, Cook’s distance.
Okay, now I’m going to plot this. And so you have these two lines, blue and red. Blue is your actual, red is your fitted line which is prediction. And this grey backdrop towards the end is our test set. So this is 1960, the last 12 points of 1960. And if you compare this, it looks like a good fit. Okay. Now I’m creating a future data set here because I don’t have any X variable like you would have normally in regression. So I’m creating a future data set that starts from 1961, it goes to 1965. So this is exactly what our original data set looked like. I’m going to finalize my model, least angle regression, and now I’m going to generate predictions on the future data set. So now you can see from 1961 the label column is our predictions.
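Building the future frame and scoring it could look like the sketch below. The date generation is plain Python; the scoring function assumes PyCaret 2.x, where predict_model adds a Label column, and expects the future rows to be converted to a DataFrame with the same feature columns used in training (the column names here are assumptions).

```python
from datetime import date


def future_months(start_year, end_year):
    """One row per month, with the same kind of date features as the training data."""
    rows = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            rows.append({"date": date(year, month, 1), "year": year, "month": month})
    return rows


def score_future(best_model, future_df):
    """Refit on all available data, then predict the unseen future rows."""
    from pycaret.regression import finalize_model, predict_model

    final_model = finalize_model(best_model)            # trains on train + hold-out
    return predict_model(final_model, data=future_df)   # predictions in 'Label'


future = future_months(1961, 1965)
print(len(future))  # 60
```

Because every feature is derived from the calendar, the future rows can be generated without knowing any future passenger counts.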
Let’s plot it together with our original dataset. And this is what we see. So all these blue points here, this is your actual, which ends in 1960. And from that point onwards, this is your prediction. I’d say it looks good. Okay. Now remember we have passed log_experiment is equal to true in setup, right? Where is it? This thing here. Now let’s see what the effect of that is. With this command here, mlflow ui, you can start a server on your localhost. So if I run it, let’s give it a minute. Okay. All right. So what you see here is air passengers. This is the name of the experiment we defined in the setup function.
You’ll see a bunch of runs by their timestamp. And what you see here is all the individual models. When we ran compare_models, it trained a bunch of models, right? So each model is right here, and you would see lots of things here, such as the parameters and their performance metrics. Let’s open one of them. So least angle regression: date, time, all the parameters, all the metrics of least angle regression, some tags. And you would see these results; it’s the same table that we see in the notebook, but it’s now logged as an artifact. And then you have the model here. So you can actually load this and score your new data set. Right? Okay. Let’s go back and stop this.
Okay. Let me head over to demo three, which is multiple time series with logging, this notebook here. In the interest of time, I have pre-run this notebook already. So this is the dataset. It’s uploaded on the GitHub repo, store_demand. I have sourced this data set from Kaggle. It’s a time series data set for 10 different stores, and each store has 50 SKUs. So basically around 500 different time series at a daily level, from 2013 to, I think it’s a five-year dataset. And you can see it’s 913,000 records by four columns. What I’ve done here is I’ve just filtered to one store, just for the demonstration. So one store would basically mean 50 different time series, because each store has 50 SKUs. Here, I’ve done the same thing. I’ve extracted the features, just like we did in the last tutorial, but you can see we have some extra features here like day of the week or day of the year, because it’s a daily level dataset, unlike the last one which was monthly.
And we have this column here, time series, which is just a unique key. So it would have store_1_item_1, store_1_item_2, store_1_item_3. And if I just call nunique on this, you’d see there are 50 time series. Let’s just visualize three of them. You see the first one is store_1_item_1. This is what it looks like. The red line is the moving average. Let’s see store_1_item_2 and store_1_item_3. All of them look pretty much the same, but notice the Y axis is different for each of them. So the scale is different. Okay. Here’s our training loop. So what I’m doing here is importing PyCaret regression and then just running a loop over the unique values of time series. So a loop that would run 50 times. The first step inside the loop is to filter the data set to that particular time series and then just execute setup on that. Again, you can see I am explicitly defining data types, and that’s why I’m passing silent is equal to true.
This time I’m also passing verbose is equal to false, because I don’t want any output as the code is running. Again, log_experiment, experiment_name and log_plots is true. So once this completes, you can see it takes about 19 minutes to run 50 different time series. In these 19 minutes, for each time series we are creating 25 models and then selecting one final model. So for 50 time series, it’s 50 times 25, which is 1,250 models. In this table here, each row is a time series, and it shows you which model was the best for that time series. So in this case, for store_1_item_1, Bayesian ridge was the best model with an MAE of 3.7. For item two, it’s ridge regression, and so on and so forth.
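The per-series training loop can be sketched as below. The grouping is plain Python; the PyCaret part assumes the 2.x regression API, a target column called sales, and a time series fold strategy carried over from the previous demo, all of which are assumptions. The fitting function is injectable so the loop itself can be tried without PyCaret.

```python
from collections import defaultdict


def group_by_key(rows, key="time_series"):
    """Split a list of row dicts into one list per time series key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def fit_with_pycaret(rows, name):
    """Run one PyCaret experiment for a single series and return its best model."""
    import pandas as pd
    from pycaret.regression import compare_models, setup

    df = pd.DataFrame(rows).drop(columns=["time_series"])
    setup(
        df,
        target="sales",               # assumed target column
        fold_strategy="timeseries",   # assumed, as in the previous demo
        silent=True,
        verbose=False,
        log_experiment=True,
        experiment_name=name,         # one MLflow experiment per series
        log_plots=True,
    )
    return compare_models(sort="MAE", verbose=False)


def train_all(rows, fit_one=fit_with_pycaret):
    """Train one best model per time series."""
    return {name: fit_one(group, name) for name, group in group_by_key(rows).items()}
```

Using the series key as the experiment name is what produces one MLflow experiment per time series.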
Again, I’m creating a future dataset and scoring it. And then I show you this on a plot. This is what it looks like. So for store_1_item_1, we have the blue line, which is the actual, and the red line is the fitted line. And from this point onwards it’s a prediction. So you can see it looks like a good fit. Yep. Okay. Now, because we are logging this experiment, let’s see MLflow. localhost:5000. If I go here, you would see that for each time series, I have an experiment here. So I have about 50 experiments in total. Let’s go into one of them. Again, under each time series, we have trained multiple models, 25 of them, and let’s just click on one. So this is the best model for item one, Bayesian ridge. The parameters of Bayesian ridge, the metrics and the artifacts. So you have results, which is the same grid as you would see in your notebook, the residuals plot, error plot, holdout score, feature importance and the model artifact itself.
So now at this point, you can actually load this model from this location using the load_model function and you can do your scoring. You’d also notice there’s a button here called register model, which is basically MLflow’s native functionality for serving models. So if you click here, create a new model, let’s say test1, this won’t work at this point, because when you don’t define a tracking URI, which is the backend, MLflow uses the file system by default. And when you are using the file system, you cannot use this functionality of MLflow. So for that, let me head over to the last demo here, which is the demo4.py file. Okay, let me zoom in a little bit. So what we have here is a command line script where I’ve created a function called one. I’m using the same insurance data set that we used in the first demo. Here’s your setup function, which is passing the data, the target, log_experiment and experiment_name, and beneath the setup function, I have created a list of four models. So it’s the IDs for four individual models: linear regression, decision tree, LightGBM, random forest.
And then I’m creating a simple loop, a list comprehension, over the create_model function. So basically it would train these four models. I’ve done that just in the interest of time; you can technically have compare_models here instead of using create_model. The only thing that is different here, and that I want you to notice, is line number seven. Before running setup, I’m explicitly defining the tracking URI, and in the parentheses I’m passing sqlite, mlruns.db. So now this would tell MLflow to create a database, a sqlite database file, instead of using the file system. And so you’ll be able to use that model serving functionality of MLflow. To run this script, I’m going to open my command line and go to the same location where this file is, which is the dataai2021 directory. You can see I have demo4 here, and let’s just execute this: python demo4.py. Okay. Training linear regression.
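A script along the lines of demo4.py might look like this. It is a reconstruction under assumptions, not the original file: it assumes PyCaret 2.x and MLflow are installed, that the insurance dataset's target column is charges, and the experiment name and function name are made up for illustration.

```python
# Sketch of a demo4-style script; names marked below are assumptions.
TRACKING_URI = "sqlite:///mlruns.db"        # SQLite backend instead of the file store
MODEL_IDS = ["lr", "dt", "lightgbm", "rf"]  # linear regression, decision tree, LightGBM, random forest


def main():
    """Train four models and log them to the SQLite-backed MLflow store."""
    import mlflow
    from pycaret.datasets import get_data
    from pycaret.regression import create_model, setup

    # Must be set before setup() so every run is logged to the database
    mlflow.set_tracking_uri(TRACKING_URI)

    data = get_data("insurance")
    setup(
        data,
        target="charges",                   # assumed target column
        log_experiment=True,
        experiment_name="insurance_demo4",  # assumed experiment name
        silent=True,                        # assumed, so the script never waits for input
    )
    return [create_model(model_id) for model_id in MODEL_IDS]


# Call main() to run the experiment, e.g. from the bottom of the script file.
```

When starting the MLflow UI afterwards, the same sqlite URI has to be passed as the backend store so the UI reads from the database rather than the default file store.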
Let’s give it a minute here. So what I’m going to do now, once this is completed, is start the MLflow server using the same command, mlflow ui, but this time the only thing different is we would pass this parameter as well, the backend store URI with the name of the database. And you can see, as MLflow is running, I have this new file created in my folder, mlruns.db. It’s a database file. Okay, now this is completed. Let’s copy this command and start the server. Okay. Let me open localhost now. All right. You see we have insurance demo4 and we have multiple runs here. So we have random forest. Let’s go to random forest.
We have our artifacts and plots. Let’s go to model. And now this time let’s register the model, create a new one. And I would say my first model. Register. So there’s no error this time. And if you click on this, you have your model here, version one, the random forest model. So this is now an API that you can use to generate your predictions. Before we do that, let’s just go and pick a few other models. You can see this icon here, which kind of says it’s an API. Let’s go to LightGBM and decision tree, and let’s register LightGBM under my first model. Let’s register it, and also let’s register decision tree under my first model. And now if I go to models, you would see that we have three versions. The first version is random forest, the second version is LightGBM and this is decision tree.
I’ll head over to my notebook, and this is the demo4 notebook. And I’ll just show you how we can use these APIs to score, right? So I’ll just load the insurance data set and remove the target column. And now I’ve created this function, which is again pointing to the same tracking URI, and it’s basically calling the predict function. So let’s run this. Now all I have to do is pass my data, the name that I registered the model under, and the version number here, right? So if I run this, you see, you’ve got the predictions. Now, because the number one model was random forest, these predictions are from random forest.
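A scoring helper of the kind described here could be written as follows. This is a sketch assuming MLflow's model registry URI scheme and its generic pyfunc loader; the function names are hypothetical, and the registered model name passed in must match whatever was typed in the UI.

```python
TRACKING_URI = "sqlite:///mlruns.db"  # same database the training script logged to


def registry_uri(name, version):
    """Build a model registry URI for a given registered model and version."""
    return f"models:/{name}/{version}"


def score(data, name, version):
    """Load one registered model version and return its predictions for data."""
    import mlflow
    import mlflow.pyfunc

    mlflow.set_tracking_uri(TRACKING_URI)
    model = mlflow.pyfunc.load_model(registry_uri(name, version))
    return model.predict(data)  # data: a DataFrame without the target column


print(registry_uri("my_first_model", 1))  # models:/my_first_model/1
```

Changing only the version argument switches between the registered estimators, which is what makes it easy to compare or combine their predictions.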
Now version two is LightGBM, so you can see the predictions are also different. And version three is decision tree, right? So you can see now, you’re literally using different estimators, and you can now combine them or do whatever you want with them, right? This is equivalent to if you were to go in here and, let’s find our random forest model. Oops. This one. If you were to load this model from this file here and then use PyCaret’s functionality, you would basically get the exact same answer.
So both are equivalent, the difference is PyCaret’s native functionality is using this as a file system, here you can use it as a published API. All right. So now this brings us to the end of our demo and I would be happy to take any questions in the chat.
Moez Ali is a Data Scientist and creator of PyCaret (An open-source, low-code machine learning library in Python).