Niall Turbitt is a Data Scientist on the Machine Learning Practice team at Databricks. Working with Databricks customers, he builds and deploys machine learning solutions, as well as delivers training classes focused on machine learning with Spark. He received his MS in Statistics from University College Dublin and has previous experience building scalable data science solutions across a range of domains, from e-commerce to supply chain and logistics.
This workshop is part three in our Introduction to Data Analysis for Aspiring Data Scientists Workshop Series.
scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners. This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the New York Times (https://github.com/nytimes/covid-19-data). Prior basic Python and pandas experience is required.
Although no prep work is required, we do recommend basic Python and pandas knowledge and signing up for Community Edition before this workshop. If you haven’t already done so, watch Part One, Introduction to Python, to learn about Python, and watch Part Two, Data Analysis with pandas, to learn about pandas.
– So this is the third part in our introduction to data analysis for aspiring data scientists. Previously, if you joined the first two sessions, it was a three-part series, and I’m excited to announce that we actually turned it into a four-part workshop series. Today is part three, about machine learning, and part four is scheduled for next Wednesday.
So part four is going to be an introduction to Apache Spark, and I have a link there, but I’ll also drop the link in the chat after I’m done introducing everyone. And then just a call out that if you’d like to revisit the videos for part one or part two, those are available on our YouTube channel. So that is a short link to our playlist of online meetups, including workshops and tech talks. You can access that content there.
Great. So just a call out really quickly. The best way to get access to our content is to join our online meetup group, which I think most of you have, if you’re dialing in from Zoom, since that’s where you would have gotten the link to join us today. That’s the link there. So if you’re joining us from YouTube, I’d love for you to join the group. That’s where you get all the notifications of upcoming content, and we send some messages and different resources there. So please join the group. We are also broadcasting live through YouTube, and we do that for all of our tech talks, online meetups, and workshops. So make sure you’re subscribed to our YouTube channel and turn on notifications, which lets you stay aware of upcoming tech talks and content. The YouTube link is there at the bottom.
So we have two resources for you today. So one is the GitHub Repo. And I believe these resources were linked in the events page. So we just wanted to call them out again. Most of you probably signed up for Community Edition already. But here’s the link again, if you haven’t already done so.
Sorry, I get some notifications from my phone and I just wanted to check, because I can’t see the chat while I’m presenting. So I just wanted to make sure we’re all good. I’ll drop those links again in the chat, or one of the TAs will as well.
All right, so I’d love to have our instructors and TAs just do a quick introduction.
Let’s start with our TAs. Amir, maybe you can start, and then we can go to Kelly and Brooke, and then I’ll pass it along to our instructor for today, Niall.
– Hi, everyone, my name is Amir Issaei. I’m a data scientist and consultant at Databricks. I spend about 30, 40% of my time training different customers to do machine learning at scale, and about 40 to 50% of my time is spent on developing machine learning and AI solutions at Databricks. I’m on the same team as Brooke, and I’ll pass the mic to Kelly.
– Hey, y’all. I’m Kelly. I’m a solutions engineer at Databricks located in Los Angeles. I work mostly with startups who are new to Databricks, helping them figure out how they can best design their big data pipelines.
– And hey, everyone. My name is Brooke Wenig. I’m the machine learning practice lead at Databricks. So our team focuses on working with our customers to help them use machine learning, whether it’s building brand new pipelines for them or helping them scale pipelines. And then also working with the PM and engineering team to provide product updates and product feedback. So if any of you have feedback on the product, please feel free to ping me after this as well.
– Thanks, Brooke.
– Great. Sorry, Niall, to interrupt you briefly. I just remembered I wanted to give a quick shout out to the attendees about utilizing the Q&A and the chat. So in the chat we’ll send links, and if there are any audio issues or whatnot, ping us there. Please use the Q&A function if you have questions, and that’s where we’ll be answering everything. This session is recorded, and we’ll share all the resources, including the video recording, 24 hours after the broadcast ends. So sorry again, Niall, and take it away.
– No problem at all. Hi, everyone. So my name is Niall Turbitt. I’m a data scientist on our professional services and training team. I’m based out of London in the UK. So very similar to Amir, my time is split half and half between working with our customers to architect, develop, and deploy machine learning solutions at scale, and then on the other side of things, delivering trainings, which focus primarily on data science and machine learning. So, super excited to take you through a little bit of that today. What I would say is, don’t be shy with the questions. Put them into the channel and we can address those as we come to them or at the end.
So I’m just checking if we still have Karen.
– Yeah, we’re all set now. You can go ahead and get started.
– Okay, perfect. With that in mind, I’m just gonna give everyone a couple of minutes to get set up on Community Edition. So I’ll first share my screen. What I’m going to do is walk us through exactly what we will need to do in terms of getting set up on Community Edition and setting up a cluster itself. And then once we have that, what we will be doing is importing our notebooks into the Databricks environment itself. So what I will do is, like I said, give everyone a few minutes to firstly sign up. So if you haven’t already, the link is in the chat, so what we want to do is go to databricks.com/try-databricks, which will bring you to this page itself. If you haven’t signed up already, it’ll take a minute or so just to register, and what we want to do is get ourselves into this Community Edition. And what Community Edition is, is effectively a replica of the Databricks workspace whereby we can spin up our clusters, albeit with limited resources compared to what we would typically have. And then what we can do is spin up our own cluster and bring in the notebook that we will be using today to explore first steps of machine learning with scikit-learn.
The other thing I would say is if you are already set up with this, what you will be doing then is going to community.cloud.databricks.com where you would then be prompted to sign in with your login. So like I said, I’ll just pause for a minute or so, let everyone get signed up, get logged in, and then what I will do is take us through how we actually will spin up a cluster. And then I’ll take us through step two, that’s whereby we’re going to load in the notebook that we will be using for today. And if there are any issues, please do direct them to our TAs, if you’re running into any issues or difficulties regarding the first step that we’re going through at the minute.
So once we do have that and once we are in the actual workspace itself, our first step is going to be setting up a cluster. What we mean by spinning up a cluster is effectively spinning up a resource in a data center somewhere, and what we’re going to do is install our Databricks Runtime, which will allow us to connect and upload tools. So to do so, what we’re going to do is navigate down to the clusters tab on the left hand panel of the welcome screen. This will be present throughout whenever we’re navigating through Databricks itself. But once we get to here, what we want to do is create our cluster. So going to create cluster, what you can do is give your cluster a name. It doesn’t have to be anything in particular. I’m just going to give it my own name, so Niall.
In terms of Databricks Runtime version, we’re just gonna leave this as is. So the Runtime is specific to the things that we’re installing on our cluster when we spin it up. So on that virtual machine, or resource in a data center, that we’re spinning up, it’s the software that we’re effectively installing on it. We’re just going to leave this as the default, so Databricks Runtime 6.4. And pretty much that’s it, we’re going to leave everything as is. Once that is done, we’re going to create our cluster. So all you have to do at this stage is give your cluster a name, leave everything else as is, and click Create.
And what we’ll see is that is going to take a couple of minutes just to spin up.
So once we see this spinning wheel, we’ll see our status as pending. Now, as we wait for our cluster resources to be acquired, what we’re going to do is step through how to actually import the notebooks that we’re going to be using today. So as we said, let’s give this a moment or two to spin up. I’ll allow everyone just a minute or so to create that cluster. As I said, I don’t want to lose anyone at this stage, so take your time to set this up, and then I can show how we can go about importing the notebook that we will be using for today.
So as we’re waiting for our clusters to spin up, what I’m gonna direct us to is the GitHub Repository where we have today’s notebook that we will be stepping through.
So this should be posted into the question and answer channel. But where it is located is under our tech talks, and then we’re going to go into today’s tech talk. So April 22nd, Machine Learning With Scikit-learn, and what you’ll find is we have this Machine Learning with Scikit-learn IPython notebook. So .ipynb stands for IPython notebook. This is pretty much a universal format with regards to data science notebooks. And what we can effectively do is import this notebook into Databricks itself, connect it to our cluster, and then start running through the actual notebook itself. So, very simply, all we actually need to do is copy the link. I’m just going to go into this IPython notebook, copy the link, navigate back to my Databricks community environment, and navigate to home. So you can think of this as a file system. So we call this the Databricks file system. Navigating to home, I see that I’m in my home directory, as it were. And if I then right click, I can import, where I have the option to import from file or from URL. In this instance, I want to select the URL. And very simply, if I have the given notebook here in GitHub, what I can do is paste in the URL and click Import. And that might take a second or three just to actually import this notebook for today.
So again, I’ll take a minute or so to let everyone catch up with that just to make sure that we’re all following along, and if there are any issues or questions at this stage, please do direct them into the chat, and let us know if you are running into any trouble with that.
Perfect. With that in place, let’s start to actually walk through our notebook, and a little bit of justification for what we are doing today. So this is part three of our introduction to data analysis webinar series, and having looked at data analysis, in particular with pandas, and then also our introduction to Python, we’re very much armed with the skills to move on to the next steps in that machine learning pipeline, as it were. And what we’re going to do today is use something called scikit-learn. So scikit-learn is probably one of the most popular open source machine learning libraries amongst data science practitioners.
What we will be walking through in this is basically setting the scene as to what machine learning is, and we’ll be exploring what the types of machine learning are. And what I really want to arm you with is the ability to identify problems where machine learning could be applied, and which types of machine learning you could apply there. We’re going to address certain concepts, such as the train-test split, so very much a pivotal thing to understand whenever we’re speaking about how we train machine learning models. We’re gonna use scikit-learn in particular, and we’ll dive a bit more into what scikit-learn is and, again, why we’re using it, and then also specific techniques that we’re going to have to employ. So something called one-hot encoding, which is how we deal with data that isn’t numbers and how we feed that into our model itself. And we’re also going to be looking at something called pipelines, so how we chain together multiple transformations on our data, and then finally checking how well we did whenever we build the model. What we want to do is discuss evaluation metrics, and specifically, for our use case, what can we use?
Throughout this, we will be referencing the docs from scikit-learn, and any other docs in particular, simply to provide you with a bit of context as to why we’re using these techniques. And lastly, the dataset that we will be using today is relevant for, I guess, the time that we’re in, and so it’s gonna be the New York Times COVID-19 data. So, as well as arming you with the ability to use machine learning, we’re also demonstrating how we can apply it to very relevant topics today.
So one disclaimer as well, before we do start off, we’re exploring with linear regression today. I’m starting very simple.
The one caveat I would have around this, especially fitting to the data that we have in particular, is that linear regression may not be the most suitable algorithm for the dataset. However, what we are seeking to really solidify is the ability to apply machine learning to a given dataset, and it’s more so how to use the actual scikit-learn library, how we can go about reading data in with pandas, and how you can then fit a model using scikit-learn.
So with that, I wanted to just take a couple of steps back and put a definition on machine learning itself. I think there’s a lot of mysticism, especially from the press, on what machine learning is. And then also, at a higher level, what is artificial intelligence, and you hear terms such as deep learning as well. Let’s take a step back and differentiate between what these terms actually mean. So for machine learning in particular, I’d like to roll back to this very succinct definition: machine learning is effectively employing an algorithm that learns patterns in your data without being explicitly programmed. So we’re going to be utilizing machine learning algorithms and providing data to them, whereby we’re not going to supply explicit rules as to what they should do, but more so we’re leaving it to the actual algorithm itself to uncover structure in the data and learn those patterns. So effectively, we’re providing data to our algorithm, which is effectively a function that’s going to map our features to an output. And in our case today, that output is some form of prediction.
So with that, where does machine learning fit into the whole landscape of artificial intelligence, deep learning. So artificial intelligence, I guess, we can think of kind of a catch all term that’s going to refer to any kind of computer program that automatically does something. People often use the terms interchangeably, AI, ML, and deep learning.
But we can think of them as separate entities, as it were. In the sense that artificial intelligence is more of a broad term and effectively encapsulates any technique, which enables a computer to mimic human behavior. Then within that we can think of machine learning as a subset of AI consisting of more advanced techniques and models which then enable computers to derive structure, as I mentioned, from the data itself.
Machine learning, I guess, like I said, is the science of learning patterns from your data and not explicitly programming.
And a further subset of machine learning is deep learning, and deep learning has gained a lot of press in recent years through its ability to deliver really high accuracy on tasks such as object recognition, speech recognition, and language translation, amongst a host of others.
Deep learning in particular is where we use a certain subset of models from the field of machine learning, things called multi-layered artificial neural networks. We won’t be touching on that today, though, and will start off pretty basic.
In terms of our types of machine learning, so effectively, what I want us to do is, like I said, be able to discern what problems we should be applying certain machine learning techniques to.
In terms of types of machine learning, broadly, we split it down into two paradigms of how we approach certain problems. And what I mean by this is that our data sometimes has labels, our data sometimes does not have labels, and then there are other approaches that we can sometimes use. So to flesh that out a bit, one such area of machine learning that I wanna start with is something called supervised learning. And effectively, we can break this down into regression problems and classification problems. Very simply, in a regression problem, what we’re trying to do is predict results that lie within a continuous output, meaning that we’re taking our input features as our data, and what we’re trying to do is then map those variables, or those features, to some continuous output. An example of this may be trying to predict the house price given the number of bedrooms. So in this instance, we have one feature, which, if we were to plot it, might be the number of bedrooms, if we just ignore the axes here for a minute. One would assume that house price increases as the number of bedrooms increases. So here we’re taking one feature, which would be the number of bedrooms, and our output, our target, would be the price of that house. And not to be confused with this is the fact that we’re using linear regression today. Linear regression is just a subset of the overall regression supervised learning problems that we can have. And something to remember in particular is that a regression problem is simply any supervised learning problem whereby our target is going to be a continuous variable. A classification problem is effectively where we’re trying to predict results which lie in a discrete output.
So in this instance, to flesh that out with a given example: given some images of cuddly dogs and cuddly kittens, can we predict whether the image is a dog or a cat? So in this instance, we’re feeding in our data, and the output is going to be some discrete category.
In this instance, if it’s one of two outputs, it’s something called binary classification, whereas if we have more than two classes here, it’s called multi-class classification. On the other end of the spectrum, we have something called unsupervised learning. So in supervised learning we have explicitly labeled data. If you think about our house price example, given a number of bedrooms, we’re supplying examples whereby we know the price, and then what we’re seeking to do is predict the price of new instances: when new data comes through with a number of bedrooms, can we then predict the price? Likewise, for our classification supervised learning problem, if we’re given a new image, can we then predict if it’s a cat or a dog? However, unsupervised learning is at the other end of the spectrum, whereby we don’t have data with explicit labels.
It’s effectively an approach whereby we’re trying, again, to derive structure from the data whereby we don’t necessarily know the effect of our variables. And one example might be, if we have data about a set of customers, for example, and we have data regarding their demographic metadata on their spending habits, and what we want to do is cluster those customers together. What we can do is employ certain unsupervised learning methods to create those clusters and effectively distill some meaning from the inherent structure in the data itself.
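As a rough illustration of the customer clustering idea just described, here is a minimal sketch using scikit-learn’s KMeans. K-means is one possible unsupervised method chosen for this sketch, not something named in the talk, and the customer ages and spend figures are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers as [age, yearly spend]; note there are no labels.
customers = np.array([
    [22, 300], [25, 350], [24, 320],       # younger, lower spend
    [58, 2900], [61, 3100], [60, 3000],    # older, higher spend
])

# Ask for two clusters; random_state is fixed so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# Each customer gets a cluster id (0 or 1) derived purely from the data.
print(labels)
```

The algorithm is never told which customers belong together; the grouping comes only from the structure in the features.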
One small subset, or increasingly larger subset, of the machine learning landscape is something called reinforcement learning. So reinforcement learning has gained a lot of traction in recent years. You may have heard of some progress that has been made on the likes of AlphaGo, and underlying that is something called reinforcement learning. At a very simple level, the problem that reinforcement learning sets up and seeks to solve is essentially figuring out how, over time, you maximize a long-term reward.
So it’s effectively about programming an agent, which is what we call it, to take actions to then maximize a reward given a particular situation. So this agent interacts with the environment, it takes some kind of action, and it receives some feedback from that environment as to how well it is doing. And the chief aim of this agent over time is to take those actions so as to maximize that reward signal. For today, we’re not going to be diving into reinforcement learning, but we are going to be starting with our first regression problem, our first supervised learning problem. So what we will be doing is, firstly, importing our data, which is the New York Times COVID-19 data, and we’re going to use a linear regression model to predict the number of deaths resulting from COVID-19.
What I will say is, this dataset is already available on Databricks itself. So all you have to do, once your cluster is up and running, is attach to it. You should see a green circle once your cluster is up and running. And the first cell that we will be running is this %fs ls. Let me just run that, and then I can explain what exactly this command is doing.
So as we let that run: this New York Times dataset is available as a Databricks dataset, meaning that once you’re in Community Edition, or a Databricks workspace, we can import this US states CSV file and start utilizing it. So this %fs is simply a command that allows us to navigate through our Databricks file system, as it were. We see that we have this COVID-19 data directory, in which we have our US states CSV. So before we go ahead and load that in, I just want to do a bit of an aside as to what this dataset is. We see at the top that we have github.com/nytimes. I’m just going to open this up in a separate tab.
So at a brief level, the dataset that we’re gonna be utilizing is, as I said, this New York Times COVID-19 dataset, and essentially what they’ve done is track the number of cases and deaths of coronavirus for each state and each county in the US over time. And so the thing that we’re going to be doing today is seeing how we can use this data, along with pandas and scikit-learn, to build a linear regression model on it. The data itself, and we’ll see this when we actually go to import, comprises the date, so the given date on which the number of cases and the number of deaths are recorded, and the state in which we have that recording. We’ll also see this FIPS column. We won’t be using this column, but just a disclaimer that it’s effectively a geographic indicator for the region itself.
For future work, it’s something that you could potentially use to join with other external data sources as well. So it’s an indicator you could use if you want to join other data with the data that we’re using. Then we have the number of cases for a given day in that state, and the number of deaths as well. So going back to my notebook, what we’re gonna do is import, or read in, this CSV file. We’re gonna be using pandas to do so. So if you followed Conor’s webinar last week, we know that we can do import pandas as pd, and we can read in the CSV file simply by saying pd.read_csv. df.head() then will give us the first five rows of this dataset. And just note that this is zero indexed, so zero, one, two, three, four. And we see that we have the date, state, FIPS, cases, and deaths. If we want to see how many rows and columns we have, we can quickly get that by calling df.shape. So we see that we have 2385 rows and five columns.
Before we actually dive in to building a model on this, let’s do something called exploratory data analysis. So uncovering certain patterns in the data and basically just getting a feel for the data itself. The first thing that we’re gonna do is check out the relationship between cases and deaths, and see if there’s any kind of relation there. One would expect it.
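The read-in steps just described can be sketched as follows. This is a minimal, self-contained sketch: the rows are made-up sample values in the five-column layout from the talk, read from an in-memory string rather than from the actual file on Databricks.

```python
import io
import pandas as pd

# Hypothetical sample rows in the layout described in the talk
# (date, state, fips, cases, deaths); the numbers are made up.
sample_csv = io.StringIO(
    "date,state,fips,cases,deaths\n"
    "2020-04-13,New York,36,1000,50\n"
    "2020-04-13,New Jersey,34,600,25\n"
    "2020-04-13,Michigan,26,250,15\n"
    "2020-04-14,New York,36,1100,58\n"
    "2020-04-14,New Jersey,34,680,29\n"
    "2020-04-14,Michigan,26,270,17\n"
)

# In the workshop, the argument would be the path to the US states CSV.
df = pd.read_csv(sample_csv)

print(df.head())   # first five rows, with index 0 through 4
print(df.shape)    # (number of rows, number of columns)
```

With the real dataset the same two calls apply; only the argument to pd.read_csv changes.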
Initially, what we want to do, is just filter to a certain day. So I’m gonna filter to April 14th. To do so, what we’re gonna do is grab our df, so from above our data frame, and then subset the date column to April 14th.
What we’re then going to do is plot this. So we have a data frame, and what we can do is call .plot on this and supply our x axis as cases, our y axis as deaths, and the kind of plot that we want. This is just going to be a simple scatter plot. Then the figure size, and lastly, we’re going to give it a title as well.
If I run this, what we may need is the matplotlib inline there.
So yes, what you actually may need, if that’s not running for you, is this simple command. So %matplotlib inline.
If that doesn’t work for you, do reach out to your TAs and they can assist you with that. This is essentially just allowing us to plot this. So just taking this apart and analyzing it: as is to be expected, we see that as the number of cases increases, the number of deaths increases. We also see that we have significant outliers with New York and New Jersey. Again, to be expected, given the recent news of the high number of cases and deaths in these states. In particular, what we’re going to look at is our dataset without these outliers. So what happens with our plot if we take out New Jersey and New York? To do so, what we’re going to do is supply these two conditions. So supply df state where it doesn’t equal New York, and df state where it doesn’t equal New Jersey.
Just having a look at the head of this again, I mean, we can’t really tell that they aren’t contained in it, but what we can do is actually plot this. And it’s pretty much the same cell as we saw previously. What we’re going to do is, again, filter down to April 14th and call the same plot functionality. So plotting cases against deaths, as a scatter plot.
And then what we’re going to do is leave out New York and New Jersey, as explained in the title. In terms of this last line, the thing that I should call out is that it is effectively applying these labels to each of the data points. So taking cases, deaths, and state, and calling .apply, is effectively saying, for each row, unpack these values and place the label.
So as we see, without New York and New Jersey, we can get a better feel for what the data looks like. We see that there are a high number of cases and deaths in Michigan, CMS, Louisiana as well.
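The two filtering steps from this part of the talk, subsetting to one date and dropping the two outlier states, can be sketched with boolean masks. The data frame here is a small made-up stand-in for the real one.

```python
import pandas as pd

# Made-up stand-in for the real data frame.
df = pd.DataFrame({
    "date": ["2020-04-13", "2020-04-14", "2020-04-14", "2020-04-14", "2020-04-14"],
    "state": ["New York", "New York", "New Jersey", "Michigan", "Louisiana"],
    "cases": [1000, 1100, 680, 270, 215],
    "deaths": [50, 58, 29, 17, 10],
})

# Keep only the rows recorded on April 14th.
april_14 = df[df["date"] == "2020-04-14"]

# Combine two conditions with & to drop both outlier states.
# Each condition needs its own parentheses because & binds tighter than !=.
not_ny_nj = df[(df["state"] != "New York") & (df["state"] != "New Jersey")]

print(april_14)
print(not_ny_nj)
```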
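The labelled scatter plot described here can be sketched as below. The values are made up, and the Agg backend line is an assumption added so the sketch runs without a display; in a notebook you would rely on %matplotlib inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so this runs outside a notebook
import pandas as pd

# Made-up values for three states on a single day.
df = pd.DataFrame({
    "state": ["Michigan", "Louisiana", "California"],
    "cases": [270, 215, 255],
    "deaths": [17, 10, 8],
})

# Scatter plot of deaths against cases; .plot returns the matplotlib Axes.
ax = df.plot(x="cases", y="deaths", kind="scatter", figsize=(12, 8),
             title="Deaths vs. cases (made-up numbers)")

# For each row, unpack (cases, deaths, state) into ax.text(x, y, label),
# which writes the state name next to its point.
df[["cases", "deaths", "state"]].apply(lambda row: ax.text(*row), axis=1)
```

The column order in the selection matters, because ax.text expects the x position, then the y position, then the label string.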
Lastly, what we wanna do is look at New York versus California. Let’s see how they compare in terms of COVID-19 deaths. Initially, what we’re gonna do is create a data frame whereby we subset to where the state equals New York or where the state equals California. We also want to then pivot our data, and this is so we can then plot it. So what we’re going to do here is use date as the index of our data frame. And then what we’re going to do is have state as our columns, so our top variables, and then deaths as our values. fillna is basically saying, if there are any null values, fill them with the value zero.
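The subset, pivot, and fillna steps can be sketched as follows, again on a small made-up data frame rather than the real dataset.

```python
import pandas as pd

# Made-up rows; New York has no row for the first date.
df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
    "state": ["California", "California", "New York", "California", "New York"],
    "deaths": [0, 1, 2, 1, 5],
})

# Keep rows where the state is New York or California.
ny_ca = df[(df["state"] == "New York") | (df["state"] == "California")]

# One row per date and one column per state, with deaths as the cell values.
# A date with no row for a state becomes NaN, which fillna(0) replaces with 0.
pivoted = ny_ca.pivot(index="date", columns="state", values="deaths").fillna(0)

print(pivoted)
```

Calling .plot on a frame shaped like this draws one line per column, which is why the pivot comes before the line plot.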
So as we see, starting from January 25th onwards, there are no deaths. And we see then that they swiftly increase pretty much from March 17th, 18th onwards. And we see a rapid increase there. To actually plot this up, let’s call the same .plot functionality on our data frame and this time and make it a line plot.
Yep, so I just see another question as to what this apply lambda is? This is effectively unpacking the states and labeling them on our graph itself.
So as we see here, as we go forward in time, there are increases over time. California, in terms of the number of deaths, increases pretty slowly, but we see an exponential increase in New York. Again, to be expected given the news, but we see there’s a great disparity in the growth rates of deaths there.
Moving on from our exploratory data analysis, let’s get into doing some machine learning.
So one of the core pieces of theory, I guess, that we’re gonna have to deal with regarding our modeling is how we split our data. So we do something called a train-test split, and let’s look into the rationale of why we do this. This is for our supervised learning problems, whereby we have labeled data, and we’re gonna start with our full dataset. We’re then going to split our data up into a training set and a test set. This training set is the set of data that we’re going to fit our model to. So our model is going to learn patterns from this data, which is n by d, n being the number of rows, d being the number of features that we have. And each of these instances, or each of these inputs, has a given label. What we’re going to do is fit our model, and then assess how well our model does against this test set. So this test set is to see how well our model generalizes beyond just the training set. It’s all well and good fitting to your training set, but what we want to see is how it performs in the world, essentially.
So having fitted our model to our training set, we then make predictions against our test set, and assess how well we did. And this is where we get our accuracy: we predict for our given test set and see how well our predictions compare against the actual labels that we have for this test set.
And then that’s how we basically iterate and improve our model. Once we have satisfactory accuracy there, we can then take our model and predict on new entities that come in. Something to note here, however, is the fact that what we’re dealing with is something called temporal data. Temporal data effectively means time-based data. So time series data, in the sense that our records are day by day.
What we want to do here, instead of doing a random split, so randomly choosing a proportion of data for our training set and a proportion of data for our test set, is use all data from March 1st to April 7th to actually train the model. And then we’re going to hold out April 8th to April 14th as our test set. This will effectively allow us to assess how well our model has done. So in order to do that, what we’re going to do is create a train data frame and a test data frame. So taking this original data frame, we’re going to filter to greater than or equal to March 1st, and less than or equal to April 7th. And then for our test set, we’re going to grab any dates after April 7th. So we would have, effectively, April 8th to April 14th.
What we then do is create what’s called our x-train data frame, which will be this set of features, so this n by d features, and then y-train, which will be the labels for that training data. We’re then going to do the equivalent with our test set. Here our feature is cases, and our label, or the thing that we’re trying to predict, is deaths. So can we predict the number of deaths, given the number of cases?
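The date-based split described above can be sketched as follows. This is a minimal illustration, not the workshop notebook: the data frame is a small synthetic stand-in, the column names `date`, `state`, `cases`, and `deaths` are assumed, and the year 2020 is assumed for the dates.

```python
import pandas as pd

# Hypothetical stand-in for the state-level data: one row per state per day.
dates = pd.date_range("2020-03-01", "2020-04-14", freq="D")
df = pd.DataFrame({
    "date": list(dates) * 2,
    "state": ["New York"] * len(dates) + ["California"] * len(dates),
    "cases": list(range(0, 100 * len(dates), 100)) + list(range(0, 50 * len(dates), 50)),
})
df["deaths"] = (df["cases"] * 0.03).round()  # made-up labels for illustration

# Temporal split: train on March 1 to April 7, test on April 8 to April 14.
train_df = df[(df["date"] >= "2020-03-01") & (df["date"] <= "2020-04-07")]
test_df = df[df["date"] > "2020-04-07"]

# Features (n by d) and labels for each split.
X_train, y_train = train_df[["cases"]], train_df["deaths"]
X_test, y_test = test_df[["cases"]], test_df["deaths"]

print(X_train.shape, X_test.shape)
```

Splitting on the date rather than at random keeps every test row later in time than every training row, which matches how the model would be used on new days.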
So as I said, we’re going to start pretty simple in terms of the model that we’re using, and we’re gonna use something called linear regression. Linear regression is pretty much the first go-to model that you typically learn as a data scientist. It’s super powerful and super popular, but it’s also fairly simplistic in its assumptions, but it allows us to fit a pretty good model to certain use cases. In this instance, what we seek to do is effectively fit a line to our data. What this looks like mathematically is something that looks like this in terms of y hat.
So any time that you see a hat on a certain symbol here, it effectively means a prediction. So can we generate a prediction according to this equation? So, w sub zero, meaning our weight for where the line goes through the intercept, plus our weight for our feature times that feature.
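Written out, the equation being described on the slide is presumably of the following form. The symbols are assumptions: x stands for the single feature (number of cases), w_0 for the intercept, w_1 for the feature weight, and epsilon for the error term.

```latex
\hat{y} = w_0 + w_1 x, \qquad y = \hat{y} + \varepsilon
```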
And what we’re effectively saying is that we want to get close to y, so the actual true value of the deaths, given our prediction and some error term. So effectively what we’re doing is seeking to fit a line through our data, and please note that this is not the data that we’re actually fitting to, but purely an illustration of how linear regression works. But we’re effectively fitting a line through the data such that it minimizes the distance between every single data point and the given line. Meaning that whenever we see a new instance come in, we use this line to predict. So for something where x is 0.4, we’re going to look across to where y is on the line and predict that y value.
So here, what we’re gonna be using is Scikit-learn to fit this linear regression model. So I’m just going to show us the Scikit-learn linear regression documentation. So here we’re fitting an ordinary least squares linear regression, ordinary least squares effectively saying we’re minimizing the squared distance between those points and the line. And what we see is that we can pass in parameters such as fit_intercept and normalize. I’m not going to worry about these just yet. So what we can do is leave the default parameters as they are, and after importing LinearRegression from sklearn.linear_model, we say LinearRegression with the parentheses, effectively saying use the defaults, and then call .fit on x-train and y-train.
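The fit described above can be sketched as follows. The training data here is synthetic and hypothetical (deaths generated as roughly 3% of cases plus noise), so the printed numbers will not match the workshop output.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: deaths are roughly 3% of cases, plus noise.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 10_000, size=(200, 1)).astype(float)  # cases
y_train = 0.03 * X_train[:, 0] + rng.normal(0, 5, size=200)     # deaths

# Default parameters, which means an intercept is fitted as well.
lr = LinearRegression()
lr.fit(X_train, y_train)

# The fitted line: intercept_ is w_0, coef_[0] is the weight on cases.
print(f"deaths = {lr.intercept_:.4f} + {lr.coef_[0]:.4f} * cases")
```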
Then what we’re going to do is print, oh, sorry, I forgot to run my cell above, and below. And what we see is our line equation for number of deaths, in this instance. So having fitted a line that minimizes the sum of the squared distances, the line that has been fitted is of the equation: minus 8.9911 plus 0.0293 times cases. So let’s unpack that and discuss what that means. It’s effectively saying that if cases are zero, the number of deaths is less than zero, which doesn’t make much sense. So what we can do is set our intercept to be zero, which effectively means that we force the line to go through zero, which intuitively, in this instance, makes sense. If we have zero cases, then we will be predicting zero deaths.
So let’s fit it again, although this time, we’re gonna say fit_intercept equals False, and then fit again to our x-train and our y-train.
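A minimal sketch of refitting without an intercept, using a tiny hypothetical dataset in which deaths are exactly 3% of cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data for illustration only.
X_train = np.array([[100.0], [200.0], [400.0], [800.0]])  # cases
y_train = np.array([3.0, 6.0, 12.0, 24.0])                # deaths

# fit_intercept=False forces the fitted line through the origin.
lr_no_intercept = LinearRegression(fit_intercept=False).fit(X_train, y_train)

print(f"deaths = {lr_no_intercept.coef_[0]:.3f} * cases")
print(lr_no_intercept.predict(np.array([[0.0]])))  # zero cases gives zero deaths
```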
And let’s see what our equation then looks like. So the number of deaths is going to be 0.029 times the number of cases. So how do we interpret that? We’re effectively saying, if we increase the cases by one unit, on average, we’re increasing the number of deaths by 0.029. So, in effect, this is implying a 2.9% mortality rate across the dataset. What we do know is that some states have higher mortality rates than others. So what we can do is actually include state as a feature.
However, to do so, what we need to introduce is something called One-Hot Encoding. So in essence, what we need to feed into our model is numeric data; however, our state data is going to be non-numeric data. So one way of handling non-numeric features is something called One-Hot Encoding.
So how do we actually do One-Hot Encoding, and what can we not do here? If we think about a very simple example, whereby we have the states New York, California, and Louisiana, can we simply just encode these as New York equals one, California equals two, Louisiana equals three? In essence, what this does is imply that California is two times New York, and Louisiana is three times New York, so we’re in effect introducing a spurious relationship between these inputs. So one idea is to create a dummy feature. So basically a dummy feature is a binary one or zero for whether that given instance, or that given row, applies to that state or not. So for New York here, what we would have is 1 0 0, California would be 0 1 0, and Louisiana would be 0 0 1. And again, this is something called One-Hot Encoding.
So what we can do is One-Hot Encode our states. So let’s try by firstly importing OneHotEncoder. We’re going to take both our training and test data frames, and we’re gonna say use cases and state, so basically the features that we’re going to fit our model to, and then we’re gonna define this One-Hot Encoder. So handle_unknown is basically saying, if when we go to apply this again we come across a state that we haven’t seen before, ignore that. So essentially, in this instance, if we see anything that isn’t New York, California or Louisiana, ignore it. We’re also going to set sparse to be False. This is purely a consideration of how you store sparse values. So for example, if we had many states, we would in effect have many zeros, and what we could do is optimize the way we store this data by only storing the nonzero values. So we’re gonna take our defined One-Hot Encoder, and we’re going to fit that to the train set and then transform the train set.
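A small sketch of the three-state example, under a few assumptions. The helper `make_encoder` is hypothetical and exists only because recent scikit-learn releases renamed the `sparse` argument to `sparse_output`. Note that scikit-learn orders the one-hot columns by sorted category, so the column order differs from the slide, and that with `handle_unknown="ignore"` an unseen state is encoded as all zeros rather than the row being dropped.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def make_encoder():
    # Hypothetical helper: newer scikit-learn uses sparse_output, older uses sparse.
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=False)

train_states = pd.DataFrame({"state": ["New York", "California", "Louisiana"]})

enc = make_encoder()
encoded = enc.fit_transform(train_states)  # dense array, one column per state

print(enc.categories_)  # categories are sorted: California, Louisiana, New York
print(encoded)

# A state not seen during fit gets a row of zeros instead of raising an error.
unseen = enc.transform(pd.DataFrame({"state": ["Texas"]}))
print(unseen)
```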
So once we have done this, what we see is we have this One-Hot Encoded data, and we see that it is binary. So we have a one and then zeros for the rest of the entries. Let’s just quickly check the shape of that though, and see if it makes sense.
So what we see, after we have fit to our training set and transformed our training set, is that we have 1754 rows; however, what we have are 914 columns. So something isn’t quite right there. There certainly aren’t 914 different states.
So what we’re gonna do is have a look at the categories here. And something that stands out immediately is the fact that we have these numbers.
And only then do we have the states as a category.
So what’s happening here is it’s not just the states that we are One-Hot Encoding, but then also the cases as well. So the other feature that we have specified.
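The pitfall described above can be reproduced on a tiny hypothetical data frame: when the encoder is handed both columns, every distinct value of `cases` becomes its own column too.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration only.
X_train = pd.DataFrame({
    "cases": [10, 25, 40, 10, 60],
    "state": ["New York", "New York", "California", "Louisiana", "Louisiana"],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(X_train)  # sparse matrix by default

# 4 distinct case counts + 3 states = 7 columns, not 3.
print(encoded.shape)
print(enc.categories_)  # first the numeric case values, then the states
```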
What we want to do now is actually use something called, a column transformer, where we can specify only one column to use or to basically One-Hot Encode in this instance. So here, we define our column transformer by giving it a name, so just any name would do here. We’re going to then specify what the actual object is. So enc is this One-Hot Encoder from above.
And remainder equals passthrough effectively means that any other columns that are not state are just passed through, and the encoder is not applied to them. So we call column transformer .fit_transform on our x-train and check the shape, and we see that we have 56 columns now. Why is it 56? We also have territories in here as well, so Virgin Islands, for example, and Northern Mariana Islands.
So it’s not just states in there as well.
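A minimal sketch of the column transformer step on hypothetical data. The step name `"enc"` is arbitrary, and `sparse_threshold=0` is an added choice here so that the output is always a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration only.
X_train = pd.DataFrame({
    "cases": [10, 25, 40, 10, 60],
    "state": ["New York", "New York", "California", "Louisiana", "Louisiana"],
})

enc = OneHotEncoder(handle_unknown="ignore")

# One-hot encode only the state column; pass every other column through untouched.
ct = ColumnTransformer(
    [("enc", enc, ["state"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_transformed = ct.fit_transform(X_train)

# 3 one-hot state columns followed by the passed-through cases column.
print(X_transformed.shape)
print(X_transformed)
```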
So putting this all together, what we can do is create what’s called a pipeline. And what this pipeline allows us to do is chain together a series of these different transformations. And what it also allows us to do is ensure that anything that we do to our training set, we can also do to our test set. So in our pipeline, what we’re gonna specify is our column transformer and then our linear regression model, and we’re gonna fit that to our x-train and y-train. And then we’re gonna create predictions by calling .predict on this pipeline model with x-test.
And then we can have a look at how we’re performing.
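The pipeline described above can be sketched as follows. The data is hypothetical and constructed so that each state has its own offset and a shared 3% slope on cases; the step names `"ct"` and `"lr"` and the use of `fit_intercept=False` inside the pipeline are assumptions, not something the talk confirms.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: New York is 1 + 0.03 * cases, California is -1 + 0.03 * cases.
train_df = pd.DataFrame({
    "state": ["New York"] * 3 + ["California"] * 3,
    "cases": [100, 200, 300, 100, 200, 300],
    "deaths": [4, 7, 10, 2, 5, 8],
})
test_df = pd.DataFrame({
    "state": ["New York", "California"],
    "cases": [400, 400],
    "deaths": [13, 11],
})
X_train, y_train = train_df[["cases", "state"]], train_df["deaths"]
X_test, y_test = test_df[["cases", "state"]], test_df["deaths"]

ct = ColumnTransformer(
    [("enc", OneHotEncoder(handle_unknown="ignore"), ["state"])],
    remainder="passthrough",
    sparse_threshold=0,  # keep the intermediate output dense
)

# Stage 1 encodes state, stage 2 fits the regression on the encoded features.
pipeline_model = Pipeline([("ct", ct), ("lr", LinearRegression(fit_intercept=False))])
pipeline_model.fit(X_train, y_train)

# predict applies the already-fitted column transformer to X_test, then the model.
y_pred = pipeline_model.predict(X_test)

# Coefficient order follows the transformed columns: California, New York, cases.
coefs = pipeline_model.named_steps["lr"].coef_
print(y_pred, coefs)
```

Because the encoder lives inside the pipeline, it is fitted once on the training data and only applied, never refitted, when predicting on the test set.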
So we see that we have now fit a new model. And in essence, we now have a different coefficient for our cases feature. If we actually have a look at the coefficients for our different states, the way that we can interpret these is, if we have a positive coefficient, so the effect for Louisiana, for example, is significantly higher than for others. So as this coefficient increases, it effectively indicates that the number of deaths is increasing there. And I’m saying that with all other variables held equal.
Some things that stand out here are New York being negative, and we see that California is also relatively negative.
So ultimately we need to assess how well our model is doing in terms of these predictions.
With regression models, and supervised learning in particular, what we want to do is assess how well our model is doing with regards to unseen data. And something called RMSE is what we can use to assess it. So effectively what RMSE is, is the root mean squared error. So please don’t let this math turn you off just yet. If we break this down bit by bit, it’s a good way of answering the question of, how far off should we expect our model to be on its next prediction? So first, for every single label that we have, so y i, we’re gonna subtract the prediction for that given instance. Then we’re going to square that error, we then sum those up, we divide by the number of instances we have, and then take the square root of that. So what we can ultimately do is import mean_squared_error. We can then calculate the square root of that to get from our mean squared error to RMSE. And what we’re effectively saying is, kinda, on average, how far off should I expect myself to be on my next prediction?
Just very quickly, visualizing our predictions themselves to see how we did, we can see that we have our predicted deaths here and our actual deaths here. And we do this by concatenating together our two data frames and getting our predicted deaths together as well.
So I know I’m running close to time. So I do wanna stop just now and take any questions that come through if we do have time for them. But hopefully you were able to follow along. Hopefully you can take away some things here. Lastly, if you do want to apply some of these methods, do check out Scikit-learn in general, but also datasets that you can openly use from the UCI ML Repository and then also Kaggle as well, which you could then actually earn money from as well. So thank you.
And if we do have time, I’d be more than happy to take questions.
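The RMSE calculation walked through in the talk can be sketched as follows, using made-up labels and predictions. It computes the value both from scikit-learn's `mean_squared_error` and by hand, following the steps as described: subtract, square, sum, divide by the number of instances, take the square root.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted deaths for four test rows.
y_test = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 40.0])

# Mean squared error from scikit-learn, then its square root.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# The same quantity written out step by step.
rmse_manual = np.sqrt(np.sum((y_test - y_pred) ** 2) / len(y_test))

print(mse, rmse, rmse_manual)
```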
– So now one common question that came through from a few folks is, why would you use the One-Hot Encoder from Scikit-learn instead of pd.get_dummies? – Sure, so if I just scroll up to it, with our implementation here, so the One-Hot Encoder, what it allows us to do is handle unknown values. So basically, with get_dummies, what you’re doing is applying it to that given data frame only. Here we have the option to actually handle unknown categories if they come through, and we can also introduce this aspect of storing a sparse vector as well, to reduce the space that we actually use to store those values.
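A small hypothetical comparison of the two approaches: `pd.get_dummies` derives its columns from whatever data frame it is given, so train and test can end up with different columns, while a fitted `OneHotEncoder` keeps the training columns and encodes an unseen state as zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the test set contains a state that training never saw.
train = pd.DataFrame({"state": ["New York", "California"]})
test = pd.DataFrame({"state": ["California", "Texas"]})

# get_dummies is applied to each data frame independently.
train_dummies = pd.get_dummies(train["state"])
test_dummies = pd.get_dummies(test["state"])
print(list(train_dummies.columns), list(test_dummies.columns))

# The encoder remembers the training categories and reuses them on the test set.
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
test_encoded = enc.transform(test).toarray()
print(test_encoded)  # columns: California, New York; the Texas row is all zeros
```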
– Hey, Karen. I’m not sure how much time we have left for other questions since I know we’re at the hour. – Yeah, I just posted in chat and, you know, if folks need to drop, no worries. I think we have a few minutes, we can take a few other questions, and people can drop if they need to leave. We’re recording, so if there are some more questions that you all feel make sense to answer, let’s take a few minutes if everyone has a few extra minutes.
– All right, there are quite a few other questions about how Spark and Koalas fit into this whole picture, and MLlib versus Scikit-learn, if you want to take a minute to answer some of the single node versus distributed questions. – Sure. So if we unpack that a bit, what Koalas seeks to do is effectively use the Pandas API, but on top of Spark. So a lot of what we have done here with regards to our data processing with Pandas, you can effectively port over to Koalas. So Koalas, just for a bit of context here, is an open source package that sits on top of Apache Spark. So what we’ve been doing here is all on a single node. So what we mean by single node is that this isn’t distributed in any way. If you were thinking about using a large amount of data, what you effectively could do is distribute that data across a cluster, so many different virtual machines as it were, and this is effectively what Spark is doing. Koalas then allows us to use this exact same syntax, but under the hood, it’s utilizing Apache Spark to distribute that data. So with Koalas, you can do really large scale data processing, and especially if you have existing Pandas code, you can very quickly convert it to Koalas. Simply, instead of import Pandas as pd, you could do import Koalas as ks. And then it would be a simple one to one mapping of the actual API itself.
– Great. And so a few other folks asked questions about the difference between Scikit-learn versus MLlib, if you want to discuss a bit about that. – Yeah, sure. And it again comes to what I was saying with regards to a single node setting versus a distributed setting. So Scikit-learn is all single node based in the sense that we’re working off one machine. So even in the context of Databricks, where we’re setting up a cluster, whenever we run Scikit-learn in this instance, we’re only running on the driver, so it’s not distributed in any way. Spark MLlib is inherently distributed. So with the underlying algorithms, whenever we call MLlib models, so the Machine Learning Library for Spark, everything that is done there with regards to fitting data is done in a distributed way, versus Pandas and Scikit-learn, which do so in a single node setting. – Awesome. All right, and then there’s still quite a few questions about the One-Hot Encoder, if you could go back there. – Yeah, sure. – One of the questions is about what the ignore parameter does, and the other one is about sparse equals false, they thought they were only supposed to see ones.
– So yeah, handle unknown. So effectively, we think about this in the context of our training and test set. So whenever we were fitting this One-Hot Encoder here to our training set, there’ll be a certain number of, in this case, states that we see.
What can happen is when we go to apply this to our test set, what happens if there is a state in there that we haven’t seen before? We need to tell our One-Hot Encoder what to do with that. So what we can do is effectively tell it to error, so present an error if, whenever we go to apply it to our test set, it sees something that it hasn’t seen before, or else we can handle that by simply saying ignore, so we just ignore that unseen value rather than raising an error. On this sparse equals False: if we had sparse equals True here, what would change is how we represent this array. So sparse equals False means that we’re not going to store it as a sparse vector, a sparse vector being, as I said, whereby we only store the values and the indices of the nonzero values.
– Perfect, and then somebody else was asking a question about how you can go back and forth between Spark and Pandas without Koalas.
– Between Spark and Pandas? – Uh-huh. – So yeah, if you have a Spark data frame, it’s very simple, you call .toPandas(), and that will give you a Pandas data frame. The one caveat I would have around that is being careful with the amount of data that you have in your Spark data frame that you’re trying to convert to a Pandas data frame, in the sense that if you have a Spark data frame distributed across the cluster, whenever you call .toPandas(), what you’re effectively doing is bringing all that data to the driver, so you could effectively crash your driver if you don’t have enough memory.
– Yeah, one other thing I’d like to add on top of that too is you can always save your data out to disk, write it out to a CSV file, and then you can use Pandas to read it back in or Spark to read it back in. – Yeah, great point. – All right, and then one last question is about the column transformer. It seems like folks had a little bit of confusion about how the pipeline works and how it applies to the test dataset, if you want to cover that again. – Yeah, of course. So what we have done with regards to our column transformer, if I go back up to enc, is effectively take our One-Hot Encoder and apply it to state. So whenever we call fit_transform using our column transformer on our x-train, what effectively is happening is that it’s looking for a state column, and then it’s going to apply this One-Hot Encoder to it. So what we have then, in our pipeline, is effectively these stages. So our first step is this column transformer, in which we’re going to apply the One-Hot Encoder to state, and anything else we’re going to pass through. So the fact that we also have cases as a feature there, we’re gonna keep that, but we’re only going to apply the One-Hot Encoder to state. It effectively allows us to apply the One-Hot Encoder to only one variable.
– Perfect, and then I think there’s a question on how to apply it to the test set. – Yes, and applying it to the test set would be just the same. So if we have our pipeline model, you can then do .predict on your test set, and what this is going to do is apply these same steps to your test set. So we have our fitted pipeline model that’s fitted to the training set. Whenever we call predict on our test set, what’s effectively happening is we’re running through this pipeline. We’re going to apply this same column transformer to your test set, and then make predictions, given our fitted linear regression model.
– All right, Karen, do you have time for more questions, or should we wrap up here? – How about one more question and then wrap up? – All right, one more question is what does fit_transform do? So I think there’s some confusion about .fit versus .transform. – Yeah, of course.
If we compare to above, where we call .fit on our training set and then .transform, .fit_transform is effectively combining these two steps together. So in this instance, what we’re doing is both fitting to and transforming the actual x-train here. So the resulting output of this is going to be of this shape. So we have our rows, but then, if I actually show us what it looks like, we would have an array. So effectively here, we have both our number of cases, and then also our One-Hot Encoded columns as well. So it’s done in one step, basically: it’s both fitting to and then transforming the x-train there.
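A short sketch of the equivalence being described, on hypothetical data: calling `.fit` and then `.transform` gives the same result as a single `.fit_transform` call, and in both cases a new array is returned while the input data frame is left unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration only.
X_train = pd.DataFrame({"state": ["New York", "California", "Louisiana", "New York"]})

# Two steps: learn the categories, then encode.
two_step = OneHotEncoder(handle_unknown="ignore").fit(X_train).transform(X_train).toarray()

# One step: fit_transform does both on the same data.
one_step = OneHotEncoder(handle_unknown="ignore").fit_transform(X_train).toarray()

same = np.array_equal(two_step, one_step)
print(same, one_step.shape)
```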
– Great. Well, that’s all the time we have for questions. Karen and Kelly, do you want to promote the next workshop session on Spark? – Sure, yeah. So thanks everyone for joining us. Our fourth of the four part series is coming up next Wednesday on the 29th, same time, 10 a.m. pacific time, we’d love for you all to join. I’ll send you links to everything, it looks like Kelly already put it in the chat. Thank you. So I hope you can join us. And thanks, Niall, a great presentation.