Conor Murphy transitioned to the tech sector after spending four years leveraging data for more impactful humanitarian interventions in developing countries with a focus on business development. He managed a multi-million dollar portfolio of grants for The Rotary Foundation focusing on developing and analyzing impact measurements in economic development initiatives, evaluating program participation and translating academic research into institution policies.
He has held a variety of positions including a faculty role for Galvanize’s Data Science graduate program, principal data scientist and consultant for a number of startups and a data scientist and educator at Databricks. Outside of data, Conor is an avid skydiver who’s always looking for geeky ways to quantify his time in freefall.
This workshop is part two in our Introduction to Data Analysis for Aspiring Data Scientists Workshop Series.
This workshop is on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you will learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple visualizations. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19) (https://github.com/CSSEGISandData/COVID-19). Prior basic Python experience is recommended.
Although no prep work is required, we do recommend basic python knowledge. Watch Part One, Introduction to Python to learn about Python.
This is our part two, so data analysis with pandas.
And next week we’re gonna have the third workshop on machine learning and that’s next Wednesday, the 22nd at 10:00 AM same time as this week. And then also just to call out for part one, that was recorded last week. That was Introduction to Python and Databricks and the video is available through our YouTube channel. So that’s a kind of shortened link there. I’ll also drop it in the chat for everyone to easily access.
So just a quick note for everyone joining us,
the best way to stay up-to-date on all of our content, all of our workshops, our interviews with different engineers, developers, and also Tech Talks is going to be joining our data plus AI on my Meetup group. There’s a link to that, again, in the followup materials, I will also link that as well. So don’t worry about trying to copy it down right now. And then also, I know that there’s a lot of folks joining us from YouTube, which is really awesome and really exciting to see. So just a note there to subscribe to our channel and to make sure you’re turning your notification. So what we try to do is at least have a couple of days before I’ll post the upcoming live for whether it’s a Tech Talk or a workshop. I’ll post that so you can set a reminder for yourself if you’d rather watch it through YouTube.
And so here’s two links to prep everyone. If you joined through our Meetup group, you should have received a message yesterday for me encouraging you to sign up for Community Edition in advance. So while I have you all on the intros, check out those links and that’s what you’re gonna need to be prepared for this workshop today. I think we have a little time just to do some, if there’s any issues doing some troubleshooting with the TAs, but sign up for Community Edition. I think maybe a lot of you already are if you joined us for the workshop last week. Just to call out, these are the two links to check out for today’s workshop. And then just a note, well, before I let our TAs and Instructor Conor introduce themselves.
So this is recorded, so don’t worry about if you would like to visit the session at a later time. We’re happy you also got all the resources. If you join through Zoom, you’ll get an email within 24 hours that has again, the links to everything that you need. So the link to the video recording, that GitHub link, the link to the Meetup group, also the YouTube channel so you can view everything upcoming and then also kind of check out what we have coming up.
Some folks are using the chat just to kind of call out what schools you’re from, where you’re dialing in, which is awesome. If we could direct questions to the Q&A feature, that’d be perfect. And that’s where we’re gonna have our TAs answering questions. And then if there’s any time at the end, Conor and Brooke and Chenglyn will also answer a few questions that they think might be helpful for the entire group. So without further ado, Brooke and Chenglyn, if you could just do a quick introduction and then we’ll go ahead and get started here.
– Sure, my name is Brooke Wenig. I am the machine learning practice lead and I’m super excited to get to work with my team members, Conor and Chenglyn on this webinar.
– Hi, I’m Chenglyn. I am on the machine learning practice team working with Brooke. I’m a data science consultant. Some of you might recognize me before for our previous webinar, but I did (mumbles) trainings and also help customers in premise design solutions.
– Awesome, thank you. So now we’ll pass it over to Conor, who is our instructor for today. So, Conor, take it away. – Perfect, thank you. Yeah, so my name is Conor Murphy. I work closely with Chenglyn and Brooke on our machine learning practice team. And so that’s a hybrid team where we are consultants for machine learning and data science projects. And we also do some instruction as well. And so, great. Let’s go ahead and get started. Let me go ahead and share my screen and so let’s see. Let me just make sure this is the right screen.
There we go and hopefully everybody can see this okay. And so to go ahead and get started, let’s first make sure that we have access to Databricks. And so you can either go to databricks.com in the upper right hand corner. You can click try Databricks and you can sign up here or if you were with us last week, instead of doing that, you can go to community.cloud.databricks.com and you can access that here. And so I’ll just give everybody a minute to go ahead and get started there. And once you have signed into there, go ahead and drop a quick message on the chat so we know that you’ve gained access to it.
And if you have any challenges along the way, feel free to post in the Q$A channel as well.
Cool, it looks like a couple of people have gotten in.
Awesome, we’ll just give it a couple more minutes.
So it looks like a lot of people have access to it. So why don’t we go ahead and keep on moving and then if you have any challenges along the way, just feel free to post that in our Q&A and we can go ahead and have our TAs help you out throughout that process. But once you’re signed in, you should see a learning page that looks something similar to this. So if you were with us last week, this should look pretty familiar. If you weren’t with us last week, that’s totally fine as well. So this is the Databricks environment. Among other things, we allow hooks in notebooks that execute code against clusters. And so those clusters are just resources that we’re going to create for you so that you have somewhere to actually run your code. So instead of running this code locally on your laptop or your local computer, you’ll be running this in the cloud. And so you’ll be running this on Amazon’s infrastructure. So on the left hand side of the screen, you can go ahead and click the clusters tab. And then you can click, create cluster. If were with us last week, you might see a cluster that’s already here that just needs to be restarted. You can click, create cluster here. You can give it a name, you can call it whatever you want. I’m just gonna call it this pandas. And then I’m gonna click, create cluster. The rest of these configuration should be totally fine.
So I can click, create a cluster, like that. And then in a moment it’ll appear here on the screen. And I’ll give you guys just a second to catch up with that.
And then go ahead and let us know in the chat if you’ve been able to spin up a cluster okay.
So it looks like somebody asked if they can use the previously created cluster. Yes, you can use that cluster. You just need to restart it.
– [Brooke] Sorry, actually that’s not enabled on Databricks CE. It’s enabled in the enterprise version. This one you will have to recreate it from scratch. – Got you, thank you, Brooke. – [Brooke] Yeah.
– Do I have to register my credit card? No, you shouldn’t need to register your credit card for this. So this is a completely free offering that just allows you to run code within the Databricks environment.
Cool, so it looks like many of you have this up and running. And so the next step is to import the notebook that we’re gonna be using in today’s lesson. So if you click on the home button up here in the upper left hand corner, and then click this dropdown here, you can then click in code. And so this will allow us to import that notebook. So once you click import, click from URL and then, let me paste this link here, you can import from this endpoint here.
And so I’m just gonna paste that in there, click import. And then in a second we’ll see a new notebook appear here. And so it should look something like this.
And once again, I’ll just give you a few moments to take care of that.
And just drop us a note once you have that notebook imported and your cluster started.
It looks like some people, their clusters are still pending. That’s totally fine. So those should get started relatively soon. And so normally it takes, I don’t know, maybe one to three minutes or so to start that cluster. And so if I click this dropdown menu here, I’ll see this cluster that I have available to me and I can see by the greenlight here that it is running. And if you have a spinning wheel there, that just means that it’s still starting up. And so this is a free resource. This is what’s actually going to be running the code that we’ll be executing from this notebook. So we have about 15 and a quarter gigabytes. We have two cores, and the rest of these configurations are just the version of the software that we’re running. So from that dropdown menu, I can just click on that cluster and then I will have attached this notebook to that cluster. So now I can execute code on that cluster.
So cool, let’s go ahead and get started. So today we’re gonna be introducing data analysis using pandas. So we’re gonna start off, we’re gonna motivate why pandas matters. Talk a little bit about what it is and its history. And then we’re going to go through at a high level, how to use this package. So we’re going to be using the COVID-19 dataset. And so this is going to be available for you. There’s also a lab after the fact. And so you can go through, do your own analysis, and drill down into your country or your state or your area and get a sense for what those trends are.
And at the end, we’re gonna be doing some visualization as well. I also wanted to give you a few resources that you can use after the fact. And so you have a couple of links here. First, this is a cheat sheet. So this is a cheat sheet provided by the creators of pandas.
And it’s a really effective tool for jogging your memory when you’re trying to remember how to do something. And so to give you a sense for the basic syntax, some of the basic operations that you can do on data, some of the filter operations, how to handle missing data, that sort of thing. And so that cheat sheet is incredibly helpful. I also linked the pandas docs here as well. And so the pandas documentation looks something like this. And this is going to be the main reference that you use if you ever get stuck on anything. So there’re a couple of helpful getting started guides, so you can always check these out. You can go through these independently. There’re some video resources on here as well, but this should give you just more resources that you can be more successful. And then if you need to look at the API reference docs as well, you can see that here. And so this will give you a lot more detail on each and every one of the functions or methods that we’ll be using in the next hour.
Cool, so let’s get started.
I’ve taught a fair amount of intro programming in my class or in my career. And one way that I always like to talk is just with the insanely, insanely big picture. And so first off, just motivating why do you use pandas? Well, one of the reasons why humans are such a successful animal is because of our use of the tools. And computers are arguably one of the most powerful tools we’ve ever created. And so if you can truly write code, you can unlock the full power of those tools. And I always find that to just be incredibly exciting. And so I like to share that with anybody when I first start teaching introductory programming. So that’s a little bit of motivation for why writing code is so effective. But now why would we wanna use pandas specifically? And so pandas is a data munging library available in the Python environment or within the Python programming language. And so more and more industry trends are showing that data is leading the decision making. And so you can use similar tools. You can use something like Excel instead. However, pandas allows certain benefits beyond something like spreadsheeting tools. So for instance, if you wanna automate your analysis so that reruns on any new data each day, you can find ways to do that within Excel, but it’s generally a little bit challenging. And so code-based tools such as pandas allow you to do that. If you want to, say, build a code base and share it with your colleagues, you can do that using Python and using pandas as well. If you want more robust analyses that feed your decision making processes of your business, pandas allows you to do that, and if you wanna do machine learning as well. And so the next course in the series, part three, next week, we’ll be focusing more on machine learning and it’ll give you an introduction to how to do that. And so if you’re a data scientist and if you’re a data scientist using Python, you’re likely going to be using pandas quite a bit. And so pandas is going to allow us to manipulate the data that we need so that we can feed it into machine learning models so that we can do visualizations, so that we can calculate reports, so that we can do all of those things. And so pandas is absolutely an essential library for anything that you do as a data analyst or data scientist within Python. So that’s pandas. So at a high level, what is pandas? So it’s an open source library. So open source means that anybody has access to it. You can use it free of charge and you can see what the underlying source code is. So basically it’s a bunch of code that somebody else wrote for you and you can use that in order to manipulate data. So it’s used for data analysis, data manipulation, and it’s built on top of Python.
So it started back in 2008 and it was fully open sourced back in 2009. I’m not gonna go through each one of these different benefits of using pandas, but you can see that there’re quite a number of different functions or number of different features within pandas that can be incredibly helpful. And so we’re mainly gonna be interacting with this DataFrame object. So this would give you the data in a format that you might expect, right? And so it’s a tabular data format, somewhat similar to a spreadsheet, but it allows you to have a lot more powerful functions than you would find within a spreadsheet. So it has optimized tools for reading and writing data. It has tools for dealing with missing data. You can reshape it, you can pivot, you can index, you can do all sorts of really interesting, really helpful things. There’s also some time series functionality and there’s also some visualization functionality as well. And so we’ll be looking at many of those features today. And then if you’re interested, there’s a book here that you can check out as well. This is, just open it here. So this is “Python for Data Analysis.” This was written by West McKinney. West McKinney is the original creator of pandas. And there were a ton of helpful exercises and helpful walkthroughs here. So if you ever get stuck or if you want a deeper dive, I’d definitely recommend checking out this book.
Cool, so let’s get started with the COVID-19 dataset. And so this dataset is gonna be available for you through Databricks datasets. So this is just a number of different datasets that we host for you so that you can more easily test out different code, look at different features. So I can use this % sh ls and then have this pack here.
And so if you’re familiar with Jupyter Notebooks, you might be more familiar with an exclamation point when you’re doing a command like this. But just allows me to take a look at what’s actually within this file system. And so if I wanna navigate this file system, I can see that within Databricks datasets, there’s this COVID folder and then I have a number of different sub-directories within here. So this is the sub-directory we’re gonna be using that I wanna navigate that and see what exactly is in it. I can just paste that here and take a closer look at what’s actually within this. So under the CSSE COVID-19 data, you can see that we’re gonna be using these daily reports. And so if I call % sh ls on that directory, you can see that we have a ton of different CSV files here. So it looks like this starts at around January 22nd and it goes up as recently as 4/11. And so we’re gonna be using primarily this CSV file here. Though towards the end we’ll combine all of these CSV files together so that we can see daily trends.
So if we wanna see the first few lines of this dataset, we can use % sh again and then head on it. And so % sh is just allowing us to run bash commands. And so bash commands are something, if you’ve ever worked in a terminal window, these are the same commands that you would run there. But this just allows us to have really granular access to what’s going on within this cluster that we have available for us. So if we take a look at the first few lines by using head, we can see that we have a number of different columns here. We have this FIPS column, we have admin 2, province, state, country_region, et cetera. And so we can go through what exactly those meaning in just a moment.
And so these were all what are called bash commands. This is actually running Python code. So here we’re going to import pandas. So like I said, pandas is an open source library that is made available to you. So it’s a bunch of code that somebody else wrote that you will have available to you. This is what’s called an alias. And so we’re gonna be importing the pandas library. And then when we use this as pd, this is what you will see time and time again in data science code. So this is the convention that we use in order to access pandas. So now we can call something like pd.reedcsv and we can use the reed CSV functionality from pandas. So let’s go ahead and execute this command. And just for reference, by pressing Control + Enter or Command + Enter, depending on what type of system you’re using, that’s how you can execute these cells. You can also click over here and click run cell here too if you prefer to do that.
So I call pd.reedcsv and I pass in this long path that gives me access to the daily report from 4/11-2020. And so here we can take a look at what’s actually within this file. So first we have this index, so this is just the index for a given row across this file. Then we have this FIPS number. So this is a specific convention for identifying different areas. So this is a, I believe it’s a US federal convention for it. It’s not really that important for us. The admin 2 column, if the area is within the US, this is going to be the county within the US. And so we’re gonna see some no values for other countries that don’t have a admin 2 value. And province_state, if it’s within the US, this is gonna be the state within the US. If it’s not within the US, it’s going to be the province or region. And we’ll take a closer look at that in a moment. We have the country or region, which is what you would expect. The last update, we’re not really gonna be using that, but that’s when this data was updated. We have latitude, we have longitude, and then we have the actual core data that we’re going to be working with. So we have confirmed, which is the number of confirmed cases. We have deaths, which is the number of the number of people recovered, be somehow careful with this column because not all countries and areas are consistently reporting recoveries from COVID. And then we have active cases as well. And finally we have this combined key. And what the combined key does is just combine a number of these different values. So it combines admin 2 with province_state, with country_region.
Great, so I’m gonna scroll through this and you can see that we have 12 columns. And within this file, we have just shy of 3000 rows.
So before, you saw that we were just executing one line of code in a given cell, we can combine multiple lines of code easily enough. And so I’m gonna rerun this. And so it’s gonna run the import for pandas and then it’s going to save this CSV file as df. And so this gives us an object so we can continue to play around with this. We don’t always have to be reimporting this dataset. And so now we have this df variable and we have this dataset saved in memory, and so we can easily access it and execute code against it.
Great, so first let’s talk a little bit about summarizing the data. So these are some of the basic pandas operations that you’re gonna be using time and time again. And so whether you’re a data analyst, a data scientist, whatever the case may be, one thing that I’ve noticed a lot of individuals do with quite a bit of success, is they have a basic walkthrough for summarizing new datasets as they interact with them. And so oftentimes they’re going through and you’re calculating the same summary statistics anytime you see a new dataset. And so it’s really helpful to have a good template for doing what we would call exploratory data analysis. So if you’re a data scientist and you’re trying to build some sort of data science solution, first you’re doing exploratory data analysis just to figure out what’s going on within your dataset. And then that process becomes more robust. And then eventually you apply machine learning to the process. And then some magic ensures and your businesses are automatically more successful. It’s something like that, but okay. So let’s talk a little bit about tab completion. So if I uncomment this code, so this is a comment in Python. Basically, it just indicates the Python that you’re going to ignore whatever comes after it. If I uncomment that code, and I call df., and then hit Tab, I should see the different methods that are available to me. So recall that this is a pandas DataFrame. We use pd.csv in order to import that DataFrame. And when I call .tab, as long as there are resources available on my cluster, I’ll be able to see the different methods that I have associated with it. So for instance, I can call df.abs. And so that’s gonna give me the absolute value in a code. Df.active, not sure what that is, but might be helpful in some case. I can call df.admin2, and that’s probably gonna give me the admin 2 column. But you can see that there’re a number of different methods available here. So this is all code that’s been written for me, and then I can access by using pandas.
So we can take a look at how that works. But first, let’s take a look at this df.admin. If I ever get stuck and I need to look at the actual documentation to get a sense for what’s going on, I can just use this help command. And so if I call help on df.head, it’ll tell me a little bit about what this does. And so here, I can see that I have this class DataFrame and we have a number of different parameters associated with it. And so it gives me some examples of how I can use it as well.
And so it’ll tell me the different methods associated with it and give me some details on it. And so here, we can see that there’s quite a bit of information here and that’s because pandas is a very robust library. There’s a lot of functionality available within it. Cool, so if I just call this this a df.head without help around it, here I can get a sense for the first few lines of the dataset. And so this will give me the first five rows of the dataset. That way I don’t have to keep on printing out the whole thing over and over and over again. So here I can get a sense for the first five lines, looks like these are all areas within US. And then if I want the inverse of that, I can just call tail. So df.tail is going to give me the last few lines of it. And so by default, df.tail will give me five lines, just like df.head will, but if I wanna shorten that or lengthen that, I can add a parameter here. And so these arguments rather are how I start to customize my code. And so here I can pass a value, and then in the case of head or in the case of tail, this indicates how many rows I actually wanna be able to see.
And so if I wanna know how many records are total, I haven’t total my dataset, I can call it this df.shape. And so this will give me the total shape of the dataset. And so the first value here is the number of rows that I have. So I have 2,966 rows. And the second value here is how wide my dataset is. This is how many columns I have. And so in data science and data analysis, normally you refer to this as features rather than columns because features highlight the fact that you are going to be putting this into some sort of model in the future.
And so now there’re a ton of different ways that I can summarize my data. And so one thing that’s worth mentioning is if I do command/, I can comment an uncommon code pretty easily. And then on a Windows machine, that might be Control/ but I play around with it a little bit to confirm that. So if I just call df.sum, this is going to give me the column-wise sums across my dataset. So for FIPS, that’s that number, that convention for regions. You can see that it’s summed up that column and I get some astronomical number. So this is 8.3 times tenth seventh. So that’s a huge number for country_region. It’s adding together all the USUSUSs, that’s not super helpful. Last update time, it’s adding together all those strings. So that’s not super helpful. Lat and long, that’s not very helpful either. But these numbers are helpful. So now it gives me the sum of the confirmed cases, the numbers of deaths, the number of recovered, and the number of active. And so this is quite helpful. So this gives me a sense that, at this point in time, in their mind that this was on the 11th, so this was maybe four or five days ago, the number of confirmed cases it looks like it was, what is that, 1.77 million cases. And I know today it’s a little bit higher than that, actually quite a bit higher than that. But this gives me a sense for generally speaking, what are some of the summary statistics that allow me to get a sense for what’s going on within this dataset?
So if I wanna look at the minimum values, I can do that as well. So I can see that the lowest FIPS number is 66, I can see the lowest latitude and longitude, if that’s helpful for me. And the lowest number of confirmed cases, zero deaths, zero recovered, zero active, zero. And so across this dataset, you can see that there’re countries or areas here that do have zero confirmed cases. And it’s worth mentioning that this is column-wise minimum values. And so it could be that one country has zero confirmed cases but a different country has zero deaths. And this country is not necessarily Afghanistan. However, Afghanistan is from a string perspective, right, ’cause this is a string value. This is going to be the minimum value.
So I can take a look at the maximum values as well. And so here I can see that the maximum confirmed cases is 163,000, and we’ll go into a little bit more detail with that in a second. If I just do a count, this will give me a count of all of the non-empty values here within my dataset. So now I can see that these counts are going to look about as long as, or these are gonna be reflective of the number of rows that I have in my dataset.
And I can look at average values as well. And so average values will tell me, let’s see, on average how many deaths, how many recovered, how many active for different areas within my dataset. Now it’s worth bearing in mind that if, within the US, these are aggregated at a county level. And so some of this information might look a little bit off because it’s not all done on a country level. So we’ll handle that issue in a second. But finally I can look at the standard deviation. And so the standard deviation is going to give me an idea of the spread of my data. And so how much is this data spread out across it’s mean?
Great, so that’s one way of calculating these values on a column-based state or on a column basis. If I call this df.describe, this will do many of these operations for us. And so here I can look at, across these different numerical values. So describe is going to ignore the non numerical values. Across these different numerical values, I can see the counter values, the mean, the standard deviation, the minimum, the maximum. And then these are my different quartiles. So the 50th percentile, this is gonna be my median my 25th percentile and my 75th percentile. So this gives me a with my data. So how much has it spread out (mumbles) Are most of the values generally right around the mean or is a lot of my data are farther away from the mean? So that’s what standard deviation in these percentiles is gonna tell me. So I can get a sense for what the average number of case is. So the average number of cases is gonna be around 598. And so bear in mind, this is gonna be county level data within the US, so that’s gonna off these numbers a little bit and we’ll handle that in a second. Cool, so now let’s go on to slicing and munging data. And so this will give us a sense for how to do some more robust manipulations around it.
So if I just wanna pull out the confirmed cases, I can use this syntax here. So this is pandas specific syntax. If you’ve used Python in the past, you might recognize these square brackets for lists within Python, but generally speaking, you don’t use this syntax a lot elsewhere. This is pretty specific to pandas, but it will become it a lot more natural the more you use it. So if I just call df of confirmed here, I can get a sense for that one individual column. And so here I can just see the individual values from df confirmed. And then if I wanted to, if I wanted to shorten this, I could call .head on this as well. And so one thing that’s helpful to know about pandas is it’s very common to use what’s called method chaining. And so here I index, I grab this column and then I add a method to it. And so it’s common to go through and chain a number of different methods together. And so here I can start to use these individual components and make a slightly more complex analysis based upon it. And so you will see the syntax quite a bit. And you’ll see as we go on in this notebook, this will become more and more complex.
Well, cool. If I want to take a look at the number or the columns that I have, I can take a look at it that way. And then if I wanna index multiple columns, I can do it this way instead. And so here I’m using those same square brackets, but I’m passing in a list. And so here, I have double square brackets and that just indicates the pandas that I want multiple columns here.
And if I wanna add a new column, I can do it this way. So I can call df.date, date doesn’t exist here, but I can pass in this column. I’m gonna use this date time object. I won’t go into too much detail about it, but just know that this allows me to have a specific formatting specific to a date. And then if I take a look at the first few values of it, it looks something like this. So here I can create new columns from either old columns. I can say take two columns and add them together, or I can create a new column and have it all be the same value. In this case, it’s a daytime.
And so that covers how to index on a column level. If I wanna index on a row level, I can do something like this. And so I can call df.loc, L-O-C, so that just gives me the location. And here, the first value I pass in is the specific rows that I want. In the second value, this should look familiar as the columns that I want. And so this is gonna evaluate to be just about the same thing as this. And so here, it says I wanna go from the zeroth value. And so Python is zero indexed. And so that means, if you’re indexing into something, you start with zero rather than one. So I tell it that I wanna go from the zero to the 10th value. And so that’s how I pull out those values here. And so this is gonna be the same thing as this first line. And so this gives me a sense of a how to slice on a horizontal and a vertical or from horizontal and vertical perspective. And if I just wanna pull out the first column and the first row, I can do something like this. And so that’ll give me the first FIPS number that appears within the dataset.
Okay, so I talked a little bit about how the regions are a little bit different across this dataset. So if I wanna get a sense for what that actually looks like, I can call this .value_counts. So on the country_region, so we’re on that one column, I’m gonna call it .value_counts and it’s just gonna tell me exactly how many regions I have for that country. So in the US, you can see that I have about 2,700 and that’s because we have county level information within the US. And so within the US, we have about 2,700 counties that are represented here. On China, I believe these are provinces, Canada as well, UK, France, et cetera. And so you can see that for a handful of these countries, I have more granular data than for other countries. And then for countries like Tunisia, for Chile, for the UAE, for Fiji, here I only have data available on the country level.
And so what’s FIPS? We talked a little bit about what it is, but we don’t really care. FIPS is not something that we necessarily wanna use. And so if we wanna drop that column, we can use the df.drop and then we can call FIPS and we can give it an access. And so this access just indicates that we wanna drop the column rather than dropping the row. And so it’ll look for a row and that’s, we specify access is equal to one. And so we can run it like this. And then if we call df.columns, we should see that these columns are all the same as they were before, but we no longer have the FIPS number here. And so bear in mind that here, when I call it df.drop, I re save this to df. So I overwrote the original DataFrame with a new DataFrame, which is df but without the FIPS column.
Cool, and so now we can take a look at confirmed cases. So if I want to sort my DataFrame, I can call df.sort_values and then I can pass into column that I wanna sort by. And so in this case, I’m gonna sort by confirmed cases and I’m gonna set ascending to equal false. If I set this to be equal to true, and bear in mind that you always capitalize true or false, if I set this to be equal to true, you could see that I have all of the zero cases first. But instead, I want this to be equal to false. So now you can see that I have the highest number of confirmed cases in Spain, second highest in Italy, then France, then Germany, and then US. And so this looks a little bit odd. And the reason why this looks odd is because once again, we have that country level granularity within the US. And so this is specifically within the New York City area. And so here you can see as 4/11, we had about 98,000 confirmed cases within New York City.
Okay, so if we just wanna look at what’s going on within the US, we can call something like this. So we’ll call it df[Country_region] == US.
So if I run this, this will only give me the DataFrame where the country_region is equal to US.
And so here I can always just call .head on this to make it a little bit shorter and have to do a little bit less scrolling at the end of the day.
Okay, so this is how I filter a column-based upon one value. What happens if I want to filter a column based on multiple different values? So if I wanna do something like that, I can call it df[Country_region] ==US, I’m gonna throw all these in parentheses and then I’m gonna add this 10% and then I’m gonna add province_state is equal to California and admin 2 is equal to San Francisco. And so this is where I’m based out of right now. So I live in San Francisco County, which is the name of the county is the same as the name of the city. And so here as a 4/11, you can see that there were 857 confirmed cases, 13 deaths, and the recovered number, I don’t really trust because I’m not convinced that all of these different regions are reporting recovery numbers in the same way. But generally speaking, I can trust the confirmed number and the death numbers.
Cool, so that gives me a sense for how I can filter based upon the multiple values. And you can play around with this. You can change this in order to match the area that you’re in.
Okay, so now what if I wanna answer different questions? For instance, what country has the greatest number of confirmed cases? So in that case, I wanna be able to group my data by the country and then perform some sort of operation on it. So I can call df.agroupby and I can group by country or region. And what this returns is this DataFrame group object. And so all we really need to know about this is this is a specific type of object and is returning to us, but it really needs some sort of other additional method on top of that. And so here I’m gonna do df.group on it, I’m gonna have that same country_region, then I’m just gonna pull out the confirmed cases. I’m gonna call .sum on it, and then I’m gonna sort the values. And so this is just that chaining that I talked about before. And so sometimes this can be a little bit challenging to read, but when you’re writing this code, normally you start by something like this. So you might start by saying df.groupby, and then you wanna do confirmed, so it just gives you the confirmed values. And then maybe you do this sum on top of this, and then you see that I get sum across all of these different values and then you would change sort values on top of that. And so when you’re coding, you’re normally doing this in an iterative process. It’s not like you’re writing all of this code at once. And so when you see these longer chains of methods, never be intimidated by it. The person who wrote this originally was writing one method at a time, usually executing it, making sure it looks like they were expecting it, and then going on to the next one.
Cool, so now that I’ve grouped by country or region and looked at the confirmed cases and sum them, it looks like the US has the highest number of confirmed cases, number two is Spain, number three is Italy, and France, Germany, and China. And so it’s like it’s the US first, then a number of European countries, and then China after that.
And so if I wanna ask the question, which US states have the most cases? I can start to chain up what we’ve done before. So here I’m gonna filter by country or region is equal to US. I’m gonna group by province_state.
I’m gonna pull up the confirmed values, do a sum, and then sort values.
And so here, this gives me state level information from the United States. So I see New York has the highest number of confirmed cases, followed by New Jersey and Michigan, Massachusetts, et cetera.
Great, so now I wanna move on to talk a little bit about missing data, duplicate values, that sort of thing. So this is the reality of doing any sort of data analysis work or data science work, is that you’re always dealing with outliers, no values, that sort of thing. And so just being able to do this in a pretty efficient way allows you to avoid spending too much time doing that. So I can always call this df.isnull and this will return trues where a value is null and false otherwise. And so the way that I would read this is I look at, say, the admin 2 column here and see that I have at least four different no values within the tail here. And then country_region, it looks like all of these are null no values. So if I just call sum on that, it’ll give me column-wise number of no values that I have. So a lot of these are missing admin 2 values. That’s because we don’t have county level information for a lot of this data. And you’ll see some of the other missing values as well, including lat long for 59 different records I have.
So now let’s see how many different unique countries we have. And so here I can call df[country_region.] I can call .unique and then I can call .shape afterwards. And then I can see that I have about 185 different unique countries here. If I wanted to do the same thing, I can always just call drop duplicates as well, and this will give me the unique values as well.
And then finally, I can always use this df.fillna and then I can pass it in some sort of value to fill with. And so here I call it df.fillna and I pass in no data available. So now you can see where I did have missing values. Suddenly I had the string instead. And so this might be helpful if you wanna do what’s called imputing, where you fill in no values with some sort of standard value. Cool, so the last thing I wanted to cover was visualization. And so within languages like Python or R, you have really, really robust visualization libraries. And so what we’re gonna be using here is what’s available within core pandas, but there’re many other libraries that are available to you as well. And so a lot of pythonists start with matplotlib. This is the main plotting library within Python. There’re also at our library such as Seaborn and Pyplot and a number of other ones. And so these are largely open source libraries and so you can use these in quite the same way. But I’m just gonna run this, import for mapplotlib and then call this % mapplotlib in line. All of this does is indicate to Databricks that we wanna render these images for me. And so we wanna make sure we show whatever’s there. So I’m gonna be looking at just the US-based data. So I’m gonna go ahead and create a subset of my original DataFrame. I’m gonna call it us_subset_df and then I’m just gonna pass in this code that you saw before, df[country_region] == US. Cool, so now I have this subset. So now we can start to ask different questions about, say, the number of distribution of deaths by US states and territories. And so oftentimes when we talk about distributions, we’re talking about for deaths. What generally speaking is the number of different values that we have for different states. And so here we can see that we’re, sorry, let me take a step back. So the way that this is operating is we’re going to bend together the number of different debts that we have. So say all of the deaths between zero and 50, we’re gonna put together in the same bin and we’re gonna visualize it this way. And so that way we can get a sense for are there any outliers within my dataset? And so if we take a look at this one, here we can see that there’s one outlier out here, but normally we have much lower number of deaths. But if we want a little bit more granularity, we can change this bins parameter within this histogram. And so here I’m just gonna go through, do a group by, sum of the deaths. So this will give me deaths by state. And then I’m going to to call .hist in order to get a histogram. And then PATS in this bins parameter. And so if I execute this, you can see that I have a histogram that has a little bit more granularity. So you can see that I have one outlier way out here. This is likely gonna be New York. And then you can see the majority of the states have much lower death accounts.
So if I wanna see how confirmed cases relate to deaths, I can use that DataFrame called .plot and then call .scatter. Then I just need to define what’s on the X-axis and what’s on the Y-axis. So this allows me to visualize confirmed versus deaths. And you can see that there seems to be some sort of correlation between these two and then I have this outlier way out here. If I wanna remove that outlier, I can call this us_subset_tf. I can call any deaths less than a 1,000. And then I can call that same code. And so this allows me to clot those same values, but without that outlier. And so here you can see once again, that correlation between confirmed cases and deaths.
And so I’m gonna gloss over this code ’cause I know we’re a little bit short on time, but all this is allowing us to do is take all of those different individual CSV files and combine them all together. So I’m just going through each one of these. If you were with us last week, this (mumbles) should look somewhat familiar. But you’re reading in that file name, you’re changing the columns a little bit. And the reason for this is not all the CSVs line up perfectly. And so this is just some code to clean that up a little bit. I’m adding in the date and then I’m combining all of these together. And so basically I combine all of these different data frames within this list and then I can catenate them together at the end. And so this just gives me data across the entirety or all of those days that I have available to me. And so it’s gonna look something like this. I dropped out some columns as well because of the issue with columns not matching up. But this will give us the core data that we’re looking for. So now we can start to ask questions about how the disease spread over time. So if we group this information by date and look at confirmed cases, so we want the sum of confirmed cases and then we wanna call .plot on this. And so this gives us a sense for the increase of cases over time. So the data goes back to January, and you can see this increase. And so I added this title here. The title is gonna appear at the top of the plot and then I added this rotation as well. So this rotation is for the X-axis. If I didn’t add this rotation, all of these numbers for dates would all overlap with one another. So I added a 45 degree rotation just so I could see this a little bit better.
So now if I wanna break this down by the types of cases, I can group by date, I can pull up the confirmed death and recovery rates. I can call it .sum on it, so I have a sum of each one of those different values. And then I can call it .plot. And so I give it a title, I give it that same rotation and we can get a sense for the number of confirmed cases, the number of deaths, and the number of recoveries as well.
And so then if I wanna look at what is the growth within my country, I can start to look at it this way as well. And so this might look a little bit complicated, but basically I wrapped all of this within parenthesis so I can divide this code up over a couple of different lines. Then I subset it by country_region, province_state, admin 2. So this is the same code we saw before when we looked at San Francisco. I grouped by date, pull up confirmed deaths, recovered, call it .sum and call it .plot at the end. And so this gives me the specific results for San Francisco.
So you can see the number of confirmed cases is increasing somewhat linearly. And we see we have very low number of deaths and recoveries, relatively speaking. Cool, and so this final function, this just wraps, this line of code that we saw up here, it wraps it all up into one function. And so if you were with us last week, you’re familiar that you can (mumbles) this code up with (mumbles) to function and so that you can add your own arguments to this function. And so now I can just call plot my country and then I can see US New York, New York City, but I could add in different parameters for my specific state or region as well if I wanted to as well. And so you can play around with this and you can use this to customize the result for your own area. So I know that was a bit of a fire hose of pandas. So you’d have access to this notebook. You also have access to a lab as well. And so the lab is going to give you some starter code and it’ll just have you go through some plotting exercises so you can get a sense for how to actually visualize what’s going on in your area. So I want to make sure we had a few minutes at least for some questions. And so why don’t we go ahead and switch to that? So if you have any questions, feel free to include in the Q&A. And then I also ask the TAs to include some questions that came up with some frequency so that we could all address them together. – [Karen] Yeah, Conor, great job. There’re quite a few folks who are asking about pandas versus Koalas and also pandas versus Spark, if you wanna talk about the differences between single node and distributed computing. – Perfect, that sounds good. So the library we used today is pandas. Pandas is what we call a single node technology. And so when we run this DataFrame, when we run any pandas code, it’s always only working on one single machine. And so for those of you who are familiar with Spark, so Spark is a distributed computing system. And so if you have more data that can fit on any one machine, then you’re going to wanna use something like Spark rather than pandas because pandas works really well. Data scientists use it all the time, but it cannot scale beyond one single machine. And so when you use something like Spark, it allows you to distribute that. And so Spark also uses a DataFrame, but that DataFrame is actually, you have a little bit of your data sitting on different nodes across your cluster. And so Brooke mentioned Koalas as well. Koalas is a way of writing Python style code but against a Spark DataFrame rather than a pandas DataFrame. So it’s going to allow you to use this same syntax, but your code is gonna be able to scale across a cluster of machines. And so this will be, if you’re working on any sort of big data problem. And so if you have more data that can fit on any one machine and you don’t know how to work with it, then you’re gonna wanna look into technologies like Spark and Koalas.
– [Karen] Great, and then a few other questions about how do you add records to a DataFrame?
– Well, that’s a good question. So we looked a little bit about how to add columns to a DataFrame. Then there’re a number of different tools that are available for you. If we take a look at the pandas cheat sheet, we should have some indication of how to do this. So there’re a number of different ways that you can combine data. So you can look at these different reshaping functionalities, but one that you might wanna look into is this pd.concat. So this will concat two different DataFrames together. And so you can do this, you can stack them horizontally, you can stack them vertically. But these types of values are something you’re going to wanna look into. – [Karen] And then as a followup to the earlier PySpark versus Panda’s question, somebody’s asking about, is there any specific threshold when you’d use one over the other like 10,000 records, 100, 000 records, what are some of the considerations that you take into account? – Yeah, so the main thing you have to consider is how much memory you have available to you. So if you have eight gigabytes worth of memory on one machine, and the data that you have is eight gigabytes, you’re probably going to need to use some sort of distributed technology Spark or whatever the case may be. And so that’s one thing to consider. It’s more than size in memory rather than the overall number of records that you have. And bear in mind that if you have eight gigabytes of memory available to you and eight gigabytes worth of data, that’s likely not going to work because you still need a little bit of extra memory available to you to do any sort of operation. And so I would start to think about using something like Spark, once you’re on the order of maybe tens of millions of rows, but like I said,