With the COVID-19 pandemic affecting many aspects of our lives, understanding the data and models around COVID-19 is ever more crucial. Estimates of the potential number of cases shape the guidance around our policies (how many hospital ICU beds are needed, when to ease stay-at-home orders, when to open schools, etc.). In this session, we will focus on some exploratory data analysis to understand the accuracy of these models. We will then use machine learning models to improve them.
– Thanks for joining us today. In this session we’ll be talking about improving the IHME COVID model.
I’m Scott Black, and we also have Denny Lee with us today.
For the agenda, first we’ll talk about how to build a COVID-19 Data Lake and some of the challenges around doing that, and then I’ll speak about how we can use our COVID Data Lake to try to improve the predictions that we’re getting from the IHME model.
So first, a little bit of background about myself. I’ve been working in a pre-sales role for about the last 10 years, helping different organizations get value out of their data. Before that, I spent about another 10 years in the RDBMS world, working quite heavily in e-commerce and in the healthcare sector, and during that time I also helped contribute to several Oracle books. Now I’d like to introduce Denny.
– Hey, thanks very much Scott. Hi everybody, my name is Denny Lee and I’m a Staff Developer Advocate here at Databricks. I’ve been working with Spark since around version 0.5. Formerly I was a Senior Director of Data Science and Engineering at Concur, and also a Principal Program Manager at Microsoft, where I was one of the nine people who helped create Azure HDInsight, as well as the SQLCAT BI/DW Lead. So, let’s get started.
– So let’s start by trying to understand the Johns Hopkins dataset that we’re gonna be using to better understand the IHME model and try to improve it. This graph is a visualization of COVID-19 cases by county per 100,000 people between March 22nd and April 14th of 2020. As you can see from the animation, there was rapid, exponential growth of COVID-19 cases in the United States. If you wanna recreate this, there’s a link in this notebook to another notebook that will let you recreate this visualization of the Johns Hopkins dataset.
Now what’s interesting about the Johns Hopkins dataset is that its schema actually changes over time. It is a very, very useful dataset for understanding COVID-19 cases. But as you’ll notice, for example, here on the 21st there are, let’s see, eight columns available for you to process.
If I was to look at the next day, the 22nd, you’ll notice that there are 13 columns. And so the first notebook that we’re showing here, which we will also provide to you, is a data prep notebook that takes into account the fact that the schema of the data changes over time.
So if I skip ahead to this particular code snippet, you’ll notice that that’s exactly what’s going on here. There were three different schemas when we originally wrote this notebook, on April 8th, 2020; in fact, a fourth schema change happened recently, around the time of this session. Nevertheless, we’ve listed out the different schemas based on the timeframe that we’re working with. And when the processing in this notebook kicks in, you’ll notice the different schemas for the different dates: before March 1st there’s one set of columns, from March 1st to March 22nd there’s another set, and from March 22nd onwards another.
Now, what we did is we created a Delta Lake table so we could keep track of not just the information that was recorded, but also any optimizations that we wanted to do. And if we did want to do any deletions or updates to the table, we had a transaction log to keep track of them for us. This is very handy, especially when you’re dealing with schema changes. So for example, here are some examples of what we had done to the table.
The other thing we did, as you noticed from the previous notebook, is include cases per 100,000 people, so we needed the 2019 population estimates. We downloaded those from the US Census and created state and county tables similar to the one you’re seeing here right now. That allows us to provide not only the overall case numbers as per the Johns Hopkins dataset, but also the cases per 100,000 people, so you can account for differences in population.
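
The date-based schema handling and the per-100,000 normalization described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names, the helpers `schema_for`, `normalize`, and `per_100k`, and the start date of the first schema are all assumptions, since the talk gives only the cut-over dates (March 1st and March 22nd, 2020).

```python
from datetime import date

# Hypothetical column lists: the talk gives the cut-over dates but not the
# exact columns, so these names are placeholders for illustration only.
SCHEMAS = [
    # (first date the schema applies, column names)
    (date(2020, 1, 22), ["state", "country", "last_update", "confirmed", "deaths"]),
    (date(2020, 3, 1), ["state", "country", "last_update", "confirmed", "deaths",
                        "lat", "lon"]),
    (date(2020, 3, 22), ["fips", "county", "state", "country", "last_update",
                         "lat", "lon", "confirmed", "deaths"]),
]


def schema_for(file_date):
    """Return the column list in effect on file_date (latest start <= file_date)."""
    chosen = None
    for start, columns in SCHEMAS:
        if file_date >= start:
            chosen = columns
    if chosen is None:
        raise ValueError(f"no schema known for {file_date}")
    return chosen


def normalize(row, file_date, target=("state", "confirmed", "deaths")):
    """Map a raw row (list of values) onto a fixed set of target columns."""
    record = dict(zip(schema_for(file_date), row))
    # Columns missing from an older schema come back as None.
    return {name: record.get(name) for name in target}


def per_100k(cases, population):
    """Cases per 100,000 residents."""
    return cases * 100_000 / population
```

Normalizing every daily file onto one target column set before writing is one way to keep a single table across schema changes; other approaches (for example, schema evolution in the storage layer) would also work.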
So naturally, I’m gonna want to explore the data as well. The second notebook we have here is purely EDA, exploratory data analysis. I’m gonna skip ahead again; we’ve already read the Delta Lake table that has this information. You’ll notice that we specifically want to look at the case ratio. Now what is the case ratio? If you take a look at the SQL that I’ve written here, it computes a seven-day average for each day, and then compares the seven-day average for one day with the seven-day average for the previous day. Over time, you’re hoping that the current day’s seven-day average is less than the previous day’s seven-day average, okay?
So let’s take a look. We’ve already run the query, which shows the case ratios for a specific set of states: Washington, Utah, New York and California. This goes from May 11th to May 25th, and you’ll notice that generally it’s going down. But I wanna focus a little bit on Washington State, because there’s a nice dip here and then we jump back up, and I’m wondering, based on the case ratio, is that a cause for concern for my state? I’m from Seattle, okay?
So let me dive a little bit into the initiative by Washington State called Safe Start Washington. It’s very similar to the US government’s own CDC requirements. The idea is a four-phased approach to going back to normal, so to speak, when it comes to opening restaurants, recreational facilities and things of that nature, assuming the number of COVID cases actually stays down or at least plateaus. And you’ll notice that the counties with a higher population, which here are Snohomish, King and Pierce, are still in phase one, while the more rural counties are already in phase two. Because they have a lower number of COVID-19 cases, they’re able to move through that phased approach faster, which makes a lot of sense. Note the county on the right side here, Pend Oreille; we’re gonna come back to it in a second, okay?
So let me do some additional exploratory data analysis. I’m just gonna focus on a few counties: King, Snohomish and Pierce, which are the three densest counties, and some of the more rural counties with less density, like Lincoln, Garfield, Kittitas, Yakima, and Pend Oreille. Over time, all of the case ratios are basically dropping, which is great to see, except for Pend Oreille County. That’s a little weird.
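
As a rough sketch of the case ratio described earlier: the talk computes it in SQL, and the exact query is not shown, so the version below is an assumption. It takes daily new case counts, averages them over a trailing seven-day window, and divides today's average by yesterday's. The function names and the sample counts are made up for illustration.

```python
def seven_day_average(daily_new_cases, day):
    """Mean of new cases over the 7 days ending at index `day` (inclusive)."""
    if day < 6:
        raise ValueError("need at least 7 days of history")
    window = daily_new_cases[day - 6 : day + 1]
    return sum(window) / 7


def case_ratio(daily_new_cases, day):
    """Today's 7-day average divided by yesterday's 7-day average.

    Values below 1.0 mean the smoothed case count is falling.
    Returns None when yesterday's average is zero (ratio undefined).
    """
    today = seven_day_average(daily_new_cases, day)
    yesterday = seven_day_average(daily_new_cases, day - 1)
    if yesterday == 0:
        return None
    return today / yesterday


# Made-up daily counts, for illustration only (not Johns Hopkins data).
new_cases = [10, 12, 9, 11, 10, 8, 10, 3, 17]
print(case_ratio(new_cases, 7))  # a low day enters the window: ratio below 1
print(case_ratio(new_cases, 8))  # a high day enters the window: ratio above 1
```

With very small counts, a single additional case can move this ratio sharply, so it is worth checking the underlying numbers before reading much into a bump.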
I’m a little concerned. So that explains the case ratio bump that you just saw in the previous chart. But let’s go look at the data itself. If I run this particular query and look at the actual case numbers, you’ll notice what actually happens: on the 23rd there are two cases, and then boom, on the 24th and 25th there are three cases, so it jumped up a little bit, okay? That’s not great, obviously. But what you’ll notice from doing EDA is that, in reality, that little jump in case ratio isn’t as bad as it looked, okay? And fortunately, there are no deaths in that county, so that’s good.
So now let’s continue onwards and focus a little bit on the IHME model. Now, there are different COVID-19 models. These are known as SIR models that are specific to, well, COVID-19. Here’s a select few from the FiveThirtyEight article “Where the Latest COVID-19 Models Think We’re Headed and Why They Disagree,” on the projection of US fatalities. There are a number of different models, like IHME, Columbia University, Northeastern and so forth. What you’ll notice is that each has its projected values, but there are also the shaded regions to the right: the further out in time you go, the more variance there’s gonna be, because we can’t be sure what the overall numbers are gonna look like. These models all take into account a lot of different factors, whether it’s when the schools shut down, what phase an area is in, when restaurants shut down, limited travel, and so forth. Each of these models tries to take these different pieces of information into account to see if it can more precisely predict the number of fatalities over time, okay? And obviously we’re hoping for them to drop.
In the case of the IHME model, this one is a specific model from the University of Washington, here in Seattle.
All right, so you can learn more about it on their website here, and I’m just gonna open that up. Here’s how they look at it, and I’ll focus on a particular state; I’ll just happen to choose Washington State again, right. Here are its projections, here are its daily deaths. And what you’ll notice is that it also includes infections and testing, hospital resource utilization, and also social distancing factors, like educational facilities closed, gathering restrictions, stay-at-home orders, business closures, non-essential businesses closed, and how they ultimately impact the mobility of folks, which also impacts the spread of COVID-19.
Nevertheless, let’s take a look at the models real quick, okay? If you look at the IHME models here, note that you can go to the website directly to get the models, but we also store them in Databricks datasets. So if you go there, you’ll notice the models are stored in Databricks datasets under COVID IHME, and a bunch of them are already listed, going back to March 27th in fact. From here, we’re gonna create multiple DataFrames. In this case, we’re just gonna focus on the 28th of April and the 4th of May. The 4th of May is important because there was a massive change to the model on that day, all right?
And so when we run this, you’ll notice the different models, okay? For the model from the 28th of April, if I look at May 6th, there’s a massive variance here. That makes sense because May 6th is well after 4/28 itself: there’s less variance close to when the model was created, and a week and a half out there’s greater variance, okay? So the actual number of deaths is 870 and the predicted deaths are around 849, or 848, but the lower bound is 794 while the upper bound is 1,020, so there’s a lot of variance going on. If I do the same thing for the 5/4 model, remember that this model was created on 5/4 and we’re again looking at 5/6, so the variance is a lot smaller, as you can tell, which is great. Upper bound 935, predicted deaths 920, lower bound 911. That makes sense because it’s closer to the date. But you’ll also notice that the death count is actually off: the lower bound, the prediction and the upper bound are all above the actual death count. In some ways that’s good, and the variance is a lot smaller in comparison, even when you go a little bit farther out.
But it’s different for each state. For example, if I choose Florida here instead and rerun everything from here (like I said, we’ll provide you these notebooks, and there’s a little tidbit that says when you change the widget, you can just rerun the cells from here), you’ll notice the same concept when we compare the two models on May 6th. The predicted and actual deaths are relatively close, 1,478 and 1,539, but the range between the bounds is pretty massive, from 1,213 to 2,195, okay? If I look at the later model here, you’ll notice it’s a lot closer in comparison. So again, the variance is a lot smaller, but the predicted deaths and the actual deaths are closer together too. So different states are gonna act slightly differently, but what you’ll notice is that there’s probably room for improvement in the model. And this is where Scott is gonna talk more about how he applies polynomial interpolation with linear regression to improve the models, by trying to predict the error of the models and then applying that error back to improve the overall numbers. And just like here, it will be different per state.
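
The Washington State comparison above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the talk for May 6th (taking 849 as the April 28th prediction); the `Projection` class and its field names are hypothetical and do not reflect the layout of the IHME files.

```python
from dataclasses import dataclass


@dataclass
class Projection:
    model_date: str  # date the model was published
    lower: float     # lower bound of projected total deaths
    mean: float      # projected total deaths
    upper: float     # upper bound of projected total deaths

    def width(self):
        """Size of the uncertainty band."""
        return self.upper - self.lower

    def covers(self, actual):
        """True when the actual value falls inside the band."""
        return self.lower <= actual <= self.upper


# Washington State, projections for May 6th as quoted in the talk.
actual_deaths = 870
apr28 = Projection("2020-04-28", lower=794, mean=849, upper=1020)
may04 = Projection("2020-05-04", lower=911, mean=920, upper=935)

for p in (apr28, may04):
    print(p.model_date,
          "band width:", p.width(),
          "covers actual:", p.covers(actual_deaths),
          "prediction minus actual:", p.mean - actual_deaths)
```

The later model has a much narrower band (24 versus 226), but the actual count of 870 falls outside it, which is the trade-off pointed out above.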
– So now let’s see if we can try to improve the IHME model. How we’re gonna do that is by taking the difference between the predicted total deaths and the actual deaths, and training a regression model on that difference to see if we can predict the error for future days. So again, we’re not trying to predict the actual total deaths; what we’re trying to do is predict the error between what the IHME model gives us and what actually occurs.
So, a quick background on regression models. Each one of these points represents a training point, and the idea behind a linear regression model is that we draw a straight line in such a way that it gets close to as many of those points as possible. You can see here that the line only gets close to a few of them, whereas there are other points that are still pretty far away from the line. With a polynomial model, we introduce ways to make that line bend and curve. Generally, the higher the degree, the more flexible the line is in changing direction to get closer to our training dataset. So we’re gonna see if we can use the same method to improve the predictions. First we set up our state and our degree, which lets us flip quickly between states and between a linear regression and a polynomial model.
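
To make the role of the degree concrete, here is a minimal least-squares polynomial fit in plain Python, where degree 1 gives a straight line and degree 2 allows one bend. It is a sketch under stated assumptions: the notebook presumably uses a library for this, the helper names `polyfit`, `predict`, and `sse` are hypothetical, and the sample points are invented.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ...] for c0 + c1*x + c2*x**2 + ...
    Assumes at least degree + 1 distinct x values.
    """
    n = degree + 1
    # Augmented matrix of the normal equations (A^T A) c = A^T y.
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def sse(coeffs, xs, ys):
    """Sum of squared errors of the fit on the given points."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))


# Invented training points, for illustration only.
xs = [0, 1, 2, 3]
ys = [0.0, -1.0, -3.0, -7.0]
line = polyfit(xs, ys, degree=1)
quad = polyfit(xs, ys, degree=2)
print("training SSE:", sse(line, xs, ys), sse(quad, xs, ys))
print("value at day 17:", predict(line, 17), predict(quad, 17))
```

The higher degree can never fit the training points worse, but its extrapolated values grow with the square of the distance from the training window, so a better training fit says little about how it behaves further out.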
What we do is take the predictions as of April 21st and use four training days, the 22nd to the 25th, as our training dataset. After that, we take an additional 14 days, and that is what we’re going to try to make predictions on. So we read in our IHME data and combine it with our Johns Hopkins dataset, and that allows us to create our training dataset. You can see that on April 22nd the model predicted 910 deaths and there were actually 893, so it overpredicted by 17. On the next day, 952 was the predicted deaths but it was actually 987, so it underpredicted by 35. And on the next two days, it continued to underpredict. So this is what we’re going to train our model on. Again, we’re not gonna train it on deaths or predictions; we’re going to train it to predict what this error on the deaths will be for the 26th, the 27th and so on.
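
The training target described above is simply the gap between the two datasets. A small sketch, using the two days quoted in the talk; the sign convention (predicted minus actual) and the function name `prediction_errors` are assumptions, and the notebook may define the error the other way around.

```python
# (date, IHME predicted total deaths, reported total deaths) as quoted in the talk.
rows = [
    ("2020-04-22", 910, 893),
    ("2020-04-23", 952, 987),
]


def prediction_errors(rows):
    """Return (day_index, error) pairs where error = predicted - actual.

    Positive means the model overpredicted; negative means it underpredicted.
    """
    return [(i, predicted - actual)
            for i, (_, predicted, actual) in enumerate(rows)]


print(prediction_errors(rows))  # [(0, 17), (1, -35)]
```

The day index, rather than the calendar date, is what a regression model would use as its input feature.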
So we take our data, wrap it in a Pandas DataFrame, and plot it. Here we have the plot of our training dataset. If a point is at zero, that means there was no error: the model predicted the exact number of deaths that occurred. Again, we can see that in the beginning it overpredicted, then it underpredicted, and it continued to underpredict. What we’re gonna do is try to draw a straight line, sloping either this way or that way, that gets as close to as many of those points as possible. So here we go: we set up our linear regression model, do our fitting, plot it, and there’s the line we get. You can see it doesn’t really go through any of our points, and in fact there’s still quite a bit of distance between some of our training points and the line.
So let’s take this and combine it with the next 14 days. We take the next 14 days, feed them to our model to make predictions, and there we go: we can see the predictions for those 14 days. Clearly, as we go out into the future, our model is predicting that the error will only continue to get larger and larger, to the point where, towards the end, we’re predicting that the original model will be off by over 225 deaths. We then combine our training dataset and our predictions to give a single consolidated view of both.
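
A minimal sketch of the fit-then-extend step described above, using the closed-form formulas for a straight-line fit instead of a library. The helper names are hypothetical. The first two training errors are the ones quoted in the talk (predicted minus actual); the last two are placeholders, since the talk only says the model kept underpredicting, so the printed projections are illustrative and will not match the notebook.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def project_errors(train_errors, horizon=14):
    """Fit a line to the training errors (one per day, starting at day 0)
    and extend it `horizon` days past the last training day."""
    days = list(range(len(train_errors)))
    intercept, slope = fit_line(days, train_errors)
    future_days = range(len(days), len(days) + horizon)
    return [(d, intercept + slope * d) for d in future_days]


# 17 and -35 come from the talk; -40 and -48 are placeholder values.
train_errors = [17, -35, -40, -48]
projected = project_errors(train_errors)
for day, err in projected:
    print(day, round(err, 1))
```

Because the fitted line has a constant slope, the projected error grows by the same amount every day, which is why the predicted gap keeps widening over the 14-day horizon.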
So now let’s combine this with the data from Johns Hopkins and see how far off we were. To do that, we have what we call our corrected predicted values. Basically, we take the predicted number of deaths for a given day and either add or subtract what we think the error is gonna be, and we call that our corrected predicted value. Then we lay the Johns Hopkins data on top of that to see how well we did.
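
The correction step can be sketched as below, assuming the error is defined as predicted minus actual, so that subtracting the forecast error from the original prediction gives the corrected value. The function names are hypothetical.

```python
def corrected_predictions(predicted_deaths, predicted_errors):
    """Subtract the forecast error (predicted - actual) from each original
    prediction to estimate the actual value."""
    if len(predicted_deaths) != len(predicted_errors):
        raise ValueError("inputs must have the same length")
    return [p - e for p, e in zip(predicted_deaths, predicted_errors)]


def absolute_errors(estimates, actuals):
    """Per-day distance between an estimate and the reported value."""
    return [abs(est - act) for est, act in zip(estimates, actuals)]


# Sanity check with the two days quoted in the talk: when the error is known
# exactly, the correction recovers the reported totals.
print(corrected_predictions([910, 952], [17, -35]))  # [893, 987]
```

In practice the errors passed in are forecasts from the regression model, so comparing `absolute_errors` for the original and the corrected values against the reported totals is what shows whether the correction helped.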
So you can see that right here in the beginning, the original predicted values were actually better. But as time goes on, our corrected values are closer. Again, the green line is the original predicted values, the orange line is the actual number of total deaths, and the blue line is our corrected deaths. As we get further out, on May 7th, our corrected value was off by only two, whereas the original prediction was off by almost 200. So obviously, our corrected predictions are much closer than the original predictions.
So let’s try this with a different state; let’s try Georgia. Let’s come back up here, rerun this for Georgia, and see what happens. We’re gonna skip down to our training dataset. You can see here that it’s actually very close: the model was correct on the first day of the training dataset, but as time goes on, it starts overpredicting more and more, so it’s a different type of error than what we saw in Florida. Again, we do the same thing: create our linear regression model, and there’s our line. This time the line is not too far off the training dataset. Combine that with our prediction values and we get a nice unified view, and that actually looks pretty good. Out at the 17th or 18th day, we’re predicting that the original predicted values are off by almost 500, actually close to 600.
So again, let’s see how well we do in this scenario. The blue line is our corrected predicted values, the orange line is what actually happened, and the green line is the original prediction. You can see our corrected predicted values were much better. But towards the end, that gap is widening, so even our corrected values are getting farther and farther away from what actually happened. So in this case, let’s switch to a second-degree polynomial and see if we can get even more accurate results.
So we come back up here, run it, and let’s see.
First, again, it’s the same training dataset. But you can see that this time, because we’re using a second-degree polynomial, we’ve started to get a little bit of a curve in our line. Our line was already pretty close before, but now it’s even closer. So let’s skip down and compare this with our unified view. Now that we’ve introduced some curvature into our model by doing a polynomial regression, you can see the predicted error start to accelerate quite quickly: before, towards the end, we were predicting it to be off by 500 to 600; now we’re predicting beyond 2,000.
So again, let’s come down here and see what that looks like when we overlay it with the Johns Hopkins data. We’re kind of close in the beginning, but it doesn’t take very long at all before our predicted values rapidly start to deviate from what’s actually happening, to the point where we start predicting negative deaths. So it’s an example where, even though the polynomial seemed to fit the training dataset better, just given the nature of how polynomials work, you can start to get some really crazy predictions, to the point where they don’t even make sense for what we’re trying to do here. And the linear regression model, even though it didn’t seem to fit the training dataset very well, actually performed far better than the polynomial in this case.
So thank you for your time. And now we’d like to open it up for any questions.
Scott is a Solution Architect with Databricks focusing on the public sector. He has an extensive background in database management and data engineering initially in e-commerce and healthcare. For the last 10 years he has focused on helping state and federal governments solve their toughest data challenges.
Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments. He also has a Masters of Biomedical Informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise Healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.