Would you like to generate meaningful insights with your geolocation data? Are you trying to run these queries at petabyte scale? Join this talk to understand how you can scale Esri’s geospatial expertise with Databricks.
In light of the global coronavirus pandemic in 2020, we will take a look at how we analyze movement data and determine the impact of human movement during these times. We will showcase several key technical concepts in our talk — dimensionality reduction with geoindexing, leveraging Delta Lake for geospatial query performance, and quantifying the risk introduced by human movement using a Human Movement Index.
At the end of this session, you will gain a better understanding of how you can derive insights into human movement at scale, which is a repeatable pattern that is highly applicable across industries.
– All right, welcome to Spark Summit. First of all, I guess welcome to this session. A couple of intros. My name is Jim Young. I’m a business development lead and partnership lead for Esri in our commercial sector. I’m based here in Portland, Oregon and with me virtually today we’ve got Joel McCune, who represents our GeoAI team and builds solutions that combine geography with artificial intelligence. So today we’re gonna talk to you about how we’ve combined the power of Databricks with Esri to build a COVID risk index based on human movement. So let’s get started and jump in.
So, we all know that location data’s everywhere. Whether you’re building a simple app that just finds coffee on a map or you’re trying to do something a little bit more sophisticated, like keeping a plane in the air, geography matters. A growing number of data scientists like you all are starting to see the value in applying that geographic lens onto your analysis, which is great. Because geography is really what brings context and understanding to your data, whether that’s through visualizing it on a map, whether that’s through building up location data to power explanatory variables, or even just bringing in contextual data to help put your data in perspective. It’s all geography and it’s sort of this additional lens.
And so although you don’t need a map to take advantage of geography, the map is a powerful metaphor for understanding because our brains evolutionarily are hardwired to understand complex 2D and 3D datasets spatially and physically. It’s very natural to consume data this way. So whether you’re trying to understand the impact of the weather on sales or trying to figure out where to put the next cell tower based on RF propagation or even how to plan a logistics suite, maps help you understand and decide and as Joel likes to say, the map is the original infographic.
– So when we look at this idea that geography matters, particularly in the context of a pandemic, the current crisis we have over COVID, COVID can’t be transmitted unless there’s person to person contact. With Jim currently sitting in Portland and myself sitting in Olympia, Washington, there can’t be a disease transmission unless we are physically in each other’s presence. Obviously this is the reason why social distancing works, but ultimately if we want to be able to understand how risky an area is, either in terms of going there or in terms of where people are coming from, we need to understand a couple of factors about this. And so, we’ve been looking at this and examining existing social distancing metrics, and one of the things that is challenging about this is many of these have been normalized. So what this means is it allows us to understand how well people are social distancing in, say, the middle of western Kansas, as compared to New York City. But ultimately from a risk perspective, those are two fundamentally different places because of population density and the number of interactions that are occurring. So when we look at wanting to quantify risk through the lens of a pandemic, we need to take into consideration a couple of different things. We want to calculate this risk index by taking into account the volume, the number of interactions, as well as the distance somebody has gone for that interaction, because the farther away it is, the higher the likelihood of connecting two geographies that would not otherwise be related.
– So when we started thinking about this risk index, our research showed that the current social distancing metrics don’t fully consider geography. In many cases, they’re normalized by population, which removes what Joel was describing as volume; it takes that population density out of the equation. And there’s also this metric of distance: the farther that someone travels, the riskier that is.
So if I go to downtown Portland, that’s sort of one level of risk. If I travel to São Paulo, influencing a population that is distinct and unique from my population, that connectivity represents a whole higher level of risk. And most models also don’t even cover this idea of significant group clusters. So we’ve been focused on building this risk index that considers distance and volume of people moving.
– So when we talk about these movement risk factors, there are two things we want to consider: both how far away something has happened and the volume in which that interaction is happening. What we’ve pulled on to be able to quantify this is human movement data.
More simply, we can call this cell phone tracking data. Everybody has a cell phone and what you can do is get data that tracks the location of these devices. Now the way this occurs, and especially in the case of what we’re using here, Veraset data, this is background application tracking. So when you install, let’s say, a weather application, and it asks you, do you allow the application to track location? This location tracking then is what Veraset is using. Now as you can well imagine, this doesn’t represent everybody. Based on where it’s at, we can get around 8% market penetration. Now this varies a little bit depending on where you’re looking and what time frame it is. And so, the best that we can call this is a representative sample, but as a representative sample it can convey a tremendous amount of information. And even though we’re looking at less than 10% market penetration, these are individual records showing where the cell phone is located, and this is a lot of data, on the order of billions of records per day. So, with this in mind, we wanted to be able to understand where people are going from and where they’re going to, and we wanted to be able to put it on a map so that we can understand it. But with billions and billions of records per day, and we were trying to do analysis from the beginning of March, you can imagine, hundreds of billions of records. There was really no other way to do it than using Spark in a scaled environment, in which case we’re using Databricks to be able to do this.
– Okay, so this is essentially the general workflow, the approach that we took. The top three items here are powered by Databricks and our data toolkit, which again, sits natively in the cluster. The jar that gets loaded there is similar to our open source engine that powers things like AWS Athena and Presto, but it’s been enhanced to be even more performant inside of Databricks. And the bottom three here are powered by ArcGIS.
Essentially, we start from raw data, we apply this hexagon-based index to generalize the data, we build up the summarized origin-destination pairs for each hexagon, and then we bring that much smaller dataset into Esri, where we append our demographic data, visualize, and ultimately publish an interactive dashboard.
– So when we’re looking at this panel data, the raw data, what we’re looking at, as we were talking about before, is each one of these records is really nothing more than when did it happen, a unique identifier, the location of the data, and then finally, how accurate that location is. Because ultimately, there is a margin of error for how precisely you know where a device is, so what this does is give us a starting point. Ultimately though, there’s a lot of work that has to be done to be able to understand the relationships in this data so that we can then ultimately get our index. And since there are so many records, the first step that we want to do is to be able to understand the data in some sort of generalizable form. What we used is a hexagon index for this. This enables us to group them based on an area that’s roughly the size of 2/3 of a city block in New York City, just to give you a rough idea of what we’re looking at, and then from there, what this allows us to do is understand the relationship based on the origin and the destination. And in this case, what we refer to as the origin is where the device, and by proxy, a person, resides during the nighttime hours, and then everywhere that they go that is not during the nighttime hours then becomes a trip that they venture on. Specifically, we also examine how fast the device is moving because we don’t want to be looking at people driving down the interstate.
Ultimately what we want to do is understand the location of people that is relatively static, because that’s when people are at rest and have the potential to be interacting with other humans in a location other than their home.
– Okay, so let’s get to the good stuff. Here’s what’s happening inside of Databricks really and our workflow in order to build up that risk index. Essentially, we take the raw data and we filter it by significant dwells, where a device is seen multiple times in a given location. We bin those into these hexagons as Joel said, and we’re doing this at level nine, which is about a city block. We take those hexagons and we build up an origin cell, based on where they sleep, and a destination cell, and we total those up, so that at each hexagon we have a cumulative trip count for each origin-destination pair, for all permutations. And from there, we calculate that risk index, which is simply the number of trips times distance, and finally, we output this much smaller, reduced, clean dataset as a processed table for use by the GIS. And now, let’s take a look at the actual notebook.
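The notebook itself is not reproduced in this transcript. As a rough illustration only, the steps just described (filter to dwells, bin into cells, pair a nighttime origin with daytime destinations, then multiply trips by distance) might be sketched in plain Python as follows. This is not the speakers' Spark code: `cell_of` is a hypothetical square-grid stand-in for the hexagon index, and the night hours, the two-ping dwell threshold, and counting one trip per device per destination cell are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import asin, cos, radians, sin, sqrt

NIGHT_HOURS = set(range(22, 24)) | set(range(0, 6))  # assumed definition of "night"
MIN_PINGS_FOR_DWELL = 2  # assumed: seen at least twice in a cell counts as a dwell


def cell_of(lat, lon, precision=2):
    """Hypothetical stand-in for the hexagon index: snap to a lat/lon grid cell."""
    return (round(lat, precision), round(lon, precision))


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def risk_by_od_pair(pings):
    """pings: iterable of (device_id, hour_of_day, lat, lon) tuples.

    Returns {(origin_cell, destination_cell): trips * distance_km}.
    """
    # Count pings per device, cell, and night/day period.
    counts = Counter()
    for device, hour, lat, lon in pings:
        counts[(device, cell_of(lat, lon), hour in NIGHT_HOURS)] += 1

    # Keep only dwells, split into nighttime and daytime cells per device.
    night = defaultdict(Counter)
    day = defaultdict(set)
    for (device, cell, is_night), n in counts.items():
        if n < MIN_PINGS_FOR_DWELL:
            continue
        if is_night:
            night[device][cell] += n
        else:
            day[device].add(cell)

    # Origin = the cell with the most nighttime pings; every other daytime
    # dwell cell is a destination. Total the trips per origin-destination pair.
    trips = Counter()
    for device, cells in night.items():
        origin = cells.most_common(1)[0][0]
        for dest in day.get(device, ()):
            if dest != origin:
                trips[(origin, dest)] += 1

    # Risk index as stated in the talk: number of trips times distance.
    return {od: n * haversine_km(*od) for od, n in trips.items()}


if __name__ == "__main__":
    sample = [
        ("a", 23, 45.52, -122.68), ("a", 2, 45.52, -122.68),   # device a at home
        ("a", 10, 45.60, -122.60), ("a", 11, 45.60, -122.60),  # daytime dwell
        ("a", 15, 45.70, -122.50),                             # single ping, dropped
        ("b", 1, 45.52, -122.68), ("b", 3, 45.52, -122.68),    # device b at home
        ("b", 12, 45.60, -122.60), ("b", 13, 45.60, -122.60),  # daytime dwell
    ]
    for (origin, dest), risk in risk_by_od_pair(sample).items():
        print(origin, "->", dest, round(risk, 2))
```

At the scale described in the talk, the same grouping and counting would run as distributed aggregations rather than in-memory dictionaries; the sketch only shows the shape of the computation.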
So again, here is that multidimensional data in this interactive dashboard and I can see this is Detroit for instance, and I can see areas of risk, I can see contributions over here in terms of which of these different tapestry segments are contributing and I can just bounce around to a couple cities. I’ll go to Boston next. And we see a very different pattern. Each city is unique. Here are some clusters. There’s a large cluster down here which could be a risk. Let’s jump over to New York here. And here we see a story about Manhattan and we see these different contributors, but what’s interesting is let’s take these high rise renters and filter on that. So again, I can just filter again on the dashboard and show just different segments. Up here in the northern part of the city we see these high rise renters and there’s a large cluster up here. I may want to think about who are these people and as a policy maker or decision maker, how might I message to them? So we just look at this chart here or this little infographic shows who those high rise renters are from the segment. So we see median age 32, we see that they are relatively low income, much of which goes to rent. We see many single parents and we can just sort of explore who those people are that are occupying that cell and then as I said, make certain decisions or policies or messaging about how to reduce that risk.
So here we are back on the workflow and I guess what I wanna say is this analysis would not have been possible without that combination, magic combination really, of distributed processing on the Databricks front and the enrichment and visualization capabilities on the GIS. That’s really allowed us to take very raw, high volume data, and bring some meaning and understanding to the data, and then allowed that to be easily shared in a consumable form inside of a community. And that same workflow of going from raw data to these visualizations, using the same methodology, could be applied to a ton of industries, whether that’s telecom data or looking at movement data as Joel has done for things like retail openings and site selections. It’s really almost like a human weather pattern that you’re analyzing and I gotta say, I love the collaboration capability of Databricks and being able to build these notebooks up.
But if I’m honest, Joel’s the one that spent most of the time in the notebook and so, how was your experience Joel?
– So I think in closing, one of the more important things to emphasize, and this is what Jim was alluding to, and he always likes to laugh at me about this, is that I didn’t get into this because I was that interested in big data. What really interested me was geography; I’m a geographer. But before that, the emphasis I always like to make is that if I can do this, you can do it, and here’s the reason why: I have a degree in parks, recreation, and tourism. I discovered geography almost by accident and then I kinda fell sideways into big data analysis because ultimately, what I do the most of is geography. I had a problem I needed to solve. I needed to understand where people were going from, where they were going to, and in what magnitude, in a way that we could quantify risk. I was able to frame a geographic problem and I needed to solve it. The only real way to do this was taking advantage of the scalable architecture, in concert with Esri’s technology, to be able to distill this down into something meaningful that we can then take and put in the GIS to add even more context to it. ’Cause ultimately the name of the game is being able to take data and make information out of it, and that really starts with being able to first understand the problem. So the emphasis I’m really putting on here is that I was able to do this. I’m not a Databricks expert. I will freely concede that. I started doing this about five weeks ago. And ultimately, I was able to put this all together and get something up and running. I didn’t do it alone. Jim obviously helped me a lot, but this really was an idea that came to fruition because I had a need and then reached out and found the right technologies and ultimately, the people to help me get over the humps to be able to do this.
So really, this was the type of thing where the combination of these two is really greater than the sum of the parts because we have a scalable ability with the context of geography to be able to understand this problem in a very meaningful way. So, with that, thank you so much for your time.
Joel specializes in finding answers using geography, specifically deriving actionable information from geographic data. Almost all data has some geographic relevance. However, defining the geography in the right context to discover the correct geographic relevance is somewhat more challenging. Joel has spent the better part of his career working with Geographic Information Systems (GIS) to unearth information from these geographic relationships. As the size of data has grown, the demands on technology have increased exponentially, and the technology has had to evolve. This has led Joel into the world of big data to continue to apply geography, but at a much larger scale.
Jim Young is a business development lead for Esri focused on big data and AI. He is working with tech companies and developers to explore the use of location-aware APIs and spatial analytics in their products and apps. His passion is the intersection of physical and digital, focusing on computer vision, sensor networks and location services. A pioneer in mobile social networks, Jim founded location-based Jambo Networks before joining Esri. He earned a master's degree in GIS from Cambridge University and holds a bachelor's degree in history and economics from Southern Methodist University.