Overhead imagery from satellites and drones has entered the mainstream of how we explore, understand, and tell stories about our world. These images are undeniable and arresting descriptions of cultural events, environmental disasters, economic shifts, and more. Data scientists recognize that their value goes far beyond anecdotal storytelling: they are unstructured data full of distinctive patterns in a high-dimensional space. With machine learning, we can extract structured data from the vast set of imagery available. RasterFrames extends Spark SQL with a strong Python API to enable processing of satellite, drone, and other spatial image data. This talk will discuss the fundamental ideas needed to make sense of this imagery data. We will discuss how RasterFrames' custom DataSource exploits convergent trends in how public and private providers publish images. Through deep Spark SQL integration, RasterFrames lets users consider imagery and other location-aware data sets in their existing data pipelines. RasterFrames fully supports Spark ML and interoperates smoothly with TensorFlow, Keras, and PyTorch. To crystallize these ideas, we will discuss a practical data science case study using overhead imagery in PySpark.
– Hi, Spark AI Summit. My name’s Jason Brown, and I’m a Senior Data Scientist at Astraea. Astraea is a startup in Virginia in the US, and we are working on building a cloud platform and web apps to enable data scientists and other analysts to discover, analyze, and work with satellite and drone imagery, that is, overhead imagery. In this talk, I expect that the audience will be familiar with some of the basics of data science and Apache Spark, but has probably never worked with overhead imagery in practice. At our company, our technology stack depends heavily on Apache Spark, and in order to work with this kind of image data in Spark, we have built an open source library called RasterFrames. That’s gonna be the focus of a lot of my talk today. But first let’s talk about overhead imagery. You’re probably familiar with it, if from nothing else, from the maps app on your phone. It’s a very useful way to get around, but there’s a lot more to it.
We also have this imagery as art. This is a print shop: you can buy these and hang them on your wall, and they all come from the Daily Overview Instagram account. Beautiful images there. And we have this imagery as journalism.
It’s often very powerful for storytelling; you can see dramatic changes and know when they happened.
It’s a very useful and powerful way to explain and understand what’s going on in our world. And that kind of thinking about imagery is very common in defense and intelligence. There’s a long history of working with this kind of data in military and intelligence agencies for the same reasons: to understand who, what, why, when, where, and how. Another community that has a long history of using satellite data in particular is Earth science. The sequence of images that you see here comes from a satellite called Sentinel-5P, which measures the concentration of pollutants in the atmosphere from space, which is pretty amazing. This is a series of images from the first weeks of 2019, and then the first weeks of 2020, around Wuhan in China. You can see the dramatic decrease in nitrogen dioxide pollution in that area while it was under a very strict lockdown. So as a data scientist, you look at all of these images and these kinds of examples and think: there’s great signal in this data. There’s a lot of useful information in here. And the other aspect of that is that it’s big data.
And it’s big data in really all of the ways that we talk about the big data Vs. The variety of this data is staggering: not only is it very different if you’re coming from working with other kinds of data, like transaction data, but even within the remote sensing community there are a lot of different names in use depending on folks’ backgrounds, a lot of different file formats at play, and a lot of different ways folks have published this data in the past. The velocity of the data is pretty astounding. There’s one mission run by the European Space Agency called Sentinel-2: a pair of satellites that work together to image the entire land mass of the earth every five days, generating about six terabytes of imagery every single day. There are also government operators of satellites in the US, Japan, and other countries, as well as many private companies small and large. So in total there are hundreds of satellites taking images of the earth every day, and on top of that we have an unknowable number of drone operators taking these kinds of images. From all of that velocity, you get a very high volume of data. The chart that I’m showing you is a couple of years old and shows exponential growth in the US NASA archives.
And at that point it was approaching 30 petabytes.
So there’s certainly a lot of this data, and it’s becoming more and more accessible. In the US and in Europe, there are open data policies that govern most of the science data that they produce and collect. There are also very significant efforts in the community to create open standards for how this data is accessed and cataloged in the cloud, which is really exciting in terms of democratizing access to this data.
So what about data science? With all of the signal in these sequences of images, and all of the imagery that’s available, at Astraea we started looking around and asked: what can we use to really exploit this data? I’ll highlight one library that’s very important to how we do this, called GDAL. It’s very mature, but it’s really a lot of low-level operations: file IO and transformations on the data. You need to be able to do that to process this data, but in our view, it’s not enough to really do the things that data scientists want to do. So RasterFrames comes about when we ask a couple of questions: one, what makes this overhead imagery data special, and two, what affordances can we bring to bear for processing this data in generalist tools? And here, I’m talking about Apache Spark.
So let’s answer that question: what makes this data special? Let’s start from the data scientist’s mental model of any image. If you’re using Python’s PIL, or working with the Pillow library, you pretty much have a multi-dimensional array, a tensor of data that’s three-dimensional: height, width, and a number of channels. If it’s one channel, it’s grayscale; if it’s three channels, it’s red, green, and blue.
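As a minimal sketch of that mental model (the filename here is hypothetical):

```python
import numpy as np
from PIL import Image

# An ordinary image loads as a 3D array: height x width x channels.
img = np.asarray(Image.open('scene.png'))  # hypothetical file
print(img.shape)  # e.g. (1080, 1920, 3) for an RGB image
print(img.dtype)  # typically uint8: 8 bits per channel
```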
Starting from that idea, overhead imagery is different in how it’s published and what you find in practice. The images tend to be enormous, like 100 megapixels or larger.
Each channel typically carries much, much more information: 16 bits per channel. Channels may be stored in different files, and there are often many more channels. I mentioned Sentinel-2: it has both ultraviolet and infrared data that come to eight or ten bands.
On top of those differences, there are other important factors to consider. The location that the image describes is very important, and is key to helping us understand how this data interacts with the other data in our analysis.
So we need to understand how that location data is described, and we need to be able to make sure that we track it closely.
The second thing is that there’s a lot of other metadata attached to each image: when the image was acquired, by what sensor, who processed it, and so on. And then there’s often some special metadata that helps us understand exactly how to interpret the values in the array. So we’ll dig in a little bit more on that idea of location. To understand the location of an overhead image, we need to know two things. The first is a projection, or a CRS, and I recommend the video that I have linked here.
This poor gentleman is trying to take a globe and turn it into a flat map, and that’s basically what a projection or a coordinate reference system does: it’s an algorithm that helps us turn a 2D coordinate system point into a location on the surface of a 3D earth model. In practice, for us, that’s going to be just a string that names the algorithm and maybe gives a couple of parameters.
The next thing that we need is a slightly scary-sounding thing called the affine transform. All that does is tell us how to go from the tensor space, the array space, into our CRS’s 2D coordinate plane. So with those two things, the affine transform and the projection, we can go from tensor space (a pixel) to an arbitrary 2D plane to an exact location on the 3D surface of the earth.
And in general, we can represent that transform with just the southwest and northeast corners of the image, if we know which image we’re referring to.
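As a minimal sketch of that idea, here is the pixel-to-CRS math for a north-up image described only by its extent (those two corners) and its pixel dimensions; all names here are illustrative:

```python
def pixel_to_crs(col, row, extent, width, height):
    """Map a pixel (col, row) to coordinates in the image's CRS.

    Assumes a north-up image with no rotation or shear, so the affine
    transform reduces to a scale and offset derived from the extent.
    extent: (xmin, ymin, xmax, ymax) in CRS units, i.e. the southwest
    and northeast corners; width, height: image size in pixels.
    """
    xmin, ymin, xmax, ymax = extent
    x_res = (xmax - xmin) / width    # CRS units per pixel, east-west
    y_res = (ymax - ymin) / height   # CRS units per pixel, north-south
    x = xmin + col * x_res           # column 0 is the west edge
    y = ymax - row * y_res           # row 0 is the north edge
    return x, y

# Pixel (0, 0) of a 100x100 image covering one UTM kilometer:
print(pixel_to_crs(0, 0, (500000, 4000000, 501000, 4001000), 100, 100))
# -> (500000.0, 4001000.0), the northwest corner
```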
So we take all of those things into consideration, and in RasterFrames we propose and implement the following data model.
If you look on the left-hand side here, we have something like a physical model of one really large image. This might be a 100-megapixel image, and it has four channels, represented by the different colors that you see. We’re gonna take that really large image, which is too big to effectively process in a single Spark task, and break it up into smaller sections that we call tiles. Each channel is broken up the same way, and each tile becomes a value in a row of our Apache Spark SQL DataFrame, with a different column for each channel. And then we have the associated metadata sitting in other columns that we can reason over. We propose that this is a very powerful data model, not just for one image, but in cases where you have perhaps many images that cover different times and slightly overlapping locations and are otherwise not homogeneous: because those variations are captured by the columns, we can start reasoning over them in a really powerful way.
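As a minimal sketch of that model in action, here is how reading one large image lands it in a DataFrame of tiles, assuming pyrasterframes is installed; the GeoTIFF URL is hypothetical:

```python
from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()

# One large scene is read as many rows, each holding one tile.
df = spark.read.raster('https://example.com/large_scene.tif')  # hypothetical URL
df.printSchema()   # a tile column (with CRS and extent) plus the source path
print(df.count())  # many rows: one per tile the scene was broken into
```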
So the whole idea of RasterFrames is taking that data frame model and making it a practical reality. The first thing is to make it a practical reality for data scientists, and that means having a really strong Python API.
And we have a custom data source to read this data, which uses GDAL under the hood as a way to read raster data.
We have a tile UDT, a user-defined type, that carries the tensor or array of the image data, as well as the location data for that small section of the image. In Python, when you bring that over to the Python driver, we see it as a NumPy ndarray, which is about what you’d expect. And then in RasterFrames, we have about 200 different column functions to operate on either the location or tensor elements of tiles and to work with other related types.
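As a small taste, here are a few of those column functions applied to the DataFrame from the earlier sketch; the proj_raster column name is what the raster reader produces by default, but treat the details as a sketch:

```python
from pyrasterframes.rasterfunctions import (
    rf_extent, rf_crs, rf_dimensions, rf_tile_mean)

# A handful of the ~200 column functions, applied to the tile column.
df.select(
    rf_extent('proj_raster').alias('extent'),     # tile footprint, CRS units
    rf_crs('proj_raster').alias('crs'),           # coordinate reference system
    rf_dimensions('proj_raster').alias('dims'),   # tile width/height in pixels
    rf_tile_mean('proj_raster').alias('mean'),    # mean of the tile's cells
).show(5)
```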
And we’ll walk through a fuller code demo in a few minutes to show what that looks like in practice.
We also have spark.ml transformers that help us use this tile UDT in our spark.ml pipelines. And then we have support for writing out and visualizing a tile UDT, which is particularly important in an IPython or Jupyter context.
And so that’s what we’re going to do next: we’ll show you a live demo of this. If you go to my GitHub Gist, you can see the notebook code that we have for this, as well as some supporting data files.
So here’s our demo notebook, which you can find in that Gist. What we’re going to do is read a dataset from Kaggle: California wildfire incidents from 2013 to present. It’s one CSV file with 1,600 rows, and you can see the columns here. Important to highlight: we have a longitude and latitude, so a point location for the fire, and then a number of acres burned. We defined a function above that basically helps us create a bubble around that point location with about the same area as the acres burned. You can see the other kinds of information that we have: some of it is administrative, some of it is about the resources that were used to fight the fire, and some of it is about the damage caused: injuries, fatalities, and the number of structures affected by the fire. So an interesting data set, and we’re gonna try to find some satellite data to enhance what’s here.
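A minimal sketch of what such a bubble function might look like, using shapely; the function name and the rough meters-to-degrees conversion are illustrative, not the notebook’s exact code:

```python
import math
from shapely.geometry import Point

ACRE_M2 = 4046.86  # square meters per acre

def fire_bubble(lon, lat, acres):
    # Radius of a circle whose area equals the burned acreage
    radius_m = math.sqrt(acres * ACRE_M2 / math.pi)
    # Crude meters-to-degrees conversion (~111 km per degree of latitude);
    # good enough for an illustrative buffer, not for precise geodesy
    radius_deg = radius_m / 111320.0
    return Point(lon, lat).buffer(radius_deg)
```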
So we have a few functions here to clean up a few things, and then we’re gonna select down to just the 2019 fires to help this thing move along a little more quickly. We’re gonna look up some satellite imagery using my company Astraea’s proprietary Earth AI library.
The JupyterLab notebook that we’re in right now is called Earth AI Notebook. You can get a free trial of that on our website to run all of this code yourself, and we’ll have some details later. The first thing we’ll do is look for collections that are available; we have maybe a dozen different satellite missions that we have data about. The one we’re showing here is called MODIS MOD11A1.
It’s land surface temperature: basically, you can think of it as a thermal camera that’s pointing at the earth and imaging it, I think, every day and night, so it’s quite quick in its revisits. Here we can see the bands, or channels, associated with that mission. We’re gonna use the land surface temperature daytime channel and leave the others as an exercise for you to explore later, but you can see some of them are metadata related to the image, and there’s also a nighttime temperature. The next thing we’ll do is, for every fire, take the geometry bubble we have, plus a start date from our Kaggle data and a date the fire was extinguished, and query this catalog for any image from MOD11A1 that intersects that bubble, falls between those start and extinguish dates, and has low cloud cover. And if you don’t want to worry about doing this or getting into the proprietary libraries, I’ve saved this result as a GeoJSON file, and that’s attached to that Gist as well.
So you can start from this point if you’d like. In the catalog of data that we’ve read from Earth OnDemand, we have the unique ID from the Kaggle fire data, the date the image was taken, some metadata (in this case, the projection of the data), and then the land surface temperature, represented here as a URL: a link to an S3 object.
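If you’d rather start from the saved file, a minimal sketch of loading it with geopandas; the filename is hypothetical:

```python
import geopandas as gpd

# Catalog of matching MOD11A1 images: one row per (fire, image) pair,
# with the fire's Kaggle ID, the image date, and the S3 URL of the band.
catalog = gpd.read_file('mod11a1_fire_catalog.geojson')  # hypothetical filename
print(catalog.columns)
```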
So the next step in our analysis is to merge our Kaggle fire data with these image URLs and the other image metadata. And finally we’ll get into Apache Spark and do spark.read.raster: we’re gonna read this merged data frame and tell it to treat the LSTD column of S3 URLs as the images that we wanna read. It’s gonna create a lazily evaluated tile column from each path, and store the path in a new column.
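A minimal sketch of that read, assuming the merged catalog DataFrame is called merged and the URL column is LSTD; the catalog_col_names parameter is from the RasterFrames raster reader, but treat the exact call as a sketch:

```python
# Read every URL in the LSTD column as a lazily evaluated tile column;
# all other columns of the merged catalog ride along in the DataFrame.
df = spark.read.raster(merged, catalog_col_names=['LSTD'])
df.printSchema()
```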
You can see our schema has this tile column: a struct with the tile, the CRS, and the extent, as we discussed before. The camelCase fields are from the Kaggle data, and below that the snake_case fields are from our Earth OnDemand catalog, so information about the images themselves. So now our data frame has one or more rows for each fire, depending on how many images we found. The next thing that we need to do (we’ll sketch the details below) is use some of our column functions to limit our data frame to only the tiles that intersect with our fire bubble, which is what the df.geos field contains. After that, we have 625 rows remaining in our data frame. Now we can quickly inspect the data frame. This is part of the RasterFrames library’s interoperability with IPython: our tile column is rendered as an image. The clear portions of the image are where data is missing; that could be due to clouds or pixels that are over water. We also see details about the fire, like the name and the dates it started and was extinguished, plus some details about the location of the image and the date the image was taken. This one was from the second day of this Hill fire.
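Here is a sketch of that filtering step, using RasterFrames’ geometry functions. Tile footprints come back in the raster’s native CRS, so we reproject them to the lon/lat CRS of the geos column before testing intersection; treat the details as a sketch:

```python
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import (
    rf_geometry, rf_crs, st_reproject, st_intersects)

# Footprint of each tile, reprojected into EPSG:4326 to match `geos`
footprint = st_reproject(rf_geometry(df.LSTD), rf_crs(df.LSTD), lit('EPSG:4326'))

# Keep only tiles that actually overlap the fire's bubble
df = df.where(st_intersects(footprint, df.geos))
```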
We can also take a look at an individual tile in the Python driver. You can see that under the SQL row it’s just a masked array: a NumPy masked array of unsigned 16-bit integers. A masked array, if you’ve never worked with one, is basically an array with a boolean array of the same shape alongside it that tells you where the values are missing.
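A tiny illustration of a masked array in NumPy:

```python
import numpy as np

# The mask marks missing cells (clouds, water); statistics skip them.
t = np.ma.masked_array(
    data=[[14500, 14620], [0, 14710]],
    mask=[[False, False], [True, False]],
    dtype='uint16')
print(t.mean())  # averages only the three unmasked cells
```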
From this unsigned integer imagery, we wanna get the temperature data in Kelvin. In the metadata for MOD11A1, we can find that we multiply those integer values by 0.02, and that’s what this RasterFrames column function will do: a cell-wise multiplication. Now we have a new tile column with that data. Then, to finish up our example, we’re gonna make an aggregate across each fire. We have some of the fire columns that we’d like to group by and select, and then I’ll draw your attention to our aggregate: rf_agg_stats is going to operate on that tile column, and across all of the rows in the group and all of the cells, it’s gonna find things like the minimum value and the maximum value, and report something like the number of cells that had valid values, and so on.
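A sketch of those two steps: the cell-wise scaling into Kelvin, then the grouped aggregate. The column and grouping names here follow the transcript loosely; treat the whole thing as a sketch:

```python
from pyspark.sql.functions import lit, col
from pyrasterframes.rasterfunctions import rf_local_multiply, rf_agg_stats

# Cell-wise multiply by the MOD11A1 scale factor to get Kelvin
kelvin = df.withColumn('lst_kelvin', rf_local_multiply(df.LSTD, lit(0.02)))

# One row per fire: stats across every cell of every intersecting tile
summary = (kelvin
           .groupBy('Name', 'Started', 'Extinguished')  # fire columns, per the transcript
           .agg(rf_agg_stats('lst_kelvin').alias('stats'))
           .select('Name', 'Started', 'Extinguished',
                   col('stats.min'), col('stats.max'), col('stats.data_cells')))
summary.toPandas()
```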
So we’ll do that calculation, bring it into a pandas DataFrame, and then show you the summary of that.
Just a moment here.
There we go. So you can see we have some descriptive information about the fire, and then our maximum temperature readings. That’s the maximum from all of the images that we have, across any of the cells, the pixels, that intersected that bubble. And then we have some other details; some of them are pretty sparse. We have some other details from the Kaggle data about the resources used and the effects of the fire.
So with that, we’re ready to wrap up, I think, and invite you to come check out our project.
It’s under a commercial-friendly license, the Apache 2.0 license, the same as Spark, and it’s under the Eclipse Foundation’s LocationTech, which provides some IP governance. You can check out the project’s website or look for us on PyPI. You can go to Astraea’s trial page to get a free demo of that notebook environment that we were just using. And we’d invite you to come with any questions or ideas to our Gitter channel, or to our GitHub page for issues or pull requests. Thanks very much for your time, and we hope to see you on our Gitter and other places.
Astraea, Inc
Jason T. Brown is a Senior Data Scientist at Astraea, Inc. applying machine learning to Earth-observing data to provide actionable insights to clients' and partners' challenges. He brings a background in mathematical modeling and statistics together with an appreciation for data visualization, geography, and software development.