Geospatial pipelines in Apache Spark are difficult because of the diversity of datasets and the challenge of harmonizing on a single dataframe. We have worked over the past year to review different pipeline tools that allow us to quickly combine steps to create new workflows or operate on new datasets. We have reviewed Dagster, Apache Spark MLflow pipelines, Prefect, and our own custom solutions. The talk will go over the pros and cons of each of these solutions and will show an actionable workflow implementation that any geospatial analyst can leverage. We will show how we can leverage a pipeline to run a traditional geospatial hotspot analysis. Interactive mapping within the Databricks platform will be demonstrated.
Dan Corbiani: Hi everyone. My name is Dan Corbiani. I’m a data scientist at PNNL. Today we’ll be talking about creating reusable geospatial pipelines. Most of what I talk about today will be applicable to other domains as well, but there are some extra nuances for geospatial work, which we’ll cover in this talk.
So the goal of today’s talk is to really understand some of the pipeline options that are out there and some of the pitfalls that you may encounter as you develop some of these pipelines. There are just a lot of nuances when you start to work with pipelines, and I’m hoping that the examples that we’re able to present to you today will help you avoid those issues.
So quick agenda, first we’re going to go through and talk about the value of pipelines. Pipelines almost always add complexity to your work, so, we really need to understand what we’re going to get for that complexity. Next, we’re going to talk a little bit about some of the available pipeline technologies that are out there. We could probably spend the next year talking about pipeline technologies. So we’re going to share with you the short list that we came down to and some of the requirements that we used to make our decisions. The ones that we used may not be the best for you, but you’ll understand our thought process a little bit.
Then I’ll deep dive into some of the implementations. So this is where I’ll go through some of the pitfalls that we ran into. There’ll be lots of code on the slides. It’s going to be pretty detailed, so if you’re a manager, I’m sorry, maybe tune that part out, fast forward through it. And then the next part, we’re going to go through demonstrations. It’s geospatial so I feel like I would be a failure if I didn’t [inaudible] some pretty maps for sure, and that’s what we’ll do inside of the demonstrations.
So first off, let’s talk about why pipelines. Why are we even having this talk? Pipelines add complexity, as I said, but they also offer a lot of value. They make things more explainable, they make things more repeatable. You can configure them, you can test them. There are a lot of business logic things that pipelines can provide, that just pure code inside of a notebook or a library doesn’t necessarily provide.
So let’s walk through some of these in pictures because I think pictures can tell more than me just waving my hands.
So the first thing I want to talk about is explainable. So, I don’t know if anybody else has had this experience, but people will fire up a notebook, they get excited, it looks clean on the side, and they’re like, okay, I’m going to go get some data. Oh, I need this other data so I’m going to grab dataset B. Then I’m going to merge them together, I’m going to do an analytic and a visualization. So they set up their notebook with headers, it’s really pretty, it’s like a report, they’re excited, they get going. And then a few hours or days later, they have 70 cells in the notebook, it’s all over the place. And the original vision of that pipeline has started to go away. And that’s not necessarily a bad thing. I mean, it makes sense. As you do things that are more complex, you might have more code.
But we really want to be able to capture that underlying structure of what’s going on. We want to be able to put that workflow up on a whiteboard or something so we can start to discuss it and figure out what we can do, how we can augment it, what’s going on. So, a pipeline can help us quickly distill something like a notebook on the side that’s really messy and all over the place into a simple workflow on the right that we would talk to SMEs about.
The next thing that I’ll talk about is configurability. So these two workflows look very similar, and that’s on purpose, because maybe on the left, somebody asks for us to create a map using the COVID data from Johns Hopkins. And the Johns Hopkins data doesn’t have geospatial information with it. So we need to go get some county information from the US Census. So we’re going to go get those two data sets, we’re going to merge them together and we’re going to create a visual. So that’s great.
What if we want to swap out the Johns Hopkins data for the COVID tracking data? We want to be able to create workflows and pipelines that are configurable enough that we can swap that out and the overall structure looks exactly the same. And that swapping out can be really quick and easy. That’s what we really strive for when we move over into the configurability.
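A minimal sketch of that swap, in plain Python rather than Spark: the workflow takes its case-data loader as a parameter, so the merge and the overall structure stay the same whichever source is plugged in. The loader names, county codes, and counts below are hypothetical placeholders, not real feeds or real data.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def load_cases_source_a() -> List[Record]:
    # Hypothetical stand-in for one case-count feed (made-up numbers).
    return [{"county_id": "001", "cases": 10}, {"county_id": "002", "cases": 4}]


def load_cases_source_b() -> List[Record]:
    # Hypothetical stand-in for a second feed with the same column names.
    return [{"county_id": "001", "cases": 12}]


def load_counties() -> List[Record]:
    # Hypothetical stand-in for the county reference table.
    return [
        {"county_id": "001", "name": "County A"},
        {"county_id": "002", "name": "County B"},
    ]


def merge_on_county(metrics: List[Record], counties: List[Record]) -> List[Record]:
    # Attach county attributes to every metric row that has a matching county.
    by_id = {c["county_id"]: c for c in counties}
    return [{**by_id[m["county_id"]], **m} for m in metrics if m["county_id"] in by_id]


def build_map_table(load_cases: Callable[[], List[Record]]) -> List[Record]:
    # The structure is fixed; only the case-data loader is configurable.
    return merge_on_county(load_cases(), load_counties())


table_a = build_map_table(load_cases_source_a)
table_b = build_map_table(load_cases_source_b)
```

Swapping sources is then a one-argument change, which is the property the talk is asking of a configurable pipeline.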
That configurability gives us a little bit more as well. It starts to compartmentalize our workflows into usable functions. And those functions can be testable, which is really handy. We do lots of unit tests, we do integration tests, we do tests all over the place. But one of the things that I find a pipeline helps us do is focus those tests around a usable block that an end user is going to really want to leverage. And once you start to do configuration around those, they become really powerful and really reusable.
So the next thing, and I’m going to try to point through this, I can’t but bear with me for a second. This is the general circle that I, half circle, whatever, that I’ve been seeing people do when it comes to an analytic question and a notebook. They start with an analytic question. How does COVID correlate with race or something like that. A good question. Let’s go answer it. So, how do people do it? They fire up a notebook, they go with it, they start to get data from different places. It’s a mess. They might run cell five and then three, and that’s the only way that it works. And they do a demonstration to their colleagues and everyone’s happy and they’re excited.
So, they realize that they have an answer, they have the base for it. So they’ll clean the notebook up a little bit, they’ve got their answer and they’ll make a presentation. They’ll copy the figures out, whatever it may be, and they’ll present it. And this is great, this is fantastic. Somebody got an answer to an analytic question that they posed. But I think we can do better. I think we can do one step further than this, and I think pipelines help us do this.
So with a pipeline, I think we start to follow this other outer circle as well. We still do the presentation, we still follow all of that same process. But once we get that answer, we decompose the notebook and we think about, well, what other functional elements are in this notebook that we could leverage in other places? So maybe somebody wrote something to get data from the COVID tracking project. Maybe they wrote a new analytic inside of that notebook. When we decompose it into those functional elements, we’re able to start a reusability cycle, which is what goes around the side.
So we take that function and we put it through a linter immediately because Black catches a lot of errors and flake8 is super picky and picks up a lot of other things. And going through that linter process is really just a free peer review. But then we do give that function to a peer and they look through it. Does that analytic make sense? Is that something that other people will use? And we really talk about that little function.
Then that function once it’s gone through peer review, people know what it does. We put it through automated testing. And this doesn’t have to be super complicated, but it just makes sure that that function does what we think it’s going to do and we make sure that it doesn’t regress the next time somebody else uses it. Then we start to document that function. Make sure that when somebody pulls this up, they know what it does, they know what the input is, they know what’s going to come out of it.
And then critically, that whole cycle goes into a deployed library, that goes right back onto the workspace. So now when other people go on and they get their next analytic question, how does COVID correlate with something else, some other sector in the economy, they don’t have to rewrite the analytic, they don’t have to rewrite the piece that brings in the COVID data. That stuff’s all available. It’s all been cleaned, it has testing so we know that it works. And I think that this is kind of the cycle that a pipeline helps you formulate.
So now that we’ve talked a little bit about why pipelines, now the question is what pipeline. There are so many out there, so let’s talk through some of the requirements that we had for a pipeline. So, this might seem stupid, but first off, we want working documentation. We found the documentation can go stale very quickly despite everybody’s best efforts. That can be an issue, that can cause a lot of frustration and challenge with the development team. So we wanted to make sure that the documentation worked, we could get through hello world, we didn’t stumble too badly.
The next thing that we wanted was we wanted to make sure there were at least a couple of hundred stars on GitHub. There’s a lot of really impactful stuff on GitHub that doesn’t necessarily have 37,000 stars on it. So we didn’t want to say that we needed too many, but it needed to be enough to know that there was a community.
And that leads into the active development piece. We wanted to make sure that there was at least a commit over the past month, and that there was either a Slack or Gitter channel, something to show that there was a community that would help support us. And that we knew we weren’t going to get stranded on an island.
So these last two requirements that we had are kind of specific to us, kind of specific to geospatial. So the first one was we wanted to support our analysts by making sure that this workflow works natively inside of a Databricks environment. So there shouldn’t have been an extra server that needed to be stood up or anything else that needed to happen. We wanted to make sure that we could just give people Databricks notebooks and they could work with it and they could get going, they could create pipelines, they could visualize the pipelines.
The next thing is kind of geospatial, in that we wanted to make sure that this pipeline would work on a variety of things. So we wanted to make sure that it worked on things like GraphFrames because some of our algorithms work with GraphFrames. We wanted to make sure that it worked with the SQL API because the SQL API is fantastic with Apache Sedona. And obviously it needed to work with core DataFrames. If it worked with other things that was great, but we just wanted to make sure that it was really flexible within Spark.
So let’s talk quickly about our pipeline technology shortlist. We’ve got four on here and they’re not really in any particular order it turns out. The first one we looked at was PySpark ML Pipelines. And that’s just because that’s been around forever, it’s deeply integrated into Spark. And frankly, it works really well. It’s very repeatable. It has a good user base. So we started there, we started playing with it.
Then we looked at kind of the Airflow and Prefect combination. So I’ve used Airflow in the past, Airflow has been gaining acceptance in some cloud communities. It’s an Apache project so it’s well supported. But there are some things about Airflow that just don’t really click with me sometimes. It’s sometimes a little challenging to use because of its nature of timestamps. I was very excited about Prefect because Prefect fixes a lot of those issues. In fact, it was made by the core developers of Airflow. So Prefect works really well, but has some of the same deployment pieces that Airflow has in terms of the scheduler and the UI.
So the other thing that we looked at was Dagster. Dagster is dead simple. It works on a command line, it works inside of Python, it has its own UI if you want to use it. But Dagster was very, very simple and worked inside of the notebook, which is why we decided to leverage it. I’ve used Prefect on other projects and it’s worked fantastically as well. So, please don’t let that steer you away from it.
So now I’m going to go into a deep dive of Spark ML Pipelines, and then also the Dagster Pipelines. There’s going to be a lot of code here. I think this will help those of you that are implementing these get past some of your initial pain points. So, Spark ML Pipelines, these work exceptionally well for functions that are applied to data frames. In their terminology, they would call these transformers. These were originally designed to prepare text for NLP analysis and that’s where the data frame in, data frame out piece comes from.
And in that, all workflows are linear. So I stole this diagram from their documentation because frankly, I wasn’t going to do a better job making it. And a pipeline has these transformers and you can chain one or more transformers together. They’ll take a data frame in like that raw text data frame, they’ll tokenize it, it’ll turn that data frame into the words data frame, and then you can put that through hashing and then you get feature vectors. And this is great. If you have a set of data frames that you need to do a quick transformation on, Spark ML Pipelines are great.
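The linear, table-in and table-out chain described above can be sketched without Spark. This is only a library-free illustration: rows are plain dictionaries standing in for a data frame, and SimpleTokenizer, SimpleHashingTF, and LinearPipeline are hypothetical names, not the PySpark classes.

```python
import zlib
from typing import Dict, List

Row = Dict[str, object]


class SimpleTokenizer:
    """Adds a 'words' column by lower-casing and splitting the 'text' column."""

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


class SimpleHashingTF:
    """Adds a 'features' column of term counts bucketed by a stable hash."""

    def __init__(self, num_features: int = 8):
        self.num_features = num_features

    def transform(self, rows: List[Row]) -> List[Row]:
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for word in r["words"]:
                bucket = zlib.crc32(word.encode("utf-8")) % self.num_features
                vec[bucket] += 1
            out.append({**r, "features": vec})
        return out


class LinearPipeline:
    """Runs stages strictly in order; each stage takes rows and returns rows."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = LinearPipeline([SimpleTokenizer(), SimpleHashingTF(num_features=8)])
result = pipeline.transform([{"text": "Spark spark pipelines"}])
```

Each stage only needs to agree on the columns it reads and writes, which is why this style suits adding or modifying columns and not much else.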
So let’s walk through how that might work. So, just a quick goal up front, our goal is to be able to create a custom transformer that’s going to accept parameters. So, they’ll accept parameters in the initialization, which is in the top left or top right rather. And then we want to make sure that it can accept parameters as part of the pipeline process, and that’s the bottom piece of code. So let’s walk through this, and again, there’s a lot of code here so I apologize.
So, in terms of a basic transformer, a basic transformer is very easy to create with Spark ML Pipelines. We’re super excited to use this. Very, very excited to see how quickly this came together. We can create this no value transformer, it inherits from the Transformer class. And from that, we just do a standard Python initialization. So we have the method that we’re going to use and the metric column that we want this no value transformer to work on. And then critically what we need to add is we need to implement the _transform method. And what this will do is this takes the Spark data frame in and it gives a Spark data frame out.
So, we try to capture all of our logic of the transformation inside of a different class. We found that that just works better for our workflows. So all this is doing is it’s taking in the data frame, it’s looking at the parameters that it was given, it’s processing that data frame and giving that back to the pipeline so that something else can take it from here. And this is awesome, this is your hello world for a Spark ML Pipeline transformer.
The next thing that we wanted to do though is we wanted to start enabling that parameterization. So we’re going to do parameterization over two slides here. So you’ll see that in this code base we changed a couple of things. The first thing that we did was we started to add the Param class. So this is where we start to tell Spark that there are parameters associated with this transformer and you can edit those. So that’s where on line 343 and 344 you see we have a Param that’s named method and we give it some sort of documentation, which is great because now Spark knows about that documentation, we can access it later. And then on 345 and 346, we start to set the defaults, because at this point, we don’t have defaults for that parameter. So we set the defaults based on the parameters thrown in [inaudible].
And then the next thing that we have to do, and this was one of the things that got us tripped up, is we need to use the getOrDefault method on the transformer class to get the name of the parameter that we want to play with. So, down there on line 354 and 355, you’ll see that we’re basically doing what we were doing before, but instead of self.method, we’re doing self.getOrDefault and then passing in method. Now those names need to be the same from the parameters, so whatever you call the self dot thing, doesn’t matter, it’s what you call the thing inside of the parameter. So that name method inside of the parameter on 343, that’s the one that needs to align with getOrDefault on 354.
So, we’re not quite there yet. This is where we kept getting tripped up. So, the second part that we need to do to make the Spark ML Pipelines work with a transformation is we need to add the setParams method. And the setParams method is the thing that allows Spark to dynamically change the parameters as part of a pipeline transformation. There’s a keyword_only decorator that’s on this, this is just a fancy way of making sure that it always gets called with keywords, and that makes the kwargs piece work with the private _set method.
So, if you’re doing this and you want to make the transformations work with parameters, copy this code, this will work really well. The setParams piece is the only thing that you need to add.
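To make the moving parts concrete, here is a library-free sketch of how the pieces fit together. The real versions are pyspark.keyword_only and the _setDefault, _set, and getOrDefault methods on pyspark.ml.param.Params; the keyword_only function, MiniParams, and FillMissingTransformer below are simplified, hypothetical stand-ins written for illustration, not the slide code.

```python
import functools


def keyword_only(func):
    """Simplified stand-in for pyspark.keyword_only: refuse positional
    arguments and stash the keyword arguments on the instance."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError(f"{func.__name__} only takes keyword arguments")
        self._input_kwargs = kwargs
        return func(self, **kwargs)

    return wrapper


class MiniParams:
    """Simplified stand-in for the bookkeeping done by pyspark.ml.param.Params."""

    def __init__(self):
        self._defaults = {}
        self._values = {}

    def _setDefault(self, **kwargs):
        self._defaults.update(kwargs)
        return self

    def _set(self, **kwargs):
        for name, value in kwargs.items():
            if name not in self._defaults:
                raise KeyError(f"unknown parameter: {name}")
            self._values[name] = value
        return self

    def getOrDefault(self, name):
        # Lookup is by the registered parameter name, not by attribute name.
        if name in self._values:
            return self._values[name]
        return self._defaults[name]


class FillMissingTransformer(MiniParams):
    @keyword_only
    def __init__(self, method="zero", metric_col="beds"):
        super().__init__()
        self._setDefault(method="zero", metric_col="beds")
        # Only the keywords the caller actually passed are set explicitly.
        self.setParams(**self._input_kwargs)

    @keyword_only
    def setParams(self, method="zero", metric_col="beds"):
        return self._set(**self._input_kwargs)

    def transform(self, rows):
        col = self.getOrDefault("metric_col")
        method = self.getOrDefault("method")
        if method == "zero":
            return [{**r, col: 0 if r.get(col) is None else r[col]} for r in rows]
        if method == "drop":
            return [r for r in rows if r.get(col) is not None]
        raise ValueError(f"unsupported method: {method}")
```

The point of the sketch is the split of responsibilities: the decorator captures what the caller passed, setParams forwards exactly that to the private setter, and the transform step reads values by name so that later changes are picked up.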
So, in conclusion, just talking about when you instantiate this, you need to make sure you have the init piece, the setParams so that Spark can change the parameters dynamically. And then the keyword_only piece is what will make sure that you have the kwargs in a clean way. And what that’ll do is on the left where you see the pipeline.fit.transform, you’re able to pass in a dictionary that you can change the parameters on the fly after you’ve already created your model.
So what does this look like in practice? I’ll go through a more detailed example in a few minutes, but we instantiate the transformer on 45, then we make it part of some stages in line 62 through 65. And then we’re able to create that model on line 73. We can then run explain stages, this is a very simple function that we wrote, that essentially just loops through each stage in the pipeline, prints the class that it’s using, and then it prints the name of that parameter, the description and any default value that is provided. It will also print out the value that it’s been changed to if that comes into play. But that’s been really handy, that makes explaining these super simple. You can define your pipeline way up at the top of the notebook, and then just say explain later and it’s very easy to understand exactly what’s going on.
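The explain-stages helper itself is not shown in the transcript, so the following is only a guess at its shape, written against hypothetical plain-Python stage objects rather than Spark stages. It loops over the stages and reports the class name, then each parameter's name, description, default, and any overridden value. PySpark's Params objects also expose explainParams(), which may cover part of this for a single stage.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ParamInfo:
    name: str
    doc: str
    default: Any


@dataclass
class Stage:
    # Hypothetical stage record: declared parameters plus any changed values.
    params: List[ParamInfo]
    overrides: Dict[str, Any] = field(default_factory=dict)


class HashCells(Stage):
    """Hypothetical stage name used only for this illustration."""


def explain_stages(stages: List[Stage]) -> str:
    lines = []
    for stage in stages:
        lines.append(type(stage).__name__)
        for p in stage.params:
            line = f"  {p.name}: {p.doc} (default: {p.default!r})"
            if p.name in stage.overrides:
                line += f" -> current: {stage.overrides[p.name]!r}"
            lines.append(line)
    return "\n".join(lines)


stages = [
    HashCells(
        params=[ParamInfo("resolution", "Hash cell resolution", 7)],
        overrides={"resolution": 5},
    )
]
report = explain_stages(stages)
```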
So now let’s talk a little bit about Dagster Pipelines. Dagster Pipelines are a lot more complicated, and there’s a lot more that you can do with it. It’s not just Spark, you can do dbt stuff, you can bring in Great Expectations, you can run Databricks jobs, you can run remote Spark instances. There’s a ton you can do with Dagster. It’s very similar to Prefect in that way frankly.
But with all that functionality comes some complications. So I find that the config in Dagster can be a little complicated, especially if you don’t let Dagster auto-configure it for you. I found that their display function that they had, that they pulled out in version 0.8 to print out the structure of the DAG inside of a notebook can be extremely helpful, so we just brought that back in. We found out that everything must be in the pipeline. Where with a PySpark ML pipeline, we’re able to go back and forth between data frames and do that quickly and easily, a Dagster Pipeline was a little bit more rigid in that way. So we’ll walk through how to make that a little bit more flexible.
And then lastly, as I kind of alluded to before, we try to keep as many of our analytics inside of classes as possible. With Dagster, that provides a few more issues because of the way that it creates its structure. So you have to kind of do some interesting things around classes and typing.
So really quickly, a hello world process inside of Dagster has a pipeline that’s in the top middle, and we’re just going to run through a couple of functions there. And then we have these solids, which are the pipeline steps. So there are two types of solids that we’ve found we use most often. There’s the Lambda solid, which doesn’t have any extra arguments in it. It’s just take the argument, do whatever you want with it. But with the Lambda solid, you don’t get any context, which means you don’t get any logging and there are no parameters. So the next type of solid is just a standard solid that has the context piece that gives you access to different parameters and that gives you the ability to write the logs.
So now let’s make this a little bit more complicated. The first bit that we stumbled on is, how do I get a data frame into this thing without it being really complicated? So, we found that the quickest and easiest way we could do that was just to write this simple read temp table solid. And all this does is this takes the string name of a temp table that you already have in your environment. And this is very simple, this works extremely well in Databricks. It’s easy to configure and it just reads the table and it puts it in the pipeline and you can go on your merry way. You can set that default value to something that you want to reuse. Temp tables go away so you don’t have to worry about it. That works exceptionally well.
So how do I actually run this thing? That was kind of the next thing that we ran into. So, I’ve been used to like, let me write a function, I’m going to return something from the function and I’m going to get that. And that’s awesome. That’s not quite how Dagster works. The middle on the right, you see there’s config and response, and what we need to do there is we need to call the Python API to execute the pipeline. And that’s specific to Databricks because we’re running this inside of a notebook environment. You can run this in many other ways but this is purely for the notebook environment.
So, with that, we execute the pipeline that we defined above, the hello world pipeline, and we pass in the configuration. As you can see, the configuration can get a little explicit because you have a listing of solids and then the names of the solids and then config and then the name of the parameter inside of the config. The good news is that if you don’t pass the configuration in, Dagster throws an error and will give you the schema for it. They also have great support for this inside the UI. It has basically a JSON schema that auto-populates it.
So once we run that pipeline, that response value isn’t the data frame at the end, even if you try to return one. So you have to go into the solid that you want the result from. And originally, this was something that was a little challenging for us, but we’ve learned to adapt and really enjoy it because it allows us to drill into different parts of the pipeline and debug things in a really intuitive way.
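A small library-free sketch may help with the two ideas above: the nested configuration (solids, then the solid name, then config, then the parameter) and the fact that the run returns a result object you query per step instead of a final data frame. Everything here is a hypothetical stand-in; in the Dagster releases of that era the corresponding calls were, as far as I recall, execute_pipeline with a run_config argument and result_for_solid on the result, but treat those names as assumptions and check the documentation for your version.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-in for temp tables already registered in the notebook session.
TABLES = {"hospitals": [{"name": "Hospital A", "beds": 10}]}


class RunResult:
    """Keeps every step's output so any step can be inspected afterwards."""

    def __init__(self):
        self._outputs: Dict[str, Any] = {}

    def record(self, step_name: str, value: Any) -> None:
        self._outputs[step_name] = value

    def result_for_step(self, step_name: str) -> Any:
        return self._outputs[step_name]


def read_temp_table(config: Dict[str, Any], upstream: Any) -> List[dict]:
    # Reads a table by its string name, as the read-temp-table step does.
    return TABLES[config["table_name"]]


def count_rows(config: Dict[str, Any], upstream: List[dict]) -> int:
    return len(upstream)


def execute_steps(steps: List[Tuple[str, Callable]], run_config: dict) -> RunResult:
    result = RunResult()
    upstream = None
    for name, fn in steps:
        # Config is addressed as solids -> <step name> -> config -> <parameter>.
        config = run_config.get("solids", {}).get(name, {}).get("config", {})
        upstream = fn(config, upstream)
        result.record(name, upstream)
    return result


run_config = {"solids": {"read_temp_table": {"config": {"table_name": "hospitals"}}}}
response = execute_steps(
    [("read_temp_table", read_temp_table), ("count_rows", count_rows)],
    run_config,
)
row_count = response.result_for_step("count_rows")
```

Because every step's output is retained, the same result object supports the drill-in debugging described in the talk.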
So, on the bottom, you can just see some of the example output that you’ll get inside of a cell. You can see when the pipeline starts, it has timestamps. You can see what it’s doing. If you write log information, you’ll see that I wrote out reading table. And you can see how long these things take. And that’s one of the other real benefits of a pipeline, is if there’s a certain part of your pipeline that’s killing it, you can figure it out with a pipeline. You can drill into that function, you can fix it.
So, the next slide that I’m going to show to you guys is about custom types. I’ll go through this relatively quickly. If you’re doing custom types, I think that this slide will be really useful for you in the future. Our goal is to be able to encapsulate as much of the logic inside of the class as possible. And we do that so that if we make changes to the parameters for the class, or we make changes to anything else inside of that, it’s all self-contained. And we can write clean unit tests against that class.
But we want to make sure that the parametrization of that class is dealt with in the pipeline because the pipeline should be in charge of allowing people to change different parts of it, allowing people to connect it in different ways. So we want to let the pipeline do what it does well and we want to let Python do what it does well.
So, what we found was that we had a couple parts of this where there was some bleed over between functionality. And those are dealt with with the static methods that I show on line 111 through 123. The first one was we need to instantiate this class somehow inside of a Dagster pipeline.
So, originally, we were doing that over inside of the solid, but then we had to do some different things in terms of getting the solid configuration value for the different pieces. And our concern was that if we changed the required parameters in the class, we would have to figure out which pipeline solids there were, and then we’d have to go change it over there and make sure that it was okay. And if somebody else instantiated it in some other way, we had to catch all of them. And we were just afraid that was going to lead to a lot of issues.
So we realized we could put this factory method inside of the class and pass in the Dagster context and do all of that stuff inside the class. This way we change it once, it’s done correctly and we don’t have to worry about it anymore. Everybody else can use it in the way that it looks on the left.
The two other pieces are kind of similar. So on line 122, we have the Dagster configuration. So the configuration things are on line 112 through 115, or 113 and 114 rather. They need to be defined as configurations inside of Dagster. Again, our concern was that if we changed the instantiation signature in the class, then we’d have to go find all of those solids and all the configurations and make sure those were all updated. We found that again with the static method here, it’s all in one place and we can go with it.
Lastly, the Dagster type, this is the thing that allows Dagster to move things between pieces and we felt like this was the best place to put it. So if you’re working with Dagster and you want to do custom classes, I highly recommend referencing the slide, I think this will help you out quite a bit.
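The slide itself is not in the transcript, so here is only a sketch of the pattern as described: the class owns both its configuration schema and the factory that builds it from a pipeline context, so a change to the constructor is made in one place. ThresholdFilter, config_schema, and from_context are hypothetical names; the context is faked with SimpleNamespace, and solid_config is assumed to be the attribute that carries the step's configuration.

```python
from types import SimpleNamespace


class ThresholdFilter:
    """Plain analytic class that can be unit tested without any pipeline."""

    def __init__(self, metric_col, min_value=0):
        self.metric_col = metric_col
        self.min_value = min_value

    def run(self, rows):
        return [r for r in rows if r[self.metric_col] >= self.min_value]

    @staticmethod
    def config_schema():
        # Single source of truth for what the pipeline may configure.
        return {"metric_col": str, "min_value": int}

    @staticmethod
    def from_context(context):
        # Factory: build the class from the step's configuration.
        config = dict(context.solid_config)
        schema = ThresholdFilter.config_schema()
        for key, value in config.items():
            if key not in schema:
                raise KeyError(f"unexpected config key: {key}")
            if not isinstance(value, schema[key]):
                raise TypeError(f"{key} must be {schema[key].__name__}")
        return ThresholdFilter(**config)


def threshold_step(context, rows):
    # The pipeline step stays thin: construct from context, then delegate.
    return ThresholdFilter.from_context(context).run(rows)


fake_context = SimpleNamespace(solid_config={"metric_col": "beds", "min_value": 5})
kept = threshold_step(fake_context, [{"beds": 3}, {"beds": 8}])
```

With this split, every step that uses the class calls the same factory, which is the concern the talk raises about having to hunt down each place the class is instantiated.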
I feel like I must talk about the Dagit UI and Databricks, because these are also available. So we touched on this display piece, so that’s how you actually get your results in Databricks. You can definitely orchestrate a pipeline locally. The blog post that’s linked here is fantastic. It talks about connecting to different PySpark environments like EMR and Databricks, you can do some great stuff there.
The visualization function I mentioned that was removed in 0.8.0. You can go and copy that. That works just fine. If you’re going to use it in Databricks, you need to do that fancy thing with display HTML and get the SVG. But that’s the code snippet that you can use for that. If you run these locally with the Dagit UI, you get a great UI that you can start playing with. So you can see all of your different functions, how they connect. You can drill into them, you can see logs, you can execute them. You can do a lot of cool stuff, but again, one of our core requirements was that we needed to enable our users to run this inside of a Databricks notebook.
So, in terms of some conclusions, I realized that was pretty quick. We talked about Dagster and PySpark ML Pipelines. So when might I use one versus the other? So, I feel like Dagster sometimes has complicated config, frankly. It’s a little bit challenging to use unless you’re being guided by its environment, but it’s very descriptive. On the other side, on PySpark ML Pipelines, you get more complicated implementation. You end up working basically with the Java API through Python, which adds some complexity to the linters, they get a little upset. So it’s really a pick your poison on that one.
Dagster supports a bunch of different formats. So you can fan in, you can fan out, you can merge, you can do error handling. You can do a lot of pretty interesting stuff with Dagster, where Spark ML Pipelines only offer linear processes. If it’s a linear process, it’s simple, Spark ML Pipelines work great. But if you need to do something more complicated, sometimes Dagster is the right way to go or Prefect.
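To illustrate the difference from a strictly linear chain, here is a small library-free sketch of a workflow with a fan-in (two sources joined) and a fan-out (one joined table feeding two consumers). It uses the standard library's graphlib to order the steps; the step names and data are hypothetical, and this is not how Dagster itself is implemented.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


def run_dag(steps: Dict[str, Callable], deps: Dict[str, List[str]]) -> Dict[str, Any]:
    """Run each step after its upstream steps, passing their outputs by name."""
    outputs: Dict[str, Any] = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {dep: outputs[dep] for dep in deps.get(name, ())}
        outputs[name] = steps[name](**upstream)
    return outputs


def metrics():
    return [{"block": "b1", "jobs": 120}, {"block": "b2", "jobs": 30}]


def geometry():
    return {"b1": "geometry-1", "b2": "geometry-2"}


def joined(metrics, geometry):
    # Fan-in: two upstream outputs are merged into one table.
    return [{**m, "geom": geometry[m["block"]]} for m in metrics]


def total_jobs(joined):
    return sum(row["jobs"] for row in joined)


def map_layer(joined):
    return [row["geom"] for row in joined]


STEPS = {
    "metrics": metrics,
    "geometry": geometry,
    "joined": joined,
    "total_jobs": total_jobs,
    "map_layer": map_layer,
}
# Each entry lists the steps that must run first; "joined" fans out to two consumers.
DEPS = {
    "metrics": [],
    "geometry": [],
    "joined": ["metrics", "geometry"],
    "total_jobs": ["joined"],
    "map_layer": ["joined"],
}

outputs = run_dag(STEPS, DEPS)
```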
Dagster works with any types of objects, where Spark ML Pipelines only work with data frames. And with Dagster, everything must be within the pipeline, that’s what we’re talking about with the table. That’s a blessing and a curse. Whereas with Spark ML Pipelines, you can easily move back and forth.
So when might you use these? Again, if you need to merge or fan or do something really complex, or you need to leverage things outside of data frames like Sedona SQL, or if you’re going to use GraphFrames, or if you’re going to orchestrate something on remote systems, Dagster works really well for that. But if you’re simply just adding, removing or modifying a column on a data frame, that’s where I feel like Spark ML Pipelines can work really well.
And our lesson learned, what we try to do just for us, is to do as little as possible within the workflow framework. We let the workflow framework do what it does best and we let it deal with orchestration and monitoring and parameterization. And then we deal with all of the actual things that are happening inside of that box inside of a class, and that helps those units [inaudible] and deploy them into other places. We’ve just found that that’s what works best for us, you might find something different.
So now let’s quickly jump into two quick examples so we can get to see some pretty maps. So the first example I’m going to go through is a Spark ML example. This is a simple one, it’s a linear process, and this is really about geospatial clustering. We picked a random open data set for this. We were picking some US hospitals. And all we want to do is go through, we want to assign a hospital to a hash. We want to take the beds, we want to sum that up by the hash cell, and we want to know if that hash cell belongs in a cluster.
So, there’s a lot of specific stuff to our library in here that I’m not going to touch on, but I want to just get the flavor of how this works over to you guys. So, we’re going to import the PySpark ML Pipelines, we’re going to get some data. And then here’s the meat of the actual pipeline where we’re going to specify what metric is it that we want to cluster on, what are the transformers that we’re going to apply? So we’re going to take the lat long, we’re going to convert those into H3 hash cells. Then we’re going to zero out the hospitals that don’t have any beds. We’re going to coalesce on the H3 cells because there may be multiple hospitals in the same cell and we need to add those up. And then lastly, we’ll just do some clustering.
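The four steps listed above can be sketched in plain Python. Two substitutions are made and should be read as assumptions: a simple square latitude/longitude grid stands in for H3 cells, and a neighborhood-total threshold stands in for whatever clustering algorithm the speaker's library applies. The hospital rows are made up.

```python
import math
from collections import defaultdict


def assign_cells(rows, size_deg=1.0):
    # Stand-in for H3 indexing: bucket each point into a square grid cell.
    return [
        {**r, "cell": (math.floor(r["lat"] / size_deg), math.floor(r["lon"] / size_deg))}
        for r in rows
    ]


def zero_missing(rows, col="beds"):
    # Hospitals with no bed count contribute zero instead of breaking the sum.
    return [{**r, col: r.get(col) or 0} for r in rows]


def sum_by_cell(rows, col="beds"):
    # Several hospitals can share a cell, so add their beds together.
    totals = defaultdict(int)
    for r in rows:
        totals[r["cell"]] += r[col]
    return dict(totals)


def flag_clusters(totals, min_neighborhood_total):
    # A cell is flagged when it and its eight neighbors together reach the threshold.
    flags = {}
    for (i, j) in totals:
        neighborhood = sum(
            totals.get((i + di, j + dj), 0) for di in (-1, 0, 1) for dj in (-1, 0, 1)
        )
        flags[(i, j)] = neighborhood >= min_neighborhood_total
    return flags


hospitals = [
    {"lat": 46.2, "lon": -119.1, "beds": 100},
    {"lat": 46.7, "lon": -119.8, "beds": None},
    {"lat": 47.3, "lon": -119.5, "beds": 50},
    {"lat": 30.1, "lon": -90.2, "beds": 20},
]
totals = sum_by_cell(zero_missing(assign_cells(hospitals)))
flags = flag_clusters(totals, min_neighborhood_total=120)
```

Each function takes a table and returns a table (or a per-cell summary), so the same steps map naturally onto a linear chain of transformers.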
So what’s nice here is we can define those stages, and just linearly, we can create a pipeline, and then we can fit the data frame. And then from that model, we can use that explain function that I showed in the slides. And what’s great about the explain function is now we can quickly look through here, we can see, oh, these are the parameters, these are the defaults. So if we get an error somewhere, we at least know all the parameters this pipeline is working with.
So, very quickly and easily, we can take that model, we can apply a transformation to a data frame, and then we get a data frame back, which is awesome. And we can show that, and that gives us some values. But because this is geospatial data, very few people get excited about tables. So, we can quickly turn that into a map. There’s a lot behind this map but it’s basically just a [inaudible]. And what this will show us is the H3 cells and which things the algorithm thinks are clusters and which things it doesn’t.
Not going to walk through that in much detail, but I think what’s really cool is maybe you have an analyst that looks at this and says, well, that’s cool but I think that the cell size is too small, or I think you need to change the neighborhood size, or perhaps you need to re-parameterize something else. So, one of the things the pipeline allows us to do is very quickly change that parameter. So we can use the same model, call the transform function, and then provide a different value for the resolution. And what you’ll see here is now the hash size is bigger. And this is fantastic for an interactive workflow where users can go through, they can tune this, they can get their map. And we’ve processed it through the entire pipeline, and we can do that for all the different parameters that are part of this.
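The idea of reusing the same fitted model with a different resolution can be sketched as follows. PySpark's `Transformer.transform` accepts an optional param map that overrides the stored parameters for that one call; the toy class below imitates that behavior in plain Python. The class name, the `cell_deg` parameter, and the sample rows are assumptions for illustration, and the square grid is again a stand-in for H3 resolution.

```python
import math
from collections import defaultdict


class GridSumModel:
    """Toy model: bins points into square cells and sums a value per cell."""

    def __init__(self, **defaults):
        # Stored parameters, analogous to the defaults a fitted model carries.
        self.params = {"cell_deg": 0.5, **defaults}

    def transform(self, rows, overrides=None):
        # Merge per-call overrides over the stored params without mutating the model.
        params = {**self.params, **(overrides or {})}
        size = params["cell_deg"]
        totals = defaultdict(int)
        for r in rows:
            cell = (math.floor(r["lat"] / size), math.floor(r["lon"] / size))
            totals[cell] += r["beds"]
        return dict(totals)


model = GridSumModel()
rows = [
    {"lat": 46.28, "lon": -119.28, "beds": 300},
    {"lat": 46.61, "lon": -119.90, "beds": 250},
]
fine = model.transform(rows)                       # stored cell size
coarse = model.transform(rows, {"cell_deg": 1.0})  # bigger cells, same model
```

With the smaller cell size the two points land in separate cells; with the larger one they merge into a single cell, which mirrors the "hash size is bigger" effect on the map.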
So, that’s the end of that demo, and I’m going to switch over to the next one.
So now we’re going to switch and show a quick example of how we do a similar workflow inside of Dagster. Here’s a proof of concept that we used with the workforce area characteristics data. This is a data set that’s available from the census. It gives us an idea of who works where and what census blocks. Like most census information, there’s a metrics table and then the geospatial information is separate. So we need to join those together and create a map.
So again, I’m going to kind of quickly flow through this. You can slow down or ask me questions offline. To make the visualization work we need Graphviz, so I just go ahead and install that. We do some stuff with Spark, and bring in all of the Dagster type things that we’re going to need, like a pipeline and a solid and things of that nature.
So this is something that we find is very common. People will start by defining their ingestion solids. So, here we show how to do a factory solid. This is a really useful thing when you want to parameterize your solids without adding a lot of extra complexity. So in this case, I’m just doing a simple caching method. So if somebody instantiates this with a specific cache key, they won’t necessarily have to recreate the solid. Here we do the configuration the way that I showed you before, and we just read the data and we have a little bit of business logic in here to get the data frames out. And we do the same thing for the census block. These are examples of solids that would end up inside the library and being reused.
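The factory-with-cache pattern described above can be sketched without Dagster at all. This is a plain-Python illustration under assumptions, not the Dagster API and not the code from the demo: `make_reader`, the cache dictionary, the `state` config key, and the sample rows are all hypothetical.

```python
_READER_CACHE = {}


def make_reader(cache_key, source, columns):
    """Return a reader step, reusing one already built for cache_key."""
    if cache_key in _READER_CACHE:
        return _READER_CACHE[cache_key]

    def read(config):
        # Business logic: keep rows for the configured state, selected columns only.
        return [
            {c: row[c] for c in columns}
            for row in source
            if row["state"] == config["state"]
        ]

    # Give each generated step a distinct, readable name.
    read.__name__ = f"read_{cache_key}"
    _READER_CACHE[cache_key] = read
    return read


# Hypothetical rows standing in for the census metrics table.
wac_rows = [
    {"state": "ca", "block": "b1", "jobs": 120},
    {"state": "wa", "block": "b2", "jobs": 80},
]
read_wac = make_reader("wac", wac_rows, ["block", "jobs"])
same_reader = make_reader("wac", wac_rows, ["block", "jobs"])
ca_rows = read_wac({"state": "ca"})
```

The second call with the same cache key returns the step that already exists instead of building a new one, so a notebook cell can be re-run without creating a duplicate definition.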
What we find our analysts doing often is they’ll write simple solids like this, like, hey, I know that you hashed the values, but we need to explode the information into those hashes. So there’s a lot of nuance with H3 hashes and how that happens and how that happens with administrative boundaries. That’s beyond the scope of this talk. But an analyst may end up writing simple solids like this that’ll help align the hash with the underlying data.
So now we just do a configuration. This is where you may end up integrating something like a Databricks widget or something along those lines to say, hey, if you want to pick California or a specific year, that’s where you go ahead and do it. You can set the resolution. And then we end up with a pipeline function where we use the factory functions to get the solids, and then we can leverage those inside of our pipeline. This can make sense at this point, it’s not too bad, but a diagram says a thousand words. So we’re able to quickly create a diagram that says we took the wac_data and the block_data, we joined it, we hashed it, we distributed the values, we coalesced it and we normalized it.
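The shape of that diagram can be written down as a small dependency graph. The sketch below uses Python's standard-library `graphlib` rather than Dagster, and only `wac_data` and `block_data` are names from the demo; the other step names are assumptions chosen to match the description.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
deps = {
    "wac_data": set(),
    "block_data": set(),
    "join": {"wac_data", "block_data"},
    "hash": {"join"},
    "distribute": {"hash"},
    "coalesce": {"distribute"},
    "normalize": {"coalesce"},
}

# A valid execution order: every step runs after its inputs.
order = list(TopologicalSorter(deps).static_order())

# The edges a diagram would draw, as "upstream -> downstream" strings.
edges = [f"{up} -> {down}" for down in deps for up in sorted(deps[down])]
```

The two ingestion steps have no dependencies on each other, so they may run in either order; everything after the join is a straight line.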
So, we can go through that, we can run that the way that we had talked about in the slides. And then this gives us the data frame that we would expect. But again, like most things, a data frame doesn’t say all of it. So, what does this actually look like when we’re done? So I ran this before and I ran this in two different resolutions because that’s the point of a pipeline. And what I’m showing here is the percentage of people inside of a hash that have a bachelor’s degree on this side, versus the percentage that have less than a high school degree on that side.
And this shows some really interesting things that you wouldn’t necessarily pick up on in a table. You can see that Southern California and the Central Valley have a lower education level for their employment, whereas San Francisco might have a lot higher. But this often doesn’t get us far enough to answer the question. Somebody might say, well, what happens if I zoom in, what if I do this? So the pipeline provides that. We can go through and we can run that again with a different hash level, and very quickly develop a map that looks more like this, where the user can zoom in and they can see the specific regions that have a higher proportion of a bachelor’s or more education, or regions that have less than a high school education.
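The percentages on those maps come from a normalization step: each education count in a hash cell is divided by the total in that cell. A minimal sketch, assuming per-cell counts have already been coalesced; the field names, cell labels, and numbers here are hypothetical and are not the census column codes.

```python
def education_shares(cells):
    """Convert per-cell counts into percentages of that cell's total jobs."""
    shares = {}
    for cell, counts in cells.items():
        total = counts["total_jobs"]
        if total == 0:
            # Skip empty cells rather than divide by zero.
            continue
        shares[cell] = {
            "bachelors_pct": 100.0 * counts["bachelors"] / total,
            "less_than_hs_pct": 100.0 * counts["less_than_hs"] / total,
        }
    return shares


# Hypothetical coalesced counts for three cells.
cells = {
    "cell_a": {"total_jobs": 200, "bachelors": 50, "less_than_hs": 30},
    "cell_b": {"total_jobs": 400, "bachelors": 60, "less_than_hs": 120},
    "cell_c": {"total_jobs": 0, "bachelors": 0, "less_than_hs": 0},
}
shares = education_shares(cells)
```

Normalizing per cell is what makes cells of different sizes comparable, so the same step still works when the pipeline is re-run at another resolution.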
So I hope that was a reasonable demo that whets your appetite a little bit, and we’ll go back to the slides.
Cool. All right, well, thank you everyone for your time. I hope you found that informative. I know that was a really fast whirlwind experience. If you enjoy this stuff and you want to talk more about it, please feel free to reach out and join our big geospatial data meetup. We try to meet the first Thursday of every month at one. Shoot me an email, we’ll get you on the invite list. And I want to say thank you to all the other people that made this possible. We have a great team at the lab that’s been helping out, and Maxim at the World Bank gave us fantastic input on this presentation as well. So, I hope this was helpful and I hope to hear from some of you.
Dan Corbiani is a Data Scientist and Solutions Architect who designs, develops, and deploys analytic solutions for research programs. His primary thrust area is the intersection of large-scale geospat...