Chesapeake Regional Information System for our Patients (CRISP) is a nonprofit healthcare information exchange (HIE) whose customers include states like Maryland and healthcare providers such as Johns Hopkins. CRISP’s work supports the local healthcare community by securely sharing the kind of data that facilitates care and improves health outcomes.
When the pandemic started, the Maryland Department of Health reached out to CRISP with a request: Get us the demographic data we need to track COVID-19 and proactively support our communities. As a result, CRISP employees spent long hours attempting to handle multiple data sources with complex data enrichment processes. To automate these requests, CRISP partnered with Slalom to build a data platform powered by Databricks and Delta Lake.
Using the power of the Databricks Lakehouse platform and the flexibility of Delta Lake, Slalom helped CRISP provide the Maryland Department of Health with near real-time reporting of key COVID-19 measures. With this information, Maryland has been able to track the path of the pandemic, target the locations of new testing sites, and ultimately improve access for vulnerable communities.
The work did not stop there—once CRISP’s customers saw the value of the platform, more requests starting coming in. Now, nearly one year since the platform was created, CRISP has processed billons of records from hundreds of data sources in an effort to combat the pandemic. Notable outcomes from the work include hourly contact tracing with data already cross-referenced for individual risk factors, automated reporting on COVID-19 hospitalizations, real-time ICU capacity reporting for EMTs, tracking of COVID-19 patterns in student populations, tracking of the vaccination campaign, connecting Maryland MCOs to vulnerable people who need to be prioritized for the vaccine, and analysis of the impact of COVID-19 on pregnancies.
Steve Dowling: How’s it going, everybody? Thank you for joining our session today. My name is Steve Dowling. I am a data engineer and technical platform lead with Slalom Consulting, and I’m here with my friend Andy Hanks. Andy, you want to go ahead and introduce yourself?
Andy Hanks: Happy to, Steve. My name is Andy Hanks. I work for CRISP, which is the state health information exchange for Maryland, and I lead the data analytic team.
Steve Dowling: Awesome. Thank you, Andy. So we’re here today to tell you about our session, which is going to be how Slalom and Chris worked together to unlock the power of data during the pandemic.
So before we get started, I’m going to go through our agenda. And kicking off with the agenda, Andy is first going to go through the need for a data lakehouse in healthcare. I’m then going to go through the power of a lakehouse, dive into some of the platform architecture, talk through using metadata-driven pipelines, and then finally, Andy will finish off with using data to fight COVID-19. So, Andy, you can go ahead and kick us off and get us started.
Andy Hanks: Excellent. So the challenge of COVID-19 is that we inherited an organization with beautiful microservices based architecture, which meant that we could identify all the healthcare details about any patient. CRISP’s mission is to receive millions of rows of data from all the different hospitals and healthcare organizations around the state and pull that together. The only problem with that is that some of that data, things like real-time hospitalization data and our claims information were stored in remote sites. So when we were pulling it one patient at a time, we were very successful at doing it quickly. When we came under the need to start to match 10, 20, 30,000 patients at the time, it was really very difficult. So that’s when we started this project to build a lakehouse, and we were able to successfully pull hundreds of millions of real-time hospitalization records into the data lake. We were able to pull in millions of patient record information into the lakehouse. And then we were rapidly able to create a use case that matched together that data and produced a daily output.
So when we started off, we had a project that we started in the middle of April and they said, “Well, by early may, can you start giving us matched data on the cases so far?” And we rapidly set the architecture. It was all put in and dev test prov in three weeks. We hit that target of daily feeds to contact tracing.
But in the meantime they said, “Oh, can you do this hourly by the way, because cause we really want to be good at responding quickly to these outbreaks.” So we took on another large data feed with millions of rows of real-time lab data and pulled those into the data lake and use those as the trigger to cover contact tracing.
And just to give you a context of this, there’s the health information exchange. A number of people jumped into this and they took on the task of figuring out how to do the queries in these remote data stores, how to pull them together into Excel spreadsheets, and do all sorts of VLOOKUPs and come up with the answers. But we had five and maybe more people working 8 to 12 hours a day to achieve this.
And at the six, seven week point, when we actually got the hourly contact tracing working, and we’d taken all that business knowledge from their spreadsheets, these users said, “Oh, I can hang up my Excel VLOOKUP spreadsheets for good.” And there was a big party.
But that was the start. We started with matching about 30,000 cases. Now we’re up to 450,000 cases and the lakehouse has just run beautifully. The scaling has been excellent. And I think we’re seeing just slightly extended times from back when we were there, so not a linear scaling, and that’s just been fantastic and hand over to Houston.
And I’ll hand it over to you, Steve.
Steve Dowling: Awesome. Thank you, Andy. So now that Andy talked through some of the common health fair data challenges, I’m going to go through the power of having a lakehouse and how we decided to architect it.
So, going through some of the challenges that Andy said, data being everywhere, no common standardization. When we sat back and really thought about what was a good platform that we could architect to manage some of these common challenges in healthcare data, there was a couple of things that came to mind.
The first one was being flexible, lots of different sources, different source types, sources coming in from all over; how can we be flexible in ingestion? The second thing was scalability. We knew that data was going to basically be growing as the pandemic went on, more tasks were going to come in, more records are going to come in, and we needed our platform to be able to handle that extremely well, to scale up and scale down and to save money when we need to, but also to handle big data sets when we need to.
The next thing is going to be data transformations. We wanted this to be very, very integrated with SQL. So we wanted people that knew SQL that could come in, handle different types of data transformations, and could really make an impact without having to switch their skill sets too much.
And then the final thing would be capable. We wanted this platform to be very capable in the sense that it could do more than just query SQL like a typical SQL analyst was used to, or to also do things like provide visualization, location analysis, and a little bit of reporting as well.
So diving a little bit deeper into this, as far as flexibility goes, when we originally set up this lakehouse platform, we knew that there was going to be sources from all over, so we needed to make sure that we were going to be able to ingest them. So, some common sources that we saw at first that we were able to handle pretty easily with the lakehouse would be SFTP, Azure SQL, on-prem SQL, app insights, and then even ingesting some stuff like CRM systems from Salesforce and Microsoft Dynamics.
So with that system came a couple different file types, and everyone knows file types can be a challenge sometimes. But using the open platform with Spark and all the different connectors, we were able to go ahead and adjust CSVs, different types of compressed CSVs, JSON, Parquet, obviously the standard in Spark, and then we even got to mess around a little bit with COBOL and get some Legacy data to extract value and to join into some modern data, which really helped us grow our platform. So, thanks to open packages like Cobrix, all of this is possible and you’re not limited by just what your platform can ingest based upon the system that you have. That was really awesome.
Moving on to scalability. Scalability obviously with Databricks is great. As a data engineer, it’s usually the hardest thing that you build a workflow and then you don’t know how it’s going to scale up when you 10X or 20X those records. I think the really nice thing about using clusters and scaling up and scaling down is we said, “Okay, we’ll build this for 10,000 tests a day, but if tomorrow we get 100,000 tests, we can just go ahead and tack on compute to that, and we know that the run times are going to be exactly the same. So having that simplicity to scale up and scale down I think has really provided us an effortless transition to go on as the pandemic data has grown.
Moving on to data transformation. We knew that Spark was extremely powerful in handling big data, but I think one of the nicest things about Spark is obviously there’s different ways to use that API. So we have the Pi-Sparked API and we have a lot of familiar Python developers, but I think using the Spark SQL API was super helpful for us because we able to take folks that were more traditionally SQL analysts and bring them into our environment and transition them to data engineers. And they were able to do things like data blending, data enrichment, and I think probably one of the most important things is data de-duplication. So we were able to take sometimes 10 to 20 tests for a single person, find that common source of truth for the single record, and go ahead and push that down to our end consumers so that end consumers were only working with the best and the truest data. Which I think in an environment like a pandemic, that’s really critical to make sure that whoever is being contacted let’s say for a given test is the right person and has the right data and is not being sent multiple notifications. So that was extremely helpful to have the Databricks lakehouse platform for that.
And then as far as the capability goes, I think this is really where it Databricks shines. Like I said, as a SQL analyst, you’d be used to just being able to query data and just do roll-up tables. And it’s great to get insight out of that, but I think being able to do things in Databricks like data visualization is really powerful. So we were able to query some of our data sources and find data source problems pretty easily using data visualization. With that we could find the smoke in the data source and then investigate further to find the fire. So I think having that built-in capability surprised a lot of people with just even being able to visualize your data, what else is possible.
The second thing that I’m really proud of is location analysis. So I think one of the biggest struggles when the pandemic started was having data and having good data. We were able to get a lot of data that had very choppy address information, but we were able to use a Spark user-defined function to hit different geocoding APIs like Google, the census data, and then a bunch of different local APIs, and actually enrich on top of that data and take a little bit of a choppy address information and provide our end users with very solid standardized addresses, location, latitude. And I think this really helped the end users take that data and provide more value with it downstream and take more action against the pandemic.
And then finally, obviously Databricks is a compute platform. So you can attach any library on top of it for whatever language you’re using. We were able to take advantage of a lot of great Python libraries. Matplotlib is already installed in Databricks, and they were able to use stuff like O-365. Using this, we created some automated emailing reports, which was really helpful. Because Saturday mornings happened, and you wanted to make sure all your runs are going good because you’re running pipelines 24/7. So you’re just able to send out an automated email report. So instead of having to log onto the computer, check to see if your pipelines ran well, you can go ahead and just check your email, see that everything ran well, get a report, get a thumbs up and be good to go. So I think we saved a lot of our data engineers a lot of time in the morning by just saying, “Hey, we’ll get a quick report and that’s that, you can move on with your day.” So being able to have that flexibility in the platform to do whatever we wanted to creatively has helped us move along and helped grow the platform and helped get the team really integrated, which I think is great.
So that’s a little bit on the high level over the lakehouse. I’m going to go into our architecture that we specifically developed at CRISP. So moving forward here, we’ll go into the actual platform architecture. Now, I’ll talk to this from more of a high level tools and processes of what we did at CRISP, and if there’s any more detailed questions, I’d be happy to jump them during the Q&A session. But looking at the architecture, what you’ll see is that we have a various set of Azure tools off the bat. I’ll call out Data Factory. In this diagram, we use Data Factory going into all of our Databricks nodes. That’s just to communicate that Data Factory is actually orchestrating our Databricks jobs. We use Databricks for pretty much all of our compute processing, but it is the orchestration tool on top of it to kick off everything.
We also do use Data Factory a little bit for ingestion of sources. So if I go from left to right, on the left side we have all the data sources that we’re going to ingest, and these can be on-prem, public cloud sources, APIs. Pretty much anything under the sun that you can think of is what we had to ingest during this process to get good data. Data Factory, we’ll go ahead and pull that data and put it into what is called our landing zone. And our landing zone is going to be an exact replication of the on source data. So if we get a CSV on source, it’s going to be in the landing as a CSV. And this is really nice, because if anything ever happens we know that we have the source data there, we know that we can rebuild anything downstream by having a nice copy of our on source data.
Moving forward from the landing zone, we move to what is called the raw zone in our system. So the raw zone is going to be pretty much an exact replica of everything we ingest. But this is where we start moving things into Delta format. And this has been super helpful, because we’re able to take all of that raw data, parse it, whether it’s a CSV, JSON, Parquet, other types of formats, and put it into a Delta table and therefore now analysts can query it using standard SQL or write Pi Spark against it, anything that they would like to do, they can start picking up that data. So this is where we’ll continually append every single data that we get coming in every single day. And then the data will live in this raw zone.
From the raw zone we then transition to what’s called the process zone. And this is where I like to say the magic of the data happens. So if we have just appends going on in the raw layer, when we transition to the process zone we’re going to be doing our type two merge style dimensions. Also, any small transformations that happen will happen in this layer. And the point of this being is that if I’m a SQL analyst and I’m pushing back into the process zone, I want to make sure I’m always working with the cleanest, the most updated data, and this is where I’m going to be doing that. So I’m going to be pointing all of my queries back into the process layer.
So you’ll see in the process layer right above there is actually our job state DB. Really quickly to call this up, you can think of this as pretty much just a watermarks table. No actual data is stored inside of this database. We just keep track of every single file that comes into our lake from landing to raw to processed gets made an entry in this table and we track it throughout. So, if failures happen, clusters go down sometimes. If that ever happens, the next time that job runs, it’s going to pick off where it left off, and we’re not going to lose any data throughout our entire pipeline, which is great.
From processed we moved to curated. This is where if you’re a data engineer, SQL analyst or anything like that, this is where you’re going to spend 90% of your time. So this is where we actually take all of that processed data that we have automatically gotten into our process layer and we start working with it, making queries, making different tables. And then we push this data for consumption downstream to our end users and our customers. And this can come in the format as pushing files to SFTP, having dashboards in power BI, pushing tables to SQL databases using Spark connection to SQL. And we’ve even done things that are pretty cool, like update Salesforce using the bulk API, and even update Smartsheets using their API that are built in. Because Spark is just a really easy platform to work with data, and then we have Python on top. We’re able to push this data down to different APIs and let our end consumers consume the data how they feel best and most comfortable. So it’s been really nice having that option and that flexibility to do that, as I mentioned before, and help our end users consume data however they want to.
So that’s more of a high level overview of our platform. Moving on, I want to talk about metadata-driven pipelines. One of the biggest things for us was we knew that we were going to have a lot of data coming in and a lot of different sources coming in. So one of the thoughts that we had when we stepped back was how do we help data engineers get sources in all the way from landing to process as quick as possible? Just being a data engineer and working at previous organizations, I know that typically when a new ingestion has to happen, that’s going to be a two to three week process from when requirements come in, for when the new code is going to be built, connections set up, everything from that entire process that has to happen to get that source to where it needs to be where people can start working with it.
So what we decided to do for CRISP is create metadata-driven pipelines in the sense that once you create an ingestion story for Parquet, if Parquet is coming in in the future, I never have to create more functions or write any snippets of code to get Parquet in again. I already have that code written and standardized and parameterized so that I can just make entries in a metadata file, and the next time that I have to ingest a Parquet source, I can copy and paste the metadata, and then we will just ingest that exact source. So we really took ingestion times down from two to three weeks to two to three hours. If I’m a data engineer, all I need to do is edit metadata, and then I’ll be good to go, that source will land and process once it gets pushed up to production. And then I’m able to have my analyst work with it, or I can create curated use cases with it. So I think this is really been good for us as far as reusability. It’s standardized our processes. And then I think the biggest thing obviously is faster setup time. If it only takes you two to three hours to ingest a source, you spend a lot more time getting insights out of your data than you do just trying to get your data into one place.
The final point with this too, is that it’s built in documentation. I know as a data engineer, documentation is always the last thing on our mind. So it’s nice that we’re actually documenting as we go. Since we’ve ingested roughly a hundred sources at CRISP, we’re able to go in and look at our metadata files and see “this is how each data source is set up, this is each primary key for each data source, and this is how we take it all the way from landing to process. So I think as a data engineer, I’m extremely appreciative of that. It’s saved me from having to do documentation many times. So it’s something that’s really been super useful to us at CRISP, and I think as we move forward it’s going to help the data engineers out there a lot as well.
So that’s a quick snippet on metadata-driven pipelines and how they power our pipelines. I’m now going to pass it back to Andy and you can go ahead and talk about how we use data to fight COVID-19 at CRISP.
Andy Hanks: So the good news for me, this slide is, is I don’t have to explain these use cases too much, because I think most of you know about these use cases from the newspapers. And we talked originally about the very initial contact tracing use case, but I would say we still work that and refine that every day, even now a year later.
So three times a week, we meet with the contract tracing team, the management team that oversees the hundreds of contact tracers using Salesforce to reach out to those who have tested positive and give them instructions and help to prevent further spread.
So one of the things we started with was adding race and ethnicity to the records. So the electronic lab record that comes in, some labs don’t collect it. Some people don’t want to share it. But we can look back over two years of their hospitalization data and fill that in. And that’s really helped the planners to think about being strategic with regard to race and ethnicity, as they’ve rolled out the contact tracing efforts.
The next thing that came in, and it’s aligned to the hospitalization icon below, is that we were able to send the hospitalization information to contact tracing as well. So again, that real-time hospital information we’re able to gather, figure out who’s in the hospital. So a lab record comes in at two o’clock, I send it over to contact tracing at three o’clock. At that point, the patient may be not in the hospital. And then each day we send what we call a [true-up], and that tells you their hospitalization status at that point. So, if they got admitted the next day or three days later, in contact tracing it would pop up and say they’re in such and such a hospital, admitted on this date. And then we’d also give them the discharge date they went home later.
And we continue to add information like that. We added phone numbers from the CRISP master patient index. So again, the phone number that came in on the lab test that came in may have been mistyped, but what we do as we go back into our master patient index, say we’ll look at the latest phone numbers, and push those through the contact tracing. That’s been super successful. I think moved from something like 70% connect rate to a 95% plus connect rate, which is huge to get to people the first time.
And then we’ve continued to add value as folks have thought of new ways to identify the outreach. Obviously, now we have vaccination reporting data, both we’ve been able to pull that vaccination data in, and produce the reporting that helps those who plan, so those who are planning to roll out mass vaccination centers by location, by volume, they’ve been able to do that in part because we’ve been able to rapidly produce this reporting and analysis for them.
And then of course we’ve now updated the contact tracing use case with here’s the vaccination status of this person who tested positive. And they’ve gotten even to the next level now where we’re looking for people who are being reinfected, so they’re getting COVID again after 90 days. Thankfully, very, very few. And then we’re also looking at what we call breakthrough cases, where people have been fully vaccinated but still tested positive. Some of those people, they just test positive because they test positive and it’s not a new infection, and others test positive and there’s something going on. And it allows us to go sequence both the person and the virus, and continue to drive down the spread of this infection.
And across the board, we’ve been able to take very large volumes of data, so our lab data is billions of rows over the last five years; we’ve been able to pull that into the lakehouse and pull that together for research studies. And overall, we’ve been able to take that big historical load and then have a daily ingestion and keep all of these different sources up to date in the lakehouse so that we feel like we’ve got near real-time analysis, everything’s as of last night, and we’re able to run things through where we need to. For the hourly contact tracing, we reach back to the live data as well.
And I think just to finish up, here’s the volumes that we’re dealing with. And we often deal with these daily because there’s deletes in the data. And just a great project to be part of. And we’d love to answer your questions.