The presentation will discuss the need for and deployment of a Databricks-enabled Entity Resolution Capability at the Center for Medicare & Medicaid Innovation (CMMI) within the Centers for Medicare & Medicaid Services (CMS), the federal government agency that is also the nation’s largest healthcare payer. CMMI builds innovation models that test healthcare delivery and payment systems, integrating and parsing huge datasets with multiple provenance and quality.
The instructional narrative of the presentation will be on the Natural Language Processing methods used to handle the scale of the data and the complexity of assuring the accuracy of various data entities across the innovation models. We will explore the specific entity resolution use cases, the ML necessary for this data and the unique uses of Databricks for the federal government and CMS in providing this capability.
Brett Luskin: Hi, my name is Brett Luskin, and thank you for attending this talk on entity resolution. This presentation is meant as a basic introduction to the concepts that drive entity resolution. A few things about me. I’m a data scientist with NewWave Telecom and Technologies. My main focus for the past eight months or so has been building a custom entity resolution model for CMMI, the Center for Medicare and Medicaid Innovation, which is a branch of CMS, the Centers for Medicare and Medicaid Services. NewWave is a full service health IT firm. We have an over 15 year relationship with CMS providing an assortment of IT solutions, including data science activities, like natural language processing and machine learning.
In this project, we partnered with the fine people at Databricks. The project was spearheaded by Luke Bilbro, an expert at resolving entities of all shapes and sizes. And we had a ton of support from Qing Sun, a master of all things Databricks. I want to extend a huge thank you. I’ve really enjoyed learning from you both during our time together.
Let’s get started with entity resolution and I’m going to start by setting the stage with CMS. The Centers for Medicare and Medicaid Services is the federal entity tasked with running Medicare, Medicaid, and CHIP. This makes them the largest payer for healthcare services in the US. It also gives them the largest healthcare data repository in the country. So for example, the integrated data warehouse has over 62 billion records of patient data. And the chronic condition data warehouse is a two petabyte health research database. It’s really unlike anything else available in the private or public sector. Within CMS is CMMI, the Center for Medicare and Medicaid Innovation. This is the department that oversees modernizing CMS, making sure that we bring the best of what technology has to offer to bear on these huge datasets.
Maybe the biggest challenge in the healthcare space is being able to answer the question, what is value, and we use models to help answer that. In a perfect world we could deliver the best health outcomes at the lowest cost with the widest availability. But the reality is that we often find ourselves having to make trade-offs. Medicare and Medicaid exist to cover the gap and availability, but that still leaves us with a trade-off of outcomes can cost. So we need a wide range of measurements to try to maximize outcomes while minimizing costs. And we’re entity resolution becomes critical is that we’ll need to pull data from any database tracking health outcomes, match all the records referring to the same patients, and then also be able to match them to the appropriate claims records. With that context in mind, let’s get started with how it all works.
Here’s the definition of entity resolution, and I have two ways of thinking about it. First, the formal way: the task of disambiguating records that correspond to real world entities across and within datasets. Some of the fundamental tasks of entity resolution are deduplication, record linking, and record canonicalization. I also like defining it in terms of questions. So if we have a dataset, have we seen this thing before? How many distinct things are there? Is this record and this other record actually referring to the same thing? When representing our things, what sort of standard form should we give them? In order to answer all these questions, we get to do some cool data science stuff. As a quick outline, we need to reduce the complexity of the problem, because this is not a linear least scaling problem. We’ll have to use natural language processing and a method called blocking. We’re going to build a classification model to predict whether one record matches another record. And finally, because we want the final output to be groups of records and not just pairs of matches, we’ll have to use graph computing to resolve all the matches into related entities.
The first challenge to tackle is the complexity. The entire set of data we need isn’t the single records in our data. It’s all the pairs of records. So as you can see here, we start with four records. And if we’re trying to naively compare each record to the other, and we end up with six comparisons to make. And if we added one record, we would be adding four comparisons to make. Add another record, and we add five comparisons and so on. So if you have four records, there are six possible pairs, but with 4 million records, we’re up to 8 trillion possible pairs of records, but what’s the big deal? This is the era of big data and parallel processing. Well, if you could process 1 million pairs of records per second, it would take about three months to finish.
So we need a strategy for reducing the amount of comparisons that we’re going to make. There’s several ways to do this, and it’s a little bit dependent on your data. There might be exploitable structures in your data that make it possible for simple rules to reduce the number of comparisons you want to make. But the main blocking method to use is a hashing algorithm. You can use a hashing algorithm by itself or in conjunction with rules, but it’s unlikely you’ll want to use rules by themselves. Using these methods we start generating what we’ll call candidate pairs, pairs of records that we want to classify and resolve into groups.
The hashing algorithm we used for blocking our data is MinHash Locality Sensitive Hashing. If you come away from this talk with anything, I think it should be how powerful this algorithm is at efficiently finding the similarity between text, documents, and even images. While this won’t be a deep dive into all the intricacies, I want to cover the basics in this today to hopefully pique some interest around NLP and what an interesting problem entity resolution is, particularly because of how widespread it’s use case is. The steps we’re going to take in this algorithm are shingling and hashing and locality sensitive hashing. Shingling is the process of dividing up and tokenizing your text. The division depends on the size of your text, so with a name you might divide it up based on a number of letters. If you have a paragraph you might choose to shingle based on a number of words, and you’ll tokenize your shingles by assigning them an ID.
We could just take the Jaccard similarity at this point where we take the number of matching shingles of the two items and divide it by the union of the two sets. I’ll stress that there’s more than Jaccard similarity out there, and we actually use other types of similarity scores in our project for CMS. But I do want to keep this basic and I think Jaccard is something a lot of people are either familiar with already, or if you’re new to the topic, it’s simple to understand. You can see here that we’ve taken one of our patients and combined the first and last name values.
Here’s a matrix of the shingles representing both names, and this is only one way to do it. This is what we call trigram where we shingle based on three characters. You can use any number of characters and that will affect the similarity score. So this is an area where you have to be mindful about what type of text you’re comparing and what type of similarity score do you want to generate. So we have a 50% Jaccard similarity here. You might be thinking that seems kind of low. What’s the deal with that? Well, I like to think about the distribution of similarity scores. I’m going to get across the population. In this use case, the ideal distribution is bi-modal where we have things either extremely similar or extremely dissimilar. And I’ll point out that our data is imbalanced. Most of the scores are going to end up either zero or very close to zero, and this should make an intuitive sense.
The name Mike is not in any way similar to the name John, but Mike and John are both fairly common names. That’s why we can catenate the first and last name, but this disrupts our bi-modal distribution. All the Mikes and Johns will have some level of similarity to each other, but we can add an amount of distinguishing between which Mike and which John are in our data by adding the last name. We also want to catch all the Michaels with the Mikes and all the Jonathans with the Johns. And for people with very common first and last names, that’s why we’re going to do this across every feature. So now we have our shingles and our tokens, but we’re still working with the quadratic complexity. To help with the scale of the problem, we used MinHashing. We start by assigning a fixed length integer signature for each set of shingles.
The assignment is based on the index of shingles in each document. By doing permutations on the index, we get unique signatures based on the shingles. We can compare the number of components that are the same in each set of signatures and divide by the signature length, and that will approximate the Jaccard similarity. There was a probabilistic way to think about it as well that might be easier to digest. When given the union of the two sets of tokenized shingles, what is the probability that if shuffled, the first item in the resulting list will be one of the shared objects in the set? By doing permutations, we can make sure the signatures are unique. The more permutations you have will guarantee uniqueness, but add runtime. So you need to balance the chance of signature collision with the runtime when assigning how many permutations to do. Using men hashing, we’ve reduced the complexity of the problem while still maintaining the similarity of the signatures. But there’s one more step to take to help identify the candidate pairs.
The last step in this algorithm is the locality sensitive hashing part. In my toy example, we have a signature matrix that is two by two, but as we do this across the full name and all our features, it will end up much larger. We divide that signature matrix into bands, and for each band we hash its portion of each column to a hash table. Each hash table will have a set number of buckets. Candidate pairs are the bands that hash to the same bucket. There will be some tuning to do in dividing up the bands appropriately so that we maximize the amount of similar pairs we catch while minimizing the non- similar pairs.
So this happens feature-wise across the data. I hope it’s clear why data cleansing, pre-processing, and feature engineering are critically important. I’ve mentioned Jaccard similarity, and this is a parameter that can be tuned and will need to be adjusted depending on your data. If you consider a piece of data like social security number or driver’s license, these are unique identifiers that we want to match perfectly, while a name or address we can allow for some amount of differences. So what you’ll end up with is a matrix of similarity scores between two records. For classification, the first thing to know about the data is that even after blocking, this is an imbalanced dataset where most of the records are not matches. When building the training dataset, the simple way to deal with this is to over sample for matching records. For our classification model, we’ve been working with gradient boosted trees. This is a point in our project where we’ve been working exclusively in development, so all of our methods have been with synthetic data. We’re currently at a stage of building our training dataset for our production environment.
We use the graph frames connected components package for resolving our groups. This requires an input of two data frames, a nodes data frame, where every unique patient ID is a node, and an edges data frame, where two nodes are in a row with an edge represented by a match score between zero and one of those two notes. One place that can become difficult here is transitive linking. Most of our demographic data is not very unique, race, gender, age. If you live in apartment building, or if you’re homeless or live in a shelter, then even address may not be very unique. People in assisted living facilities often switch apartments within the facility based on their needs. And so with transitive linking, if you have three records and A and B have a strong relationship, B and C have a strong relationship, but A and C have a weak relationship, what’s the correct thing to do?
We’ve put the model in a tough spot, but in most cases, these will end up as all grouped together. So if the data quality is poor, let’s say we just have an address, race, gender, and age. We can end up with clusters of patient records that shouldn’t be part of the same group. We can mitigate this by setting some minimum requirements on our data as to the amount of information that is available. And the other thing is it can be a great area for the act of learning cycle in our classification model if you’re finding that it’s making the same mistake frequently.
To circle back to the project specific to CMS, one of the big challenges is that we can’t use real data in development. Instead, we have to develop synthetic data. Now we know what a good clean record of patient data will look like, but it can be hard to guess all the weird things people might introduce into the data. When we generated the synthetic data, we tried to introduce as many errors as we could think of into the data while also keeping in mind that there would be things we did not expect when it came time for production data. It makes pre-processing and cleansing that much more important because this will have a significant effect on the candidate pairs being generated, which will have further downstream effects on the runtime and performance of the model.
Here’s a reduced representation of some of the synthetic data we used in development. After generating patient demographic records, we introduced nulls, typos, name inversions, and replacements into the data to hopefully account for some of the errors we might expect. With the synthetic data, we generated 300,000 records. If we were to use the naive approach of comparing records, this would be about 45 billion candidate pairs. By using a blocking approach, we ended up with about 4.8 million candidate pairs. The number of candidate pairs generated has a relationship with the data quality. With cleaner data, there’ll be fewer pigmented pairs generated as it’s easier to tell the difference between patients with more information available. Because we knew we weren’t accounting for every type of bad qualities that could be produced. We would expect the 4.8 million candidate pairs generated with synthetic data to be higher when it came time to use production data.
When we received the production data, it was worse than what we imagined. There was a lot of cleanup to do, particularly text cleanup in the address field. In names, we had to remove things like John and Jane Doe, and in social security numbers, we had to remove when it was all the same digit or 123456789. This is pretty standard fare in data science as most of you know, but after we did all that, we were able to reduce our candidate pairs to 4.1 million. Now we have to ask, is this good? We were expecting there to be more based on our synthetic data testing. So our current development is based around creating a framework for benchmarking. We’ll need to test for precision and recall on our blocking algorithm, which we’ll do through random sampling. And we’ll also develop our training set, which we were unable to do with production data beforehand. Even though these are time intensive tasks, I’m excited is this will soon be leading to a critical component in helping CMS fulfill its mission in healthcare.
Thank you for attending my talk about entity resolution. This was a very basic introduction into the topic, and I hope it sparked some interest if you’ve never thought about it before. It’s a much deeper problem than what I presented here. I think you’ll have a lot of fun exploring it further. With all that said, I’ll turn it over now for some Q and A.
"Brett Luskin is a Data Scientist at NewWave, a full-service Information Technology (IT), Data Management and Business Services company delivering state-of-the-art Healthcare IT solutions, products, a...