Using Bayesian Generative Models with Apache Spark to Solve Entity Resolution Problems (DeDup, Merging, Uniqueness) at Scale

As the volume of data generated grows exponentially in industries such as healthcare, insurance, and financial services, a common and challenging problem across these verticals is how to effectively and intelligently identify duplicate or similar profiles that belong to the same real-world entity but are represented in the organization’s datastore as different unique profiles. This can happen for many reasons, from companies being acquired or merging, to users creating multiple profiles, to streaming data coming in from different marketing campaign channels. Organizations often wish to identify and deduplicate such entries, or to match up two records in their datastore that are nearly identical (i.e. records that are fuzzy matches). This task presents an interesting challenge from the standpoint of computational complexity: with a very large dataset (more than roughly 10 million records), a brute-force pairwise comparison has quadratic complexity and is in most cases not feasible from a resource and time perspective. As such, different approaches have been developed over the years, including those that use regression, machine learning, and statistical sampling, among others. In this talk, we will discuss how we have used a Bayesian statistical sampling approach at scale to match records, combining KD-tree partitioning for efficient distribution of datasets across the nodes of a Spark cluster, attribute similarity functions, and distributed computing on Spark.

Video Transcript

– Hello everybody, my name is Timo Mechler, and with me today is Charles Adetiloye, and we’re gonna be talking to you about using Bayesian Generative Models with Apache Spark to solve entity resolution problems at scale. Before we get started though, a little bit about us and our company. So Charles and I both work for MavenCode, and we are an artificial intelligence solutions company located in Dallas, Texas. We primarily do training, product development, and consulting services in three major areas: provisioning scalable data processing pipelines and cloud infrastructure deployment; development and deployment of ML and AI platforms; and working with IoT, edge computing, and big data from sensors.

Just a little bit about us and our background. Charles is our lead ML Platforms Engineer at MavenCode. He’s got well over 15 years of experience working with and building large-scale distributed systems and applications. He’s worked with quite a number of companies, ranging from startups to large Fortune 500 companies. I lead our client engagements at MavenCode and also work with our solutions team. Prior to MavenCode, I had close to a decade of experience in the energy commodities world as an analyst and strategist working with energy traders. All right, with that out of the way, what we’d first like to do is really just define our goal and the problem that we’re trying to solve. And that goal is to identify profiles or records that belong to the same user entity that may be occurring in multiple places or across multiple data sources. So for example, we’ve got listed here three records that may be in three different database systems, and as you can see, they may be matching up, they may be duplicates, they may not be, but the idea is to match those records, to link those records together, and to be able to deduplicate any duplicates that may be occurring. So let’s move forward and talk a little about why data duplication occurs in the first place.

Why data duplication occurs

So one of the reasons is that you just have incomplete profile information, incomplete attributes or even missing attributes, leading to duplication. Sometimes you’ve got data coming from different sources and then coming together in a new system. Take, for example, one company merging with another or acquiring another company and trying to combine IT systems: you will then potentially end up having the same data from multiple sources, and you’re trying to clear that up and remove duplicates. You might be dealing with data coming in at different resolutions at different time periods, and then accidentally ending up with multiple recordings at different times. And finally, it sometimes even happens that you just have unavoidable errors. Someone typing in a name or an address incorrectly, users making multiple accounts in systems, all these things can happen and lead to data duplication.

Data deduplication challenges faced today

So some of the challenges that we face today when we’re trying to deduplicate data from different sources are, number one, there’s more and more data out there today being generated by more and more people and devices. And just the sheer size of the datasets that we’re dealing with is increasing the complexity of the problem. Then there’s trying to combine legacy and more modernized IT systems and processes, for example, combining on-premise systems with modern architecture in the cloud and trying to do deduplication and linkage there. And then frankly, just building a scalable system, period, for doing the deduplication is challenging in itself. How do you build a system that looks across millions of records in an efficient, scalable way, that’s also affordable and completes the task in a reasonable time? And finally, how do we handle situations where we have inexact, or what’s known as fuzzy, matches? For example, I’ve got the name John Smith with an h and then Jon Smith without an h. Those could potentially point to the same record or they could be separate. So how do we deal with that? So those are some of the challenges that we face today.

Where do we see this problem?

And we really see those problems across a variety of different industries. Hospitality, healthcare, insurance, even in fighting fraud, social media, and perhaps something that’s more relevant today, community health and contact tracing, something that with the recent COVID-19 pandemic is getting more and more attention. So really, this is a widespread issue that we see across a variety of industries.

Current solutions for deduplicating data

All right, so let’s start talking about solutions. We’ve identified the problem and talked about our goal. So how do we go about deduplicating data? Well, when we think about it, one of the most basic or maybe naive things you can do is to say, okay, I’ve got two data sources, and I’m just going to compare every record out of data source A with every record out of data source B. So that’s pretty easy to set up, right? But the problem is that it doesn’t really scale, does it, if you’re comparing every single record with every single other record. So the challenge is that this quickly becomes infeasible as the number of records grows. Plus, you have to do additional work to deal with fuzzy, inexact matches.

So people have realized this, and have looked to machine learning and other methods to try to improve this process. And so, looking at the different types of techniques, I wanna talk about just a couple of supervised learning techniques. And I’ll just cover these at a high level. I’m sure these slides will be made available afterwards, so you can look at them in more detail. One of those that folks might be familiar with is a deterministic method, logistic regression, where we choose our attributes or columns first, and then we run our regression against our labeled data. We end up getting our regression coefficients, which turn out to be our feature weights, the weights on our features or attributes. And finally, once we have those, we have our model, we run it against additional data and calculate a total weight, and if that weight’s above a certain threshold, we consider a pair of records, a candidate pair, a match. If not, then it’s a non-match. So this is fairly easy to implement as well. There are lots of libraries and tools available that support logistic regression. The challenge here, though, is that you need training data.
It’s a supervised approach, and that training data may not be available as the dataset scales. The other thing is you may have to fine-tune this. You may have to fine-tune the weights to get the performance that you’re looking for.

Going on to the next supervised learning approach: this one is a probabilistic approach that you may be familiar with, called a Naive Bayes classifier. Again, we start here with our attributes and our features. But then, instead of running a regression, the Bayes classifier calculates two types of probabilities: a U probability, that an attribute in a non-matching candidate pair agrees, and an M probability, that an attribute in a matching pair agrees. Okay, so you’ve got these two probabilities, U and M, for every attribute. And based on those, you end up calculating your weights as well, match and non-match weights. And once you have those coefficients, those weights, you apply them to the data. And again, once you have a total weight, if it’s above a certain threshold, the pair’s a match. If it’s not, it’s a non-match. So this requires a little less tuning, perhaps, than logistic regression, and less intervention. But you still need the training data, right? You still need lots of training data, which may be hard to come by as you scale up, and this also has limitations running efficiently and quickly enough as you scale up to larger and larger datasets.

So, those are supervised approaches. Let’s now talk about one unsupervised approach. And this is also a fairly naive way to solve this problem, using something called K-Means clustering, which you may be familiar with. So, we take our entire dataset, our entire set of candidate pairs, and split them into two sets, two buckets, matches and non-matches. Then we choose a mean or a center for each bucket, or have the algorithm do this for us.
Then after that, each candidate data point is assigned to the closest center or mean, then the means or centers are recalculated, and this process repeats. The data points are assigned to the closest mean or center, and then the centers are recalculated. And this continues until there’s no more shuffling around of candidate pair data points, at which point we say the algorithm has converged and we stop. So really the only advantage here is that, yes, this does not require training data, but there are some pretty big cons here. For one, and you might have already figured this out, it depends pretty heavily on the initial means or centers that you choose. So the number of matches, or the performance, may vary from run to run. And so the convergence at the end may not actually be optimal. So generally, this is probably better suited for the initial assignment of partitions rather than matching up candidate pairs. But nonetheless, it’s an unsupervised approach for trying to link up data.

Now that we’ve talked about a couple of approaches, let’s talk real quickly about improving performance when running these algorithms. One way to improve performance is to introduce what’s known as a blocking key, to really help reduce your search space for a match. So let’s look at this example here that I’ve put up. I’ve got a table of users: I’ve got the first name, last name, the gender or sex, date of birth, and the address. And I’ve decided to split that apart by the gender or sex into two separate tables. I’ve decided to use the gender as my blocking key. The assumption here is that gender is probably something that we’re gonna have mostly right, that it’s not going to be distorted, meaning that in all the records that exist, that’s generally a pretty high-quality attribute that’s gonna be right most of the time.
Then if I am given a new record of, say, John Smith, and that record’s gender is male, I don’t have to search the entire record space; I only have to search those records that are male. So that increases my efficiency quite dramatically. But again, there’s a caveat here. Not everything makes a good blocking key. Generally categorical attributes are better: gender, or perhaps birth month would be another one. String-based attributes, first names, last names, may not be as good. The key is for it to be a low-distortion attribute, one that has a pretty high probability of being accurate. So that’s one way to improve performance.

And then looking specifically at inexact or fuzzy matching, the way people have dealt with this is to use various string comparison methods to have a way to quantify how close two strings are. For example, how close two first names are, how close two addresses are, how close two last names are. So I’m just gonna walk through a couple of those string comparison methods. One that’s very well known is the Levenshtein distance, which is really just the number of single-character edits required to change string A into string B. I’ve given a couple of examples there: John to Jon is pretty close and only requires one edit; John to Jack is a little bit more involved and requires three. Related to the Levenshtein distance is the Jaro distance, which is another way to calculate the distance between two strings A and B. This one, however, being normalized between zero and one, is a little bit more involved, and you can see that in this case Jason and Jasno, with just one transposition of the characters, is actually pretty close at 0.933. Then we look at common subsequences. So for example, Mike and Michael share a pretty decent common subsequence. Tim and Timo also share three letters there; they don’t have to be consecutive. And then something like Frank and Bob isn’t very similar at all. So this also gives us an idea of how close two strings are.
And finally, looking at just the phonetic pronunciation. Ashley with a y versus Ashlee with two e’s. Leigh or Lee, the different spellings, just to see how similar those strings are when they’re sounded out. So this is a way, again, to figure out how close two records might be when you’re dealing with inexact or fuzzy matching. Okay, having said that, I’d like to shift now into our approach.
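
The blocking key and the Levenshtein distance described above can be sketched together in a few lines of Python. This is a minimal illustration only, not the speakers’ implementation: the record layout, the `candidate_pairs` helper, and the "at most one edit on the first name, same last name" rule are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_pairs(records, blocking_key):
    """Yield record pairs only within the same block (hypothetical helper)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_key]].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


# Toy records, invented for illustration.
records = [
    {"id": 1, "first": "John", "last": "Smith", "sex": "M"},
    {"id": 2, "first": "Jon", "last": "Smith", "sex": "M"},
    {"id": 3, "first": "Jane", "last": "Smith", "sex": "F"},
    {"id": 4, "first": "Jack", "last": "Jones", "sex": "M"},
]

# Blocking on sex leaves 3 pairs to compare instead of all 6.
pairs = list(candidate_pairs(records, "sex"))

# Assumed matching rule: first names within one edit and identical last names.
close = [(a["id"], b["id"]) for a, b in pairs
         if levenshtein(a["first"], b["first"]) <= 1 and a["last"] == b["last"]]
```

With a low-distortion blocking key the number of comparisons drops sharply, at the cost of never comparing records whose key happens to be wrong, which is why the choice of key matters.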

Our Approach

So we decided to perform distributed record linkage with Spark using a Bayesian Generative Model. And I wanna give credit here where credit is due. This approach is actually based on some research that was done. So I’ve got a link at the bottom of the slide to the research paper and the work that was done, and also to the resulting open source implementation of that research, which we were able to leverage and then build upon. So that’s the approach we were able to take. Some of the key features and advantages of this approach: one of the best things is that it’s completely unsupervised. It does not require large amounts of labeled data, so that won’t limit us in any way. It supports both categorical and string attributes, which allows us to use those comparison functions I just discussed. The uncertainty is preserved between the different stages of this approach, and the nice, cool thing about using Spark is that it’ll scale across multiple compute nodes, allowing for distributed computation. So we’re gonna be able to get the scale to work with much larger datasets. Now, I’d like to just take the next couple of slides and describe the approach, and then I’m gonna pass it on to my colleague, Charles, to talk in a little bit more detail and also show a demo.

Approach description

So, very similar to the other approaches I’ve described, we have to start with figuring out our attributes, the columns or features that we wish to use. Preference is given to the ones that have lower distortion, as that will simplify the calculation. We’re going to split our dataset into a number of partitions, where the number of partitions should be proportional to the number of Spark workers that we’re gonna be using. And there’s a little bit of a trade-off between having balanced partitions, meaning partitions with the same number of entities, and having entities that are close, that might be potential matches, in the same partition. So that must be balanced out to minimize the communication overhead. And finally, there’s a configuration file that has to be created, which includes the model’s hyperparameters, the definitions of our string similarity functions, their thresholds, sensitivities, the length of our chain, and also the number of entities and records that we expect.

Just a real high-level, quick overview of how this works. So for each partition, we pick a given entity that’s linked to a record. Then for each of the attributes of that entity, we figure out the distortion indicator first, whether that attribute is distorted. If it is not distorted, zero is returned and the attribute is copied into the record; if it is distorted, we draw an attribute value out of the distortion distribution, and then we also determine whether we’ve seen that attribute before. And so this goes on for every single attribute in that entity-record link. And then once that’s done, this gets repeated for all the other entity-record links that are across the different partitions. And once that’s done, things are shuffled around: entities and records are moved around the different partitions. Summary statistics are calculated, including new distortion probabilities, and then the entire process may repeat a number of times based on the length of the linkage chain that you want. I’m sure that was a little confusing and very high level. I’m gonna pass it on to my colleague Charles, who’s gonna take it from here and talk about this in more detail and also show a demo of how it works. – Yeah, so for us to be able to solve all these problems, we need to be able to model the entity. And to do that we came up with a probabilistic model based on the reference implementation that was done by Neil and his team, and a big shout-out to them again.

Probabilistic Entity Model Representation

So they built a framework, which is open source, and that helped us build this platform and extend what it does. And we’re gonna be sharing the link for this at the end of this talk. So the first thing I wanna be able to do is to represent the entity. So we need to do probabilistic modeling that represents a user profile. So we have the entity profile: every user will have a profile, and that profile has a bunch of attributes. So those attributes are what define the entity.

The next thing after that is we wanna be able to partition things and group things that are similar together. That’s the first way of logically thinking about grouping the entities that we come up with. But the advantage of this approach comes from the fact that we’re doing probabilistic sampling. We may start with a bunch of entities in the same partition bucket, but over time, after multiple rounds of sampling and running through multiple iterations of this process, they may end up in another partition, which will more or less represent the kinds of entities that are related to each other.

The next thing that we’ll look at is the attribute distortion: what’s the variation in the attribute that defines the entity? And how is this different from what we think or know is the ground truth?

And finally, we represent the entire entity as a record that we can basically combine together, and that entity record, with a user profile and its attributes, defines the entity.
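
The entity, distortion, and record pieces described above can be sketched as a small forward simulation. This is a hedged illustration, not the open source implementation the speakers build on: the `Entity` and `Record` classes, `generate_record`, and the probabilities are hypothetical, and the real model infers the links by sampling rather than generating records like this.

```python
import random
from dataclasses import dataclass


@dataclass
class Entity:
    entity_id: int
    attributes: dict  # assumed "true" attribute values


@dataclass
class Record:
    record_id: int
    entity_id: int    # the entity this record is linked to
    attributes: dict  # observed (possibly distorted) values
    distorted: dict   # attribute name -> 0/1 distortion indicator


def generate_record(record_id, entity, distortion_prob, distortion_dist, rng):
    """For each attribute, draw a distortion indicator. If it is 0, copy the
    entity's value; if it is 1, draw a value from the distortion distribution."""
    values, flags = {}, {}
    for name, true_value in entity.attributes.items():
        is_distorted = rng.random() < distortion_prob[name]
        flags[name] = int(is_distorted)
        if is_distorted:
            candidates, weights = zip(*distortion_dist[name][true_value].items())
            values[name] = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            values[name] = true_value
    return Record(record_id, entity.entity_id, values, flags)


# Invented numbers: first names are distorted 30% of the time, sex never.
entity = Entity(1, {"first": "Charles", "sex": "M"})
distortion_prob = {"first": 0.3, "sex": 0.0}
distortion_dist = {"first": {"Charles": {"Charlie": 0.9, "Timo": 0.1}}, "sex": {}}

rng = random.Random(42)
sample = [generate_record(i, entity, distortion_prob, distortion_dist, rng)
          for i in range(100)]
```

Preferring low-distortion attributes, as the talk recommends, corresponds here to attributes whose distortion probability is close to zero.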

Implementation Pipeline

So on a high level, whenever you’re trying to solve the deduping problem, or record linkage problem, the architecture looks more or less like this. You have all these data sources, common data sources where you have all your datasets siloed, and sometimes entities in data source one and data source two may be duplicates. Or you may have multiple records spread across different data sources, duplicated all across your enterprise. So the first thing we’ll try to do is to basically combine all the datasets. So we’ll create a global schema that represents all the common attributes, and where the attributes overlap, we’ll try to normalize them and basically represent them with a common name. So let’s say in database one you have your attribute represented as first underscore name, and in database two it’s just name. So we’ll try to agree on the same tags to represent what’s common to all these attributes. So the next thing we do after that is a union. We combine the datasets, similar to what Timo talked about earlier, so that everything that is similar, that looks alike, ends up in a common column. Then the next thing we do after that is to try to devise a way to partition this data. And that’s where the power of Spark comes into play. With Spark, you have a distributed programming model, where you can shuffle things across different executor nodes. So our goal is to be able to partition our datasets so that we can basically push each of these partitions to a different executor node. And for us to do that, we need to choose the right partition function, one that allows us to evenly divide the data and shuffle it across these nodes. So our partition function in this case may be based on an attribute, like Timo said earlier, that allows us to evenly distribute records across all these executor nodes.
That way, no one node in the Spark cluster is doing more than any other node. And eventually these records get shuffled as well, because the way we’re doing this, we run a series of sampling steps and check to see whether particular record sets match, whether they are similar to each other. And if they’re not, then we push them to another node in the Spark cluster. And this whole process continues over and over like that. So for us to do that, we basically extended the partition function trait that defines the attribute that we wanna use to do the split. And once we do that, we can create almost evenly split datasets that we can now push across all the nodes.
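
The pipeline steps above (agree on common attribute names, union the sources, then partition on an attribute) can be sketched in plain Python. This is an illustrative stand-in rather than Spark code: the alias table, the source names, and the hash-based `partition_by` function are assumptions for the example, not the partition function trait the speakers extended.

```python
import zlib

# Hypothetical column aliases: each source's column name maps to one agreed name.
COLUMN_ALIASES = {
    "first_name": "first_name", "fname": "first_name",
    "last_name": "last_name", "surname": "last_name",
    "dob": "date_of_birth", "date_of_birth": "date_of_birth",
}


def to_global_schema(record, source):
    """Rename a record's columns to the global schema and remember its source."""
    unified = {"source": source}
    for column, value in record.items():
        key = column.lower()
        unified[COLUMN_ALIASES.get(key, key)] = value
    return unified


def union_sources(sources):
    """Combine the records of every source into one list under the global schema."""
    return [to_global_schema(record, name)
            for name, records in sources.items()
            for record in records]


def partition_by(records, attribute, num_partitions):
    """Assign each record to a partition using a stable hash of one attribute."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = str(record.get(attribute, ""))
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


# Two toy sources with different column names for the same attributes.
sources = {
    "db1": [{"first_name": "John", "last_name": "Smith", "dob": "1980-01-01"}],
    "db2": [{"fname": "Jon", "surname": "Smith", "DOB": "1980-01-01"}],
}
combined = union_sources(sources)
partitions = partition_by(combined, "date_of_birth", 4)
```

Because both toy records share a date of birth, they land in the same partition, which is the property a partition function needs so that likely matches are compared on the same node.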

Bayesian Generative Model

So a little bit of a refresher: the Bayesian Generative Model is basically a way of doing probabilistic modeling. Once we have the entities that we’re trying to determine are similar or not, we try to use the Bayesian approach. And what’s a Bayesian approach? The Bayesian approach is a way of modeling the behavior of an entity based on all the prior observations that we’ve had about entities. Like, what’s the behavior of these attributes? What’s the likelihood that an object is gonna be in a certain state, based on what we’ve known previously about the object, and things like that.

So, mathematically, we can basically represent it with this equation, where x0 is the things that we knew before, P is the probability distribution, and x1 is the posterior observation. That’s what we’re trying to predict in the future about the entity.
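
The slide’s equation is not captured in the transcript. Assuming it is the standard statement of Bayes’ rule written in the speaker’s notation, with x_0 the prior knowledge and x_1 the quantity to be predicted, it would read:

```latex
P(x_1 \mid x_0) = \frac{P(x_0 \mid x_1)\, P(x_1)}{P(x_0)}
```

Here P(x_1) is the belief about the entity before looking at the data, P(x_0 | x_1) is the likelihood of what was observed under that belief, and P(x_1 | x_0) is the updated (posterior) belief.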

So let’s use this case as an example. So Timo and I are colleagues, and sometimes people try to send us emails or just send information across to us, and this is a representation of the probability, the likelihood, of someone getting my name right or mixing my name up with Timo’s. So you can see the representation where I’ve plotted it. So my name is Charles, and some people call me Charlie. So the probability of people calling me Charlie is a little bit high, whereas the probability of someone calling me Timo is very, very low. Likewise, the probability of calling Timo Charles is low. But the probability of calling Charlie Charles is very high. So over time, with this probability distribution, we can observe and see the relationship and the likelihood between two attributes, and you start to determine whether the attributes refer to the same thing, or to something that is not really similar. So in that case, we consider multiple attributes in parallel, and we try to see over time: does the first name, does the last name, does the date of birth, does the address and all those things show strong enough similarity that we can say this is likely the same person?

So, we can represent this mathematically as the big matrix that you see right there. And this is basically what we try to fit into an RDD, and what we use to model the behaviors and the relationships between our entities and observed attributes, to see if something is similar or not. So Charles against Charles has a very strong correlation, and that’s logged in the matrix. You see a strong correlation along the diagonal. And you can see that off the diagonal, the correlation is a little bit lower, and things like that.
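
A matrix of this kind can be sketched with the standard library. This is an assumed construction for illustration only: the speakers do not say how their matrix is built, so the `difflib`-based similarity, the exponential weighting, and the `sharpness` parameter are all hypothetical choices.

```python
import math
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distortion_matrix(values, sharpness=5.0):
    """For each true value (row), a probability distribution over observed values
    (columns) that puts more mass on similar strings. `sharpness` is a made-up
    knob controlling how quickly probability falls off with dissimilarity."""
    matrix = {}
    for true_value in values:
        weights = {observed: math.exp(sharpness * similarity(true_value, observed))
                   for observed in values}
        total = sum(weights.values())
        matrix[true_value] = {observed: w / total for observed, w in weights.items()}
    return matrix


names = ["Charles", "Charlie", "Timo"]
matrix = distortion_matrix(names)

# Print the matrix row by row; the diagonal entry is the largest in each row.
for true_value, row in matrix.items():
    print(true_value, {observed: round(p, 3) for observed, p in row.items()})
```

In this toy version, Charles is most likely to be observed as Charles, somewhat likely to be observed as Charlie, and very unlikely to be observed as Timo, matching the pattern described for the slide.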

Partition function: KD-Tree partitioning

So, our partitioning. Basically, before we partition our data, we start with a big dataset, and we can represent all our entities as the blue dots that you see out there. We basically use a KD-tree to partition the data, which allows us to partition our datasets along different dimensions, so that everything that falls into a particular space or into a particular separation bucket can be grouped together. And we can shuffle those groups to each node in our cluster. So we start with the partition approach you wanna take, and depending on the problem you’re trying to solve, you can decide to go with a different partition scheme. But over time, we’ve found that the KD-tree proves a lot more optimal for us for the kind of record linkage and deduplication problems we try to solve. So once we do that, we push all the datasets to the nodes. And each dataset is now almost evenly distributed. We don’t guarantee that the datasets will be evenly distributed. But if you choose the right partition key, things get distributed evenly enough for you to be able to run your record linkage.
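
The idea of KD-tree partitioning can be sketched as a recursive median split. This is a generic numeric version written for illustration, under the assumption that entities can be placed as points; the implementation the speakers use partitions on attribute values, and its details are not given in the talk.

```python
def kd_partition(points, depth, axis=0):
    """Recursively split points at the median of one axis, cycling through the
    axes, and return the leaf buckets (2 ** depth of them for enough points)."""
    if depth == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    middle = len(ordered) // 2
    next_axis = (axis + 1) % len(points[0])
    return (kd_partition(ordered[:middle], depth - 1, next_axis)
            + kd_partition(ordered[middle:], depth - 1, next_axis))


# A toy 8 x 8 grid of "entities"; two levels of splitting give four buckets.
points = [(x, y) for x in range(8) for y in range(8)]
buckets = kd_partition(points, depth=2)
sizes = [len(bucket) for bucket in buckets]
```

Splitting at the median keeps the buckets close to the same size, while points that are near each other tend to stay in the same bucket, which is the balance-versus-closeness trade-off mentioned earlier in the talk.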

Entity resolution & attribute similarity measures

So this is a high-level overview of the Spark cluster. So let’s say we have a five-node cluster. And we’ve basically pushed our partitioned dataset out to it, with all the attributes and the entities. And these entities, initially, are grouped together based on the common categories that we use in the partition function.

So to zoom in a little bit, let’s look at node one in the cluster. So let’s say, hypothetically, I have three entities on that node.

If we zoom in a little bit more, this is the profile of a particular person. So we can see that these profiles, which represent three different individuals, have a bunch of attributes. So our goal is to try to see if the attributes that represent each of these profiles are similar enough that we can collapse them and say, okay, this guy is still the same guy. So right here, what we do is extract the entities. So remember when I showed the probabilistic model early on, where we have entities that represent each record that we have. So then we look at the attributes, and we’ll try to find a distance, using the similarity distance metrics that Timo talked about earlier, and depending on the attributes, there are different ways of measuring the distance. You’re trying to measure the distance between them: is Bob the same name as Rob?

Is Ashlee, with l-e-e, the same thing as Ashley, with l-e-y, and things like that. And it could be geospatial as well, where you’re trying to find out if the John Doe who lived at this address is still the same John Doe who lives 10 miles away, and things like that. So we’ve been trying to do correlation by address, and things like that. So we have all these sets of attributes, and we’re trying to see how strongly correlated they are. So we’ll run them through a process that basically checks and sees how strong a correlation we have.
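
For the geospatial case, one common way to measure how far apart two addresses are, once they have been geocoded to latitude and longitude, is the haversine great-circle distance. The sketch below is an assumption about how such a check could look; the talk does not say which distance function is used, and `within_miles` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
KM_PER_MILE = 1.609344


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_miles(point_a, point_b, miles):
    """True if two (lat, lon) points are no more than `miles` apart."""
    return haversine_km(*point_a, *point_b) <= miles * KM_PER_MILE


# Two points 0.1 degrees of latitude apart are roughly 11 km (about 7 miles) apart.
same_area = within_miles((0.0, 0.0), (0.1, 0.0), miles=10)
```

A distance like this can then be turned into a similarity score for the address attribute, in the same way the string distances are used for names.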

So, once we do that, one thing we find is that, after multiple rounds of sampling in this statistical modeling process, you see that all the attributes that are similar cluster together, and all the attributes that are not basically stay out of the cluster. So the advantage of this process that we’re building on is that we can broadcast or push out entities that are not really linked on node one to another node. So with that, maybe there’s a record on that node that is similar to this one and they can cluster together, but if not, it keeps getting shuffled all around through our cluster. So we have a bunch of similarity measures that we use, and some of them are Levenshtein, Jaro, Metaphone, and so on.

Attribute Similarity measures

And in some cases, we use geospatial attributes, where we’re trying to see how far different entities are from each other based on the physical distance in real life. We have a dataset that has been pre-prepared. I don’t wanna go through the process of merging and creating the master dataset that contains all the entities that we’re trying to dedupe. So this dataset basically contains a bunch of records that are pretty much duplicated. And we’re now trying to load it up in a data frame. So now you can see what’s in the dataset. So it’s basically a sample dataset.

It’s pretty much a dataset that contains the first name, last name, date of birth, address, and postcode, and I think that’s about it. I think it’s from Australia.

So what we’re trying to do is to basically identify duplicate entities by looking at the attributes, try to filter out the duplicates, and be able to show you which entities we think are the same thing in real life, even though they’re represented slightly differently in the database because of attributes that don’t really match. So basically I loaded up the data frame from the Spark CLI.

So the next thing I’m doing right now is basically, I’ve picked up the dataset, and I’m running it through the dedupe code, which basically tries to look for the duplicates.

We’re basically loading up the data that we prepared. And we’re running it through the Spark job that basically loops through the data frame that represents the objects, I mean, the user attributes, and we’re gonna partition it and shuffle it across all the different nodes in the cluster. Before that, we’ll look at all the attributes to see if there are any duplicates. So, like, the first name: we look at each attribute on its own and the number of unique values that we have, and basically get an idea of the number of objects that we’re gonna go through.

The other thing that you’ll notice is that we try to make things comparable enough in the data frame. So for the date of birth, we try to represent it in a way that lets you easily calculate the distance. So we remove the slash or the hyphen. And that way I can easily do a similarity measure between the dates of birth of someone that was born, let’s say, January 1, 2020, and someone born January 10, 2020. So that way I can do a quick similarity match. We do that a lot. You’ll notice that on the postcode as well, and so on. Wherever possible, whenever we’re dealing with fields like these, we try to break them down to make them comparable.
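
The date normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: both sources are assumed to write dates in the same field order (year, month, day), and `normalize_date` and `char_differences` are hypothetical helper names, not functions from the demo.

```python
import re


def normalize_date(value: str) -> str:
    """Strip slashes, hyphens, dots, and whitespace so that dates written with
    different separators become plain digit strings."""
    return re.sub(r"[\s/\-.]", "", value)


def char_differences(a: str, b: str) -> int:
    """Count the positions at which two strings differ, plus any length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


first = normalize_date("2020-01-01")    # January 1, 2020, hyphen-separated
second = normalize_date("2020/01/10")   # January 10, 2020, slash-separated
differences = char_differences(first, second)
```

Once the separators are gone, the two dates differ only in their last two digits, so a character-level similarity measure sees them as close even though the raw strings used different formats.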

So it’s gonna take a little while. It’s not really a big dataset, but it takes about 10 minutes to run to completion. So, since this is pre-recorded, it’s gonna go a lot faster than that. So I think we have something out. We’re using Google Cloud, GCP, so I have the datasets in my Google Cloud bucket. We collect the outputs there, along with diagnostics that we use to check how things are going across the run, as well as the types of linkages. So these are the attributes that we have, and the clustered attributes. So after running the algorithm for about 10 minutes, we basically identify similar attributes for particular entities and cluster them together. So the next thing I’m doing here is basically loading this up into a data frame. So I’m gonna just take one of the records, so let’s take a look, okay. So you can see the output. Things that are similar are grouped together on the same line, and where there’s no similarity found, you can see it’s just one item. So these are the record IDs, and the record IDs are unique to each entity. But what we’ve basically found is that some of these entities have more than one record, with the record IDs grouped together. They have some similar attributes, and the algorithm thinks they’re basically the same thing. So the next thing I’m gonna do right now is to take the line item I picked and load it up.

So I will just try to look up all these record IDs, and we can look through them and see why we think they’re similar and why we’re recommending that the records be collapsed together into one. So what I basically did was load up everything in a list. So you can see the list. Those are the record IDs that we think are similar. Then I’ll go back to my original data set, and this is the original data set, if you recall. These are the record IDs, and they’re all unique. All these other attributes represent an entity, or a user profile, which is an entity, but some of these entities or profiles are actually duplicates. So this is the data set I showed you earlier. We’re gonna look through this data set for the record IDs I selected. So let’s do that real quick.

So these are some of the IDs. What we’re trying to do now is take these IDs and load them up. I’m gonna take the record IDs and basically look through this table to see what we’re looking at. So as you can see, I basically did a data frame filter on the record ID and pulled all the attributes for the records we picked. So you can see the last name. The last name looks similar. For the records that we picked up, we can look at the profile. In the first name there’s a little bit of a typo in some cases, the address is a little bit different, the postal code is similar, and the states are almost the same thing. So the algorithm has basically determined that these records and profiles all look alike. So I think that’s it at a high level, and that’s the way deduping and record linkage work, and a lot of credit goes to Neil for open-sourcing the base implementation of this framework that we used and built upon. So Timo, you’re going to wrap it up and take it from there.

– Okay, thank you, Charles. Just a couple of quick words as we’re wrapping up.
Why did we decide to use a Spark framework?

Why Spark?

Why was it the right choice for this particular approach? Now, one, Spark is a battle-tested distributed computing framework that’s supported in quite a wide range of environments: cloud, on-premise, et cetera.

The other thing is about the model itself: the various updates on a given partition of records and entities only depend on the variables on the same partition, which is really quite important. That allows for distributed, parallel computation across the multiple worker nodes, sharing variables when needed as Spark broadcast variables. By carefully selecting partitions, we’re able to easily distribute our partitioned data sets across the Spark executor nodes, leveraging the full distributed power of Spark. And finally, it’s an easily scalable framework that leverages Spark RDDs and DataFrames for efficient data distribution.
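To make the partitioning idea concrete, here is a small pure-Python sketch of KD-tree-style splitting: records are sorted on one attribute, cut at the median, and each half is split again on the next attribute. This is an illustration under stated assumptions, not the implementation used in the talk; the attribute names, the sample records, and the pair-counting helper are hypothetical.

```python
from typing import Dict, List, Sequence

Record = Dict[str, str]


def kd_partition(records: List[Record], attrs: Sequence[str],
                 depth: int, level: int = 0) -> List[List[Record]]:
    """Recursively split records at the median of alternating attributes.

    Returns up to 2**depth partitions of roughly equal size.
    """
    if depth == 0 or len(records) <= 1:
        return [records]
    attr = attrs[level % len(attrs)]  # cycle through the splitting attributes
    ordered = sorted(records, key=lambda r: r[attr])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], attrs, depth - 1, level + 1)
            + kd_partition(ordered[mid:], attrs, depth - 1, level + 1))


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


# Made-up records; only the two attributes used for splitting matter here.
names = ["adams", "baker", "clark", "davis", "evans", "frank", "green", "hill"]
codes = ["75001", "10001", "60601", "30301", "94101", "02101", "98101", "33101"]
records = [{"rec_id": f"r{i}", "last_name": n, "postcode": p}
           for i, (n, p) in enumerate(zip(names, codes))]

partitions = kd_partition(records, ["last_name", "postcode"], depth=2)
pairs_all = pair_count(len(records))
pairs_within = sum(pair_count(len(p)) for p in partitions)
```

With these eight records, two levels of splitting yield four partitions of two records each, so only 4 within-partition pairs remain instead of 28 overall. In Spark, each partition would be processed by an executor (for example through `RDD.mapPartitions`), with any shared values sent as broadcast variables.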

Just to quickly summarize our talk: we ran a Bayesian generative model for data linkage and deduplication with Apache Spark, and that has led us to conclude the following today. By partitioning large, distinct data sets and leveraging multiple nodes of Spark, we’re able to achieve scale with larger data sets and decreased run time, more so than we’ve been able to do before. We have support for inexact and fuzzy matching through string comparison and distance functions, which is really, really important, since the data we deal with often is not a perfect match. And the great thing is we’re able to achieve acceptable match accuracy despite this actually being an unsupervised approach, which is really quite cool, in that we don’t need all this training data first. And as previously mentioned, we have that cross-platform support: by using Spark, we can deploy this on major cloud providers such as AWS, Azure, and Google Cloud, and even on-premise as needed.

With that being said, we wanna thank everybody for listening in today. I know we shared a lot of information. If you’re interested in record linkage or data deduplication, or just have lots of data, please drop us an email, visit us, or follow us on Twitter. And again, a huge, huge thank you to the researchers and team that made this work open source, Neil Marchant and Rebecca Steorts, and the rest of the open source community behind this project.

About Charles Adetiloye


Charles is a Lead ML Platforms Engineer at MavenCode. He has well over 15 years of experience building large-scale, distributed applications. He has always been interested in building distributed systems. He has extensive experience working with Scala and Python on Apache Spark. Most recently he's been working on Machine Learning workload containerization and deployment on Kubernetes.

About Timo Mechler


Timo is a Product Manager and Architect at MavenCode. He has close to a decade of financial data modeling experience working both as an analyst and strategist in the energy commodities sector. At MavenCode he now works closely with the engineering teams to solve interesting data modeling challenges.