
New Solution Accelerator: Customer Entity Resolution

Building an ML-based Customer360 with Zingg
Luke Bilbro
Sonal Goyal
Bryan Smith
Mimi Qunell
Check out our new Customer Entity Resolution Solution Accelerator for more details and to download the notebooks.

A growing number of customers now expect personalized interactions as part of their shopping experience. Whether browsing in-app, receiving offers via email or being pursued by online advertisements, more and more people expect the brands with which they interact to recognize their individual needs and preferences and to tailor the engagement accordingly. In fact, 76% of consumers are more likely to consider buying from a brand that personalizes. And as organizations pursue omnichannel excellence, these same high expectations are extending into the in-store experience through digitally-assisted employee interactions, offers of specialized in-person services and more. In an age of shopper choice, retailers are getting the message that personalized engagement is fundamental to attracting and retaining customer spend.

The key to getting personalized interactions right is deriving actionable insights from every bit of information that can be gathered about a customer. First-party data generated through sales transactions, website browsing, product ratings, customer surveys and support center calls, third-party data purchased from data aggregators and online trackers, and even zero-party data provided by customers themselves come together to form a 360-degree view of the customer. While conversations about Customer-360 platforms tend to focus on the volume and variety of data with which the organization must work and the range of data science use cases often applied to them, the reality is that a Customer-360 view cannot be achieved without establishing a common customer identity, linking together customer records across the disparate datasets.

Matching Customer Records Is Challenging

On the surface, the idea of determining a common customer identity across systems seems pretty straightforward. But across disparate data sources with different data types, it is rare that a unique identifier is available to support record linking. Instead, most data sources carry their own identifiers, leaving basic name and address information as the common ground for cross-dataset record matching. Putting aside the challenge that customer attributes, and therefore the data, may change over time, automated matching on names and addresses can be incredibly challenging due to non-standard formats and common data interpretation and entry errors.

Take for instance the name of one of our authors: Bryan. This name has been recorded in various systems as Bryan, Brian, Ryan, Byron and even Brain. If Bryan lives at 123 Main Street, he might find this address entered as 123 Main Street, 123 Main St or 123 Main across various systems, all of which are perfectly valid even if inconsistent.

To a human interpreter, records with common variations of a customer's name and generally accepted variations of an address are pretty easy to match. But to match the millions of customer identities most retail organizations are confronted with, we need to lean on software to automate the process. Most first attempts tend to capture human knowledge of known variations in rules and patterns to match those records, but this often leads to an unmanageable and sometimes unpredictable web of software logic. To avoid this, more and more organizations facing the challenge of matching customers based on variable attributes find themselves turning to machine learning.
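To see why rules-based matching turns into a web of logic, consider a minimal, purely hypothetical sketch of the approach in Python. Every new nickname, abbreviation or typo demands another entry or another rule, and the rules soon interact in unpredictable ways.

```python
# Hypothetical hand-coded matching rules. Each lookup table must be curated by
# hand and extended every time a new variation is discovered in the data.
NICKNAMES = {"bryan": {"brian", "byron"}, "william": {"bill", "will"}}
STREET_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize_street(address: str) -> str:
    # Collapse known abbreviations; unknown variants silently slip through.
    words = address.lower().split()
    return " ".join(STREET_ABBREVIATIONS.get(w, w) for w in words)

def names_match(a: str, b: str) -> bool:
    a, b = a.lower(), b.lower()
    return a == b or b in NICKNAMES.get(a, set()) or a in NICKNAMES.get(b, set())

print(names_match("Bryan", "Brian"))                  # True
print(names_match("Bryan", "Brain"))                  # False: no rule for the typo
print(normalize_street("123 Main Street") ==
      normalize_street("123 Main St"))                # True
print(normalize_street("123 Main Street") ==
      normalize_street("123 Main"))                   # False: no rule covers this
```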

Machine Learning Provides a Scalable Approach

In a machine learning (ML) approach to entity resolution, text attributes such as name, address and phone number are translated into numerical representations that can be used to quantify the degree of similarity between any two attribute values. Models are then trained to weigh the relative importance of each of these scores in determining if a pair of records is a match.
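As a minimal illustration of that first step (not how Zingg implements it), the sketch below uses Python's standard-library difflib to turn each shared attribute of a hypothetical record pair into a 0.0-1.0 similarity score:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score how alike two attribute values are, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

record_a = {"name": "Bryan Smith", "address": "123 Main Street", "phone": "555-0123"}
record_b = {"name": "Brian Smith", "address": "123 Main St", "phone": "555-0123"}

# One similarity score per shared attribute; these become the model's features.
features = {field: similarity(record_a[field], record_b[field]) for field in record_a}
print(features)  # {'name': ~0.91, 'address': ~0.85, 'phone': 1.0}
```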

For example, slight differences between the spelling of a first name may be given less importance if a perfect match between something like a phone number is found. In some ways, this approach mirrors the natural tendencies humans use when examining records, while being far more scalable and consistent when applied across a large dataset.

That said, our ability to train such a model depends on our access to accurately labeled training data, i.e. pairs of records reviewed by experts and labeled as either a match or not a match. Ultimately, this is data we know is correct and that our model can learn from. In the early phase of most ML-based approaches to entity resolution, a relatively small subset of pairs likely to be a match for each other are assembled, annotated and fed to the model algorithm. It's a time-consuming exercise, but if done right, the model learns to reflect the judgments of the human reviewers.
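Continuing the earlier sketch, and again only as a hypothetical illustration, a simple classifier such as scikit-learn's logistic regression can learn per-attribute weights from a handful of expert-labeled pairs:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical expert-labeled pairs: per-field similarity scores
# (name, address, phone) plus a label (1 = match, 0 = not a match).
X = [
    [0.91, 0.85, 1.00],  # "Bryan Smith" vs "Brian Smith", same phone -> match
    [0.98, 0.95, 1.00],
    [0.95, 0.40, 0.90],
    [0.40, 0.20, 0.10],  # clearly different people -> not a match
    [0.85, 0.15, 0.05],
    [0.30, 0.80, 0.20],
]
y = [1, 1, 1, 0, 0, 0]

model = LogisticRegression().fit(X, y)
print(model.coef_)  # learned relative importance of name, address, phone

# Probability that a new candidate pair is a match.
print(model.predict_proba([[0.92, 0.80, 1.00]])[0, 1])
```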

With a trained model in hand, our next challenge is to efficiently locate the record pairs worth comparing. A simplistic approach to record comparison would be to compare each record to every other one in the dataset. While straightforward, this brute-force approach results in an explosion of comparisons that quickly gets computationally out of hand: n records require n(n-1)/2 comparisons, so a dataset of just 1 million records implies roughly 500 billion pairs.

A more intelligent approach is to recognize that similar records will have similar numerical scores assigned to their attributes. By limiting comparisons to just those records within a given distance (based on differences in these scores) from one another, we can rapidly locate just the worthwhile comparisons, i.e. candidate pairs. Again, this closely mirrors human intuition as we'd quickly eliminate two records from a detailed comparison if these records had first names of Thomas and William or addresses in completely different states or provinces.
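Zingg learns its blocking functions from the training data, but the idea can be illustrated with a hand-rolled (and purely hypothetical) blocking key: records are only compared against other records that share the same key, so the vast majority of pairs are never examined at all.

```python
from itertools import combinations

records = [
    {"id": 1, "name": "Bryan Smith",   "zip": "30301"},
    {"id": 2, "name": "Brian Smith",   "zip": "30301"},
    {"id": 3, "name": "William Jones", "zip": "94105"},
]

def block_key(r):
    # Hypothetical key: first letter of the name plus the ZIP prefix.
    return (r["name"][0].upper(), r["zip"][:3])

blocks = {}
for r in records:
    blocks.setdefault(block_key(r), []).append(r)

# Candidate pairs are generated only within each block.
candidate_pairs = [
    pair for members in blocks.values() for pair in combinations(members, 2)
]
print([(a["id"], b["id"]) for a, b in candidate_pairs])  # [(1, 2)]; 3 is never compared
```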

Bringing these two elements of our approach together, we now have a means to quickly identify record pairs worth comparing and a means to score each pair for the likelihood of a match. These scores are presented as probabilities between 0.0 and 1.0 that capture the model's confidence that two records represent the same individual. On the extreme ends of the probability range, we can often define thresholds above or below which we simply accept the model's judgment and move on. But in the middle, we are left with a (hopefully small) set of pairs for which human expertise is once again needed to make a final judgment call.
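In code, that triage logic is just a pair of thresholds; the cutoff values below are hypothetical and would be tuned per dataset against the cost of a wrong merge versus a manual review:

```python
def triage(match_probability: float, lower: float = 0.2, upper: float = 0.9) -> str:
    """Route a scored candidate pair based on the model's confidence."""
    if match_probability >= upper:
        return "auto-accept"   # confidently the same customer
    if match_probability <= lower:
        return "auto-reject"   # confidently different customers
    return "expert review"     # ambiguous middle: human judgment needed

for p in (0.97, 0.55, 0.04):
    print(p, "->", triage(p))
```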

Zingg Simplifies ML-Based Entity Resolution

The field of entity resolution is full of techniques, variations on these techniques and evolving best practices which researchers have found work well to identify quality matches on different datasets. Instead of maintaining the expertise required to apply the latest academic knowledge to challenges such as customer identity resolution, many organizations rely on libraries encapsulating this knowledge to build their applications and workflows.

One such library is Zingg, an open source library bringing together the latest ML-based approaches to intelligent candidate pair generation and pair-scoring. Oriented towards the construction of custom workflows, Zingg presents these capabilities within the context of commonly employed steps such as training data label assignment, model training, dataset deduplication and (cross-dataset) record matching.
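As a rough sketch of what this looks like with Zingg's Python API (the field names, paths and model ID here are hypothetical, and exact signatures may vary by Zingg version), a job is configured once and then run through the phases in turn: findTrainingData, label, train and finally match or link.

```python
from zingg.client import Arguments, ClientOptions, FieldDefinition, MatchType, Zingg
from zingg.pipes import CsvPipe

args = Arguments()
args.setModelId("customer360")           # hypothetical model identifier
args.setZinggDir("/tmp/zingg/models")    # where Zingg persists the trained model

# Declare which fields to compare and how to compare them.
args.setFieldDefinition([
    FieldDefinition("fname", "string", MatchType.FUZZY),
    FieldDefinition("lname", "string", MatchType.FUZZY),
    FieldDefinition("street", "string", MatchType.FUZZY),
    FieldDefinition("phone", "string", MatchType.FUZZY),
])

args.setData(CsvPipe("customers", "/tmp/zingg/customers.csv"))
args.setOutput(CsvPipe("matched", "/tmp/zingg/output"))

# Run one phase per invocation: findTrainingData -> label -> train -> match.
options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])
Zingg(args, options).initAndExecute()
```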

Built as a native Apache Spark application, Zingg scales well to apply these techniques to enterprise-sized datasets. Organizations can then use Zingg in combination with platforms such as Databricks to provide the backend to human-in-the-middle workflow applications that automate the bulk of the entity resolution work and present data experts with a more manageable set of edge case pairs to interpret. As an active-learning solution, models can be retrained to take advantage of this additional human input to improve future predictions and further reduce the number of cases requiring expert review.
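The retraining loop itself is conceptually simple. The generic sketch below is not Zingg's implementation; ask_expert stands in for whatever labeling UI the workflow provides. It selects the pairs the current model is least sure about, collects expert labels and refits:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(model, X_pool, ask_expert, X_train, y_train, batch_size=10):
    """One human-in-the-middle round: label the most ambiguous pairs, then retrain."""
    probs = model.predict_proba(X_pool)[:, 1]
    # The pairs the model is least certain about sit nearest probability 0.5.
    uncertain = np.argsort(np.abs(probs - 0.5))[:batch_size]
    new_labels = [ask_expert(X_pool[i]) for i in uncertain]
    X_train = np.vstack([X_train, X_pool[uncertain]])
    y_train = np.concatenate([y_train, new_labels])
    return LogisticRegression().fit(X_train, y_train), X_train, y_train
```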

Interested in seeing how this works? Check out the Databricks customer entity resolution solution accelerator. In this accelerator, we show how customer entity resolution best practices can be applied leveraging Zingg and Databricks to deduplicate records representing 5 million individuals. By following the step-by-step instructions provided, users can learn how the building blocks provided by these technologies can be assembled to enable their own enterprise-scaled customer entity resolution workflow applications.

