Using Images and Metadata for Product Fuzzy Matching with Zingg

Published: September 24, 2023

by Luke Bilbro, Sonal Goyal and Bryan Smith

Product matching is an essential function in many retail and consumer goods organizations. Incoming products are compared to items in the existing product catalog as suppliers make new items available for sale on online marketplaces. Suppliers compare product listings on retailer websites to ensure the content displayed matches terms and conditions in their contracts. Retailers may scrape each others’ websites and match products for pricing comparisons. And suppliers must reconcile retailer and third-party data for higher-level product aggregates with the individual SKUs they sale. For many organizations, this work is time-consuming and inexact.

The principal challenge in performing this work is that different organizations label the same products differently. Small variations in displayed product names, descriptions or bullet point feature listings intended to better connect customers with items can make exact matches impossible. GTIN codes and other universal identifiers are not often presented with this data when delivered at SKU-level, and when provided at an aggregate level, organizations must match data using unfamiliar labels before forming additional allocations that align the information with the formal product catalog. In other instances, minor variations in product details that may go unnoticed by consumers reflect slight differences in how an item is produced, something which may or may not be significant enough, depending on the scenario, for two products to be recognized as meaningfully different. And of course, there’s always room for variations due to simple input error and data truncation.

The simple solution to this problem is often just to look at the product listings. Humans are amazingly adept at bridging these differences and making judgements about whether two items are the same or similar enough to be considered the same for a given purpose. But as the number of products to compare grows, the number of potential comparisons grows multiplicatively, and very quickly we outpace what a human interpreter can keep up with.

Machine Learning Enables Product Matching

Using machine learning, we can build a model that compares the product metadata using a variety of techniques to estimate the probability that two items should be considered to be the same. Using a set of product pairs, some labeled as being matches and others labeled as being not matches (Figure 1), the model can learn how various measures of similarity between names, descriptions, prices, etc. influence our recognition of two products as a match.

Figure 1. Highly similar candidate pair of products being labeled as a match or a non-match based on expert feedback

Given the range of variations typically observed in the data, it is impossible to build a fully automated product matching pipeline. However, we might decide to automatically accept product pairs the model scores as highly likely to be matches. We might similarly automatically reject pairs scored a probability below a certain threshold. This leaves a much smaller set of products with matching probabilities somewhere in the middle that requires our experts’ attention.

The challenge then becomes which products to compare. Simple rules might tell us that we don’t need to compare products in the Books category with products in the Shoes category, but once we’ve narrowed the items we are willing to compare, we are often still left with a large number of products that can be difficult to compare in an exhaustive manner. This is where the technique of similarity approximation, also known as blocking, comes into play.

With blocking techniques, we can organize products within a space based on the degree of similarity between their attributes of relevance. We can then limit our comparisons to products that reside within a certain distance from one another, avoiding an exhaustive search between products so dissimilar as not to warrant a probability estimate. For example, two products in the same category that are essentially the same thing but which vary only by size or color might have enough overlap in their information to be considered to be close enough to one another to warrant consideration (Figure 2). Others may be so dissimilar that the blocking model can rule them out outright.

Figure 2. Dissimilar products with enough overlap in details to warrant consideration by an expert reviewer

Product Images Serve As a Valuable Input

But as was mentioned, even products adjacent to one other may not have a high enough degree of similarity for us to automatically accept a match. When this falls back to a human analyst, this person often compares the details for the candidate pair of products and uses these to make a judgment call.

As consumers, we do this all the time when we compare products online, and quite often we look at the product images often displayed with an item to quickly decide whether two items are similar enough to warrant a more detailed comparison. Images contain quite a bit of recognizable detail that we as humans can naturally employ to tackle aspects of this product comparison problem. If we could fold this same capability into our automated product comparisons, we could greatly enhance the accuracy of our product matching estimates and further reduce the burden on our human interpreters.

To do this, we can employ an embedding. An embedding is an array of numerical values that capture information about the structure of an image. Derived from models trained to compare a large number of images for similarities and dissimilarities, we can calculate the distance between the images derived from two images and use that to tell us something about the similarity between the two. Using this as yet another input into our fuzzy matching model, we can now have the model learn the degree to which two items should be considered the same based on both product metadata and image information.

Zingg Enables Fuzzy Matching with Both Metadata and Images

The various challenges that go into preparing incoming data, generating candidate pairs and estimating item matching probabilities require a considerable amount of specialized knowledge to address. The folks at Zingg, a Databricks partner, have worked to package the best practices from the field of entity-resolution into an easy to use, open source library that enables the various workflows that are typically constructed to perform product (and other master data) matching.

With Zingg, the various attributes associated with an item are identified based on the role they serve in describing the item being compared. This informs Zingg about which rules and techniques might be applied to prepare these in performing attribute level comparisons. Zingg automatically applies these rules to prepare the data for blocking and presents to the expert user a series of candidate pairs which that user can either accept or reject. This user feedback then becomes the input to the assembly of a fuzzy matching model which weighs the similarities between prepared attributes in terms of how they appear to influence the expert supplied acceptances and rejections.

As a distributed engine, Zingg has the advantage of being able to natively scale within a Databricks cluster so that as organizations need to work through comparisons on a large number of products, they can do so easily and efficiently. This is critical in scenarios where even modest volumes of product data can require large numbers of complex calculations, all of which need to be performed in a timely manner.

And with the 0.4 release of Zingg, support for numerical arrays has been incorporated. This allows us to perform a wide range of item comparisons, including image comparisons once we’ve applied appropriate data preparation to our image data.

See How Its Done

To demonstrate how Zingg can be used on Databricks to perform product matching with product metadata and images, we’ve built a new solution accelerator. In this accelerator, we leverage the Amazon-Berkeley Objects dataset to identify duplicates and matches between an initial set of product data and an incremental set of “newly arrived” product data.

Thumbnail images displayed with these products on the Amazon website are converted to embeddings leveraging a general purpose image model from Hugging Face. Through an interactive product labeling exercise facilitated by Zingg and hosted in a Databricks notebook, a model is trained to identify matching products using descriptive elements for each product and images.

The code for this accelerator is freely accessible. We hope that this accelerator inspires organizations in the retail and consumer goods space to develop more scalable and more robust product matching workflows that tap into the full potential of the data available to them in order to focus their people-resources on the most pressing needs surrounding the use of these data.

Check out our solution accelerator.