Skip to main content
Login
      • Discover
        • For Executives
          • For Startups
            • Lakehouse Architecture
              • Mosaic Research
              • Customers
                • Customer Stories
                • Partners
                  • Cloud Providers
                    Databricks on AWS, Azure, GCP, and SAP
                    • Consulting & System Integrators
                      Experts to build, deploy and migrate to Databricks
                      • Technology Partners
                        Connect your existing tools to your Lakehouse
                        • C&SI Partner Program
                          Build, deploy or migrate to the Lakehouse
                          • Data Partners
                            Access the ecosystem of data consumers
                            • Partner Solutions
                              Find custom industry and migration solutions
                              • Built on Databricks
                                Build, market and grow your business
                              • Databricks Platform
                                • Platform Overview
                                  A unified platform for data, analytics and AI
                                  • Data Management
                                    Data reliability, security and performance
                                    • Sharing
                                      An open, secure, zero-copy sharing for all data
                                      • Data Warehousing
                                        Serverless data warehouse for SQL analytics
                                        • Governance
                                          Unified governance for all data, analytics and AI assets
                                          • Real-Time Analytics
                                            Real-time analytics, AI and applications made simple
                                            • Artificial Intelligence
                                              Build and deploy ML and GenAI applications
                                              • Data Engineering
                                                ETL and orchestration for batch and streaming data
                                                • Business Intelligence
                                                  Intelligent analytics for real-world data
                                                  • Data Science
                                                    Collaborative data science at scale
                                                  • Integrations and Data
                                                    • Marketplace
                                                      Open marketplace for data, analytics and AI
                                                      • IDE Integrations
                                                        Build on the Lakehouse in your favorite IDE
                                                        • Partner Connect
                                                          Discover and integrate with the Databricks ecosystem
                                                        • Pricing
                                                          • Databricks Pricing
                                                            Explore product pricing, DBUs and more
                                                            • Cost Calculator
                                                              Estimate your compute costs on any cloud
                                                            • Open Source
                                                              • Open Source Technologies
                                                                Learn more about the innovations behind the platform
                                                              • Databricks for Industries
                                                                • Communications
                                                                  • Media and Entertainment
                                                                    • Financial Services
                                                                      • Public Sector
                                                                        • Healthcare & Life Sciences
                                                                          • Retail
                                                                            • Manufacturing
                                                                              • See All Industries
                                                                              • Cross Industry Solutions
                                                                                • Cybersecurity
                                                                                  • Marketing
                                                                                  • Migration & Deployment
                                                                                    • Data Migration
                                                                                      • Professional Services
                                                                                      • Solution Accelerators
                                                                                        • Explore Accelerators
                                                                                          Move faster toward outcomes that matter
                                                                                        • Training and Certification
                                                                                          • Learning Overview
                                                                                            Hub for training, certification, events and more
                                                                                            • Training Overview
                                                                                              Discover curriculum tailored to your needs
                                                                                              • Databricks Academy
                                                                                                Sign in to the Databricks learning platform
                                                                                                • Certification
                                                                                                  Gain recognition and differentiation
                                                                                                  • University Alliance
                                                                                                    Want to teach Databricks? See how.
                                                                                                  • Events
                                                                                                    • Data + AI Summit
                                                                                                      • Data + AI World Tour
                                                                                                        • Data Intelligence Days
                                                                                                          • Event Calendar
                                                                                                          • Blog and Podcasts
                                                                                                            • Databricks Blog
                                                                                                              Explore news, product announcements, and more
                                                                                                              • Databricks Mosaic Research Blog
                                                                                                                Discover the latest in our Gen AI research
                                                                                                                • Data Brew Podcast
                                                                                                                  Let’s talk data!
                                                                                                                  • Champions of Data + AI Podcast
                                                                                                                    Insights from data leaders powering innovation
                                                                                                                  • Get Help
                                                                                                                    • Customer Support
                                                                                                                      • Documentation
                                                                                                                        • Community
                                                                                                                        • Dive Deep
                                                                                                                          • Resource Center
                                                                                                                            • Demo Center
                                                                                                                            • Company
                                                                                                                              • Who We Are
                                                                                                                                • Our Team
                                                                                                                                  • Databricks Ventures
                                                                                                                                    • Contact Us
                                                                                                                                    • Careers
                                                                                                                                      • Working at Databricks
                                                                                                                                        • Open Jobs
                                                                                                                                        • Press
                                                                                                                                          • Awards and Recognition
                                                                                                                                            • Newsroom
                                                                                                                                            • Security and Trust
                                                                                                                                              • Security and Trust
                                                                                                                                          • Data and AI summit

                                                                                                                                            JUNE 9–12 | SAN FRANCISCO

                                                                                                                                            Data + AI Summit is almost here — don’t miss the chance to join us in San Francisco!

                                                                                                                                            REGISTER
                                                                                                                                          • Ready to get started?
                                                                                                                                          • Get a Demo
                                                                                                                                          Data and AI summit

                                                                                                                                          JUNE 9–12 | SAN FRANCISCO

                                                                                                                                          Data + AI Summit is almost here — don’t miss the chance to join us in San Francisco!

                                                                                                                                          REGISTER
                                                                                                                                          • Login
                                                                                                                                          • Try Databricks
                                                                                                                                          1. Blog
                                                                                                                                          2. /
                                                                                                                                            Retail & Consumer Goods
                                                                                                                                          3. /
                                                                                                                                            Article

                                                                                                                                          Using Images and Metadata for Product Fuzzy Matching with Zingg

                                                                                                                                          Using Images and Metadata for Product Fuzzy Matching with Zingg

                                                                                                                                          Published: September 24, 2023

                                                                                                                                          Retail & Consumer Goods6 min read

                                                                                                                                          by Luke Bilbro, Sonal Goyal and Bryan Smith

                                                                                                                                          Share this post

                                                                                                                                          Keep up with us

                                                                                                                                          Data Intelligence Platform for Retail

                                                                                                                                           

                                                                                                                                          Product matching is an essential function in many retail and consumer goods organizations. Incoming products are compared to items in the existing product catalog as suppliers make new items available for sale on online marketplaces. Suppliers compare product listings on retailer websites to ensure the content displayed matches terms and conditions in their contracts. Retailers may scrape each others’ websites and match products for pricing comparisons. And suppliers must reconcile retailer and third-party data for higher-level product aggregates with the individual SKUs they sale. For many organizations, this work is time-consuming and inexact.

                                                                                                                                          The principal challenge in performing this work is that different organizations label the same products differently. Small variations in displayed product names, descriptions or bullet point feature listings intended to better connect customers with items can make exact matches impossible. GTIN codes and other universal identifiers are not often presented with this data when delivered at SKU-level, and when provided at an aggregate level, organizations must match data using unfamiliar labels before forming additional allocations that align the information with the formal product catalog. In other instances, minor variations in product details that may go unnoticed by consumers reflect slight differences in how an item is produced, something which may or may not be significant enough, depending on the scenario, for two products to be recognized as meaningfully different. And of course, there’s always room for variations due to simple input error and data truncation.

                                                                                                                                          The simple solution to this problem is often just to look at the product listings. Humans are amazingly adept at bridging these differences and making judgements about whether two items are the same or similar enough to be considered the same for a given purpose. But as the number of products to compare grows, the number of potential comparisons grows multiplicatively, and very quickly we outpace what a human interpreter can keep up with.

                                                                                                                                          Machine Learning Enables Product Matching

                                                                                                                                          Using machine learning, we can build a model that compares the product metadata using a variety of techniques to estimate the probability that two items should be considered to be the same. Using a set of product pairs, some labeled as being matches and others labeled as being not matches (Figure 1), the model can learn how various measures of similarity between names, descriptions, prices, etc. influence our recognition of two products as a match.

                                                                                                                                          Figure 1. Highly similar candidate pair of products being labeled as a match or a non-match based on expert feedback

                                                                                                                                          Given the range of variations typically observed in the data, it is impossible to build a fully automated product matching pipeline. However, we might decide to automatically accept product pairs the model scores as highly likely to be matches. We might similarly automatically reject pairs scored a probability below a certain threshold. This leaves a much smaller set of products with matching probabilities somewhere in the middle that requires our experts’ attention.

                                                                                                                                          The challenge then becomes which products to compare. Simple rules might tell us that we don’t need to compare products in the Books category with products in the Shoes category, but once we’ve narrowed the items we are willing to compare, we are often still left with a large number of products that can be difficult to compare in an exhaustive manner. This is where the technique of similarity approximation, also known as blocking, comes into play.

                                                                                                                                          With blocking techniques, we can organize products within a space based on the degree of similarity between their attributes of relevance. We can then limit our comparisons to products that reside within a certain distance from one another, avoiding an exhaustive search between products so dissimilar as not to warrant a probability estimate. For example, two products in the same category that are essentially the same thing but which vary only by size or color might have enough overlap in their information to be considered to be close enough to one another to warrant consideration (Figure 2). Others may be so dissimilar that the blocking model can rule them out outright.

                                                                                                                                          Figure 2.  Dissimilar products with enough overlap in details to warrant consideration by an expert reviewer

                                                                                                                                          Product Images Serve As a Valuable Input

                                                                                                                                          But as was mentioned, even products adjacent to one other may not have a high enough degree of similarity for us to automatically accept a match. When this falls back to a human analyst, this person often compares the details for the candidate pair of products and uses these to make a judgment call.

                                                                                                                                          As consumers, we do this all the time when we compare products online, and quite often we look at the product images often displayed with an item to quickly decide whether two items are similar enough to warrant a more detailed comparison. Images contain quite a bit of recognizable detail that we as humans can naturally employ to tackle aspects of this product comparison problem. If we could fold this same capability into our automated product comparisons, we could greatly enhance the accuracy of our product matching estimates and further reduce the burden on our human interpreters.

                                                                                                                                          To do this, we can employ an embedding. An embedding is an array of numerical values that capture information about the structure of an image. Derived from models trained to compare a large number of images for similarities and dissimilarities, we can calculate the distance between the images derived from two images and use that to tell us something about the similarity between the two. Using this as yet another input into our fuzzy matching model, we can now have the model learn the degree to which two items should be considered the same based on both product metadata and image information.

                                                                                                                                          Zingg Enables Fuzzy Matching with Both Metadata and Images

                                                                                                                                          The various challenges that go into preparing incoming data, generating candidate pairs and estimating item matching probabilities require a considerable amount of specialized knowledge to address. The folks at Zingg, a Databricks partner, have worked to package the best practices from the field of entity-resolution into an easy to use, open source library that enables the various workflows that are typically constructed to perform product (and other master data) matching.

                                                                                                                                          With Zingg, the various attributes associated with an item are identified based on the role they serve in describing the item being compared. This informs Zingg about which rules and techniques might be applied to prepare these in performing attribute level comparisons. Zingg automatically applies these rules to prepare the data for blocking and presents to the expert user a series of candidate pairs which that user can either accept or reject. This user feedback then becomes the input to the assembly of a fuzzy matching model which weighs the similarities between prepared attributes in terms of how they appear to influence the expert supplied acceptances and rejections.

                                                                                                                                          As a distributed engine, Zingg has the advantage of being able to natively scale within a Databricks cluster so that as organizations need to work through comparisons on a large number of products, they can do so easily and efficiently. This is critical in scenarios where even modest volumes of product data can require large numbers of complex calculations, all of which need to be performed in a timely manner.

                                                                                                                                          And with the 0.4 release of Zingg, support for numerical arrays has been incorporated. This allows us to perform a wide range of item comparisons, including image comparisons once we’ve applied appropriate data preparation to our image data.

                                                                                                                                          See How Its Done

                                                                                                                                          To demonstrate how Zingg can be used on Databricks to perform product matching with product metadata and images, we’ve built a new solution accelerator. In this accelerator, we leverage the Amazon-Berkeley Objects dataset to identify duplicates and matches between an initial set of product data and an incremental set of “newly arrived” product data.

                                                                                                                                          Thumbnail images displayed with these products on the Amazon website are converted to embeddings leveraging a general purpose image model from Hugging Face. Through an interactive product labeling exercise facilitated by Zingg and hosted in a Databricks notebook, a model is trained to identify matching products using descriptive elements for each product and images.

                                                                                                                                          The code for this accelerator is freely accessible. We hope that this accelerator inspires organizations in the retail and consumer goods space to develop more scalable and more robust product matching workflows that tap into the full potential of the data available to them in order to focus their people-resources on the most pressing needs surrounding the use of these data.

                                                                                                                                          Check out our solution accelerator.

                                                                                                                                          Keep up with us

                                                                                                                                          Recommended for you

                                                                                                                                          Share this post

                                                                                                                                          Never miss a Databricks post

                                                                                                                                          Subscribe to the categories you care about and get the latest posts delivered to your inbox

                                                                                                                                          Sign up

                                                                                                                                          What's next?

                                                                                                                                          Data Intelligence Platforms

                                                                                                                                          Retail & Consumer Goods

                                                                                                                                          September 20, 2023/11 min read

                                                                                                                                          How Edmunds builds a blueprint for generative AI

                                                                                                                                          Building a Generative AI Workflow for the Creation of More Personalized Marketing Content

                                                                                                                                          Retail & Consumer Goods

                                                                                                                                          September 9, 2024/6 min read

                                                                                                                                          Building a Generative AI Workflow for the Creation of More Personalized Marketing Content

                                                                                                                                          databricks logo
                                                                                                                                          Why Databricks
                                                                                                                                          Discover
                                                                                                                                          • For Executives
                                                                                                                                          • For Startups
                                                                                                                                          • Lakehouse Architecture
                                                                                                                                          • Mosaic Research
                                                                                                                                          Customers
                                                                                                                                          • Customer Stories
                                                                                                                                          Partners
                                                                                                                                          • Cloud Providers
                                                                                                                                          • Technology Partners
                                                                                                                                          • Data Partners
                                                                                                                                          • Built on Databricks
                                                                                                                                          • Consulting & System Integrators
                                                                                                                                          • C&SI Partner Program
                                                                                                                                          • Partner Solutions
                                                                                                                                          Discover
                                                                                                                                          • For Executives
                                                                                                                                          • For Startups
                                                                                                                                          • Lakehouse Architecture
                                                                                                                                          • Mosaic Research
                                                                                                                                          Customers
                                                                                                                                          • Customer Stories
                                                                                                                                          Partners
                                                                                                                                          • Cloud Providers
                                                                                                                                          • Technology Partners
                                                                                                                                          • Data Partners
                                                                                                                                          • Built on Databricks
                                                                                                                                          • Consulting & System Integrators
                                                                                                                                          • C&SI Partner Program
                                                                                                                                          • Partner Solutions
                                                                                                                                          Product
                                                                                                                                          Databricks Platform
                                                                                                                                          • Platform Overview
                                                                                                                                          • Sharing
                                                                                                                                          • Governance
                                                                                                                                          • Artificial Intelligence
                                                                                                                                          • Business Intelligence
                                                                                                                                          • Data Management
                                                                                                                                          • Data Warehousing
                                                                                                                                          • Real-Time Analytics
                                                                                                                                          • Data Engineering
                                                                                                                                          • Data Science
                                                                                                                                          Pricing
                                                                                                                                          • Pricing Overview
                                                                                                                                          • Pricing Calculator
                                                                                                                                          Open Source
                                                                                                                                          Integrations and Data
                                                                                                                                          • Marketplace
                                                                                                                                          • IDE Integrations
                                                                                                                                          • Partner Connect
                                                                                                                                          Databricks Platform
                                                                                                                                          • Platform Overview
                                                                                                                                          • Sharing
                                                                                                                                          • Governance
                                                                                                                                          • Artificial Intelligence
                                                                                                                                          • Business Intelligence
                                                                                                                                          • Data Management
                                                                                                                                          • Data Warehousing
                                                                                                                                          • Real-Time Analytics
                                                                                                                                          • Data Engineering
                                                                                                                                          • Data Science
                                                                                                                                          Pricing
                                                                                                                                          • Pricing Overview
                                                                                                                                          • Pricing Calculator
                                                                                                                                          Integrations and Data
                                                                                                                                          • Marketplace
                                                                                                                                          • IDE Integrations
                                                                                                                                          • Partner Connect
                                                                                                                                          Solutions
                                                                                                                                          Databricks For Industries
                                                                                                                                          • Communications
                                                                                                                                          • Financial Services
                                                                                                                                          • Healthcare and Life Sciences
                                                                                                                                          • Manufacturing
                                                                                                                                          • Media and Entertainment
                                                                                                                                          • Public Sector
                                                                                                                                          • Retail
                                                                                                                                          • View All
                                                                                                                                          Cross Industry Solutions
                                                                                                                                          • Cybersecurity
                                                                                                                                          • Marketing
                                                                                                                                          Data Migration
                                                                                                                                          Professional Services
                                                                                                                                          Solution Accelerators
                                                                                                                                          Databricks For Industries
                                                                                                                                          • Communications
                                                                                                                                          • Financial Services
                                                                                                                                          • Healthcare and Life Sciences
                                                                                                                                          • Manufacturing
                                                                                                                                          • Media and Entertainment
                                                                                                                                          • Public Sector
                                                                                                                                          • Retail
                                                                                                                                          • View All
                                                                                                                                          Cross Industry Solutions
                                                                                                                                          • Cybersecurity
                                                                                                                                          • Marketing
                                                                                                                                          Resources
                                                                                                                                          Documentation
                                                                                                                                          Customer Support
                                                                                                                                          Community
                                                                                                                                          Training and Certification
                                                                                                                                          • Learning Overview
                                                                                                                                          • Training Overview
                                                                                                                                          • Certification
                                                                                                                                          • University Alliance
                                                                                                                                          • Databricks Academy Login
                                                                                                                                          Events
                                                                                                                                          • Data + AI Summit
                                                                                                                                          • Data + AI World Tour
                                                                                                                                          • Data Intelligence Days
                                                                                                                                          • Event Calendar
                                                                                                                                          Blog and Podcasts
                                                                                                                                          • Databricks Blog
                                                                                                                                          • Databricks Mosaic Research Blog
                                                                                                                                          • Data Brew Podcast
                                                                                                                                          • Champions of Data & AI Podcast
                                                                                                                                          Training and Certification
                                                                                                                                          • Learning Overview
                                                                                                                                          • Training Overview
                                                                                                                                          • Certification
                                                                                                                                          • University Alliance
                                                                                                                                          • Databricks Academy Login
                                                                                                                                          Events
                                                                                                                                          • Data + AI Summit
                                                                                                                                          • Data + AI World Tour
                                                                                                                                          • Data Intelligence Days
                                                                                                                                          • Event Calendar
                                                                                                                                          Blog and Podcasts
                                                                                                                                          • Databricks Blog
                                                                                                                                          • Databricks Mosaic Research Blog
                                                                                                                                          • Data Brew Podcast
                                                                                                                                          • Champions of Data & AI Podcast
                                                                                                                                          About
                                                                                                                                          Company
                                                                                                                                          • Who We Are
                                                                                                                                          • Our Team
                                                                                                                                          • Databricks Ventures
                                                                                                                                          • Contact Us
                                                                                                                                          Careers
                                                                                                                                          • Open Jobs
                                                                                                                                          • Working at Databricks
                                                                                                                                          Press
                                                                                                                                          • Awards and Recognition
                                                                                                                                          • Newsroom
                                                                                                                                          Security and Trust
                                                                                                                                          Company
                                                                                                                                          • Who We Are
                                                                                                                                          • Our Team
                                                                                                                                          • Databricks Ventures
                                                                                                                                          • Contact Us
                                                                                                                                          Careers
                                                                                                                                          • Open Jobs
                                                                                                                                          • Working at Databricks
                                                                                                                                          Press
                                                                                                                                          • Awards and Recognition
                                                                                                                                          • Newsroom
                                                                                                                                          databricks logo

                                                                                                                                          Databricks Inc.
                                                                                                                                          160 Spear Street, 15th Floor
                                                                                                                                          San Francisco, CA 94105
                                                                                                                                          1-866-330-0121

                                                                                                                                          See Careers
                                                                                                                                          at Databricks

                                                                                                                                          © Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the Apache Software Foundation.

                                                                                                                                          • Privacy Notice
                                                                                                                                          • |Terms of Use
                                                                                                                                          • |Modern Slavery Statement
                                                                                                                                          • |California Privacy
                                                                                                                                          • |Your Privacy Choices