Workflows helps data teams scale and reduce costs
Ahold Delhaize is one of the world’s largest food retail groups and is the parent organization to several companies in multiple countries. The organization’s brands operate more than 7,452 grocery and specialty stores. Each of its brands is dedicated to helping customers eat well, save time and live better. Ahold Delhaize leverages data and AI in every aspect of its business — from customer personalization and recommendations to loyalty, food waste reduction and environmental impact programs, to logistics, forecasting, optimization, analytics, inventory management, and more. It also uses data to determine how promotions and sales perform among different customer groups, and for real-time monitoring, including things like holiday and weather data that inform business decisions such as sales and special offers. Using the Databricks Data Intelligence Platform, Workflows, and Auto Loader, the company created a self-service data platform to enable data engineers across its companies to build pipelines to meet their data science and AI/ML needs and create more value for approximately 60 million weekly customers.
Creating a self-service platform with Databricks Workflows and Auto Loader
Ahold Delhaize’s self-service data platform allows data engineers across the company to define “out of the box” data pipelines without spending time on the laborious work of assembling the pipelines and deploying them to a production-suitable environment. Internal users of the platform create configuration files that define the workflows they need and check them in to Git. The addition of new files triggers GitHub Actions, which calls Terraform to create multitask workflows based on the definition files. These workflows serve data consumers, such as data analysts and data scientists, who use them for BI and AI projects.
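The step from a checked-in definition file to a multitask workflow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Ahold Delhaize’s actual tooling: the definition layout (`key`, `notebook`, `after`), the pipeline and notebook names, and the function `build_job_spec` are all hypothetical. The output keys (`tasks`, `task_key`, `depends_on`, `notebook_task`) follow the shape of a Databricks Jobs API multitask job definition. Compute settings are left out for brevity.

```python
import json

# Hypothetical definition, as a data engineer might check it in to Git.
PIPELINE_DEFINITION = {
    "name": "store_sales_ingestion",
    "tasks": [
        {"key": "ingest", "notebook": "/pipelines/ingest_sales"},
        {"key": "clean", "notebook": "/pipelines/clean_sales", "after": ["ingest"]},
        {"key": "publish", "notebook": "/pipelines/publish_sales", "after": ["clean"]},
    ],
}


def build_job_spec(definition: dict) -> dict:
    """Translate a simple pipeline definition into a multitask job payload."""
    known_keys = {task["key"] for task in definition["tasks"]}
    tasks = []
    for task in definition["tasks"]:
        upstream = task.get("after", [])
        # Reject references to tasks that the definition never declares.
        unknown = [key for key in upstream if key not in known_keys]
        if unknown:
            raise ValueError(f"Task {task['key']!r} depends on unknown tasks: {unknown}")
        tasks.append({
            "task_key": task["key"],
            "depends_on": [{"task_key": key} for key in upstream],
            "notebook_task": {"notebook_path": task["notebook"]},
        })
    return {"name": definition["name"], "tasks": tasks}


if __name__ == "__main__":
    print(json.dumps(build_job_spec(PIPELINE_DEFINITION), indent=2))
```

Validating the dependencies before anything is deployed means a typo in a definition file fails in the CI/CD run rather than in production.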
This CI/CD process makes configuring data flows faster and enables self-service for processes such as ML model training and BI dashboard updates. For example, the data team at Etos — one of many companies under the Ahold Delhaize umbrella — uses the platform for machine learning and model training for personalization and inventory forecasting, as well as analytics to analyze store performance and promos. “We give the data scientists everything they need, including their own Databricks workspace, so they can do whatever they want,” says Ivo Van de Grift, Tech Lead for the data team at Etos.
Data teams building on top of the self-service data platform own the workspaces, where they define data pipelines and have the flexibility to use additional lakehouse technologies such as Delta Live Tables (DLT). The team at Etos uses DLT to simplify creating the data sets that feed consumption pipelines for data scientists and analysts. “DLT is the easiest way to create a consumption data set; it does everything for you,” Van de Grift says. “We’re a smaller team, and DLT saves us so much time.”
The self-service data platform that enables teams across the company (including the Etos data team) was developed by a core data team led by Charlotte Van der Scheun, Platform Engineering Team Tech Lead. According to Van der Scheun, building the platform to ingest real-time data is critical. “For Ahold Delhaize, the top benefits of streaming are that it is cheaper and allows everyone to get data faster,” she explains. Ahold Delhaize uses Kafka to send incoming raw data (both online and offline data from physical stores, transactional data, product information, merchandising data, and data from online sources such as Google Analytics) to a landing zone and then to Auto Loader for streaming ingestion. Following the medallion architecture, raw data is stored in a “Bronze” Delta Lake table and then cleaned, deduplicated and loaded to a “Silver” Delta Lake table, where it is made available to teams across the organization.
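The landing zone to Bronze to Silver flow might look something like the sketch below. It is illustrative only and assumes a Databricks runtime where a Spark session is available; the function names, paths, table names, the JSON file format, and the key columns are all assumptions rather than details from Ahold Delhaize’s platform. Auto Loader is exposed as the `cloudFiles` streaming source, and the Silver step here is a simple full refresh, which a production pipeline would likely replace with an incremental merge.

```python
def ingest_to_bronze(spark, landing_path: str, checkpoint_path: str, bronze_table: str):
    """Incrementally load new landing-zone files into a Bronze table with Auto Loader."""
    raw_stream = (
        spark.readStream.format("cloudFiles")  # "cloudFiles" is the Auto Loader source
        .option("cloudFiles.format", "json")  # assumed file format of the landing zone
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
        .load(landing_path)
    )
    # The checkpoint records which files were already ingested, so restarts do not reload them.
    return (
        raw_stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .toTable(bronze_table)
    )


def refresh_silver(spark, bronze_table: str, silver_table: str, key_columns: list):
    """Drop incomplete and duplicate Bronze records, then overwrite the Silver table."""
    cleaned = (
        spark.read.table(bronze_table)
        .dropna(subset=key_columns)  # discard rows missing any key column
        .dropDuplicates(key_columns)  # keep one row per key
    )
    cleaned.write.mode("overwrite").saveAsTable(silver_table)
```

In a notebook this could be called as `ingest_to_bronze(spark, "/landing/sales", "/checkpoints/sales", "bronze.sales")`, where every argument value is a placeholder.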
The team initially used Azure Data Factory (ADF) but found that utilizing Databricks Workflows for orchestration and Auto Loader for ingestion made things easier. “ADF was used as a middleman to trigger workflows, but we realized with Auto Loader we didn’t need that anymore,” says Van der Scheun. According to Van de Grift, end users of the data platform, such as the Etos data team, also experience the benefits of removing the extra step. “With Databricks Workflows, we have a smaller technology footprint, which always means faster and easier deployments. It is simpler to have everything in one place,” he adds.
Similar to the Etos team, the data science team at Albert Heijn, a supermarket chain in the Netherlands that is also a part of Ahold Delhaize, is using Workflows for their specific needs. “Our team effectively leverages the features of Databricks Workflows to orchestrate complex feature engineering and ML training pipelines, breaking them down into smaller tasks,” says Panagiotis Giannakoulias, Lead ML Engineer at Albert Heijn. “This approach promotes error isolation, making it easier to identify and resolve issues, while also ensuring tasks run on time and in the correct sequence. By automating these tasks through Databricks Workflows, we enable continuous training and improvement for our models. Combined with the self-service data platform’s ease in data access and exchange, we create a seamless transition from development to deployment, empowering data innovation to truly flourish in our ever-evolving world.”
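The idea of breaking a pipeline into smaller tasks that run in the correct sequence, with failures isolated to the affected branch, can be illustrated with a small stand-alone sketch. It uses Python’s standard `graphlib` module rather than Databricks Workflows itself, and the task names, the `run_pipeline` function, and the status labels are hypothetical; it is meant only to show the concept, not how Workflows is implemented.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks: dict, dependencies: dict) -> dict:
    """Run tasks in dependency order and skip anything downstream of a failure.

    tasks maps a task name to a zero-argument callable.
    dependencies maps a task name to the set of upstream task names.
    """
    graph = {name: set(dependencies.get(name, set())) for name in tasks}
    unknown = {up for ups in graph.values() for up in ups} - set(tasks)
    if unknown:
        raise ValueError(f"Unknown upstream tasks: {sorted(unknown)}")

    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        if any(status[up] != "succeeded" for up in graph[name]):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "succeeded"
        except Exception:
            status[name] = "failed"
    return status


def broken_step():
    raise RuntimeError("simulated failure")


# Hypothetical feature engineering and training pipeline.
TASKS = {
    "load_sales": lambda: None,
    "load_products": lambda: None,
    "sales_features": broken_step,
    "product_features": lambda: None,
    "train_model": lambda: None,
}
DEPENDENCIES = {
    "sales_features": {"load_sales"},
    "product_features": {"load_products"},
    "train_model": {"sales_features", "product_features"},
}

if __name__ == "__main__":
    for task, outcome in run_pipeline(TASKS, DEPENDENCIES).items():
        print(task, outcome)
```

In this example the failure in `sales_features` stops `train_model` from running, while the unrelated `product_features` branch still completes, which narrows down where to look when something goes wrong.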
Databricks Workflows leads to reduced deployment time, lower costs
With Databricks Workflows and Auto Loader as its foundation, Ahold Delhaize’s self-service data platform provides the company’s internal data users with access to everything they need to perform a broad range of data science, analytics and AI/ML tasks. This has led to improved productivity. The company’s data team now runs approximately 1,165 ingestion jobs daily. When including jobs run in the consumer teams’ workspaces, that number is roughly two to three times higher.
Databricks Workflows also helps Ahold Delhaize with monitoring and observability of these processes and makes troubleshooting easier. Van de Grift’s team at Etos configured Workflows to send email alerts when there are issues so the team can immediately see where the problem is and how to fix it. The core data team under Van der Scheun also leverages Workflows’ observability capabilities for better issue resolution and has noticed an improvement in that area. “Databricks Workflows allows us to clearly see how every job ran and whether it succeeded or failed. In our previous solution, we had a lot of moving parts, we had multiple triggers and multiple dependent pipelines that triggered each other. With the use of Workflows, there is only one job where we have all that information right in front of us,” she explains.
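Email alerts of this kind are typically part of the job definition itself. The sketch below assumes a job payload in the shape used by the Databricks Jobs API, where `email_notifications.on_failure` lists the addresses to notify when a run fails; the helper name `with_failure_alerts` and the example address are hypothetical, and how the Etos team actually configures its alerts is not described here.

```python
def with_failure_alerts(job_spec: dict, recipients: list) -> dict:
    """Return a copy of job_spec that emails the recipients when a run fails."""
    if not recipients:
        raise ValueError("At least one recipient is required")
    updated = dict(job_spec)
    notifications = dict(updated.get("email_notifications", {}))
    # Merge with any addresses already configured, without duplicates.
    existing = set(notifications.get("on_failure", []))
    notifications["on_failure"] = sorted(existing | set(recipients))
    updated["email_notifications"] = notifications
    return updated


if __name__ == "__main__":
    job = {"name": "store_sales_ingestion", "tasks": []}
    print(with_failure_alerts(job, ["data-team@example.com"]))
```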
Meanwhile, building the new platform with better CI/CD processes, code restructuring and an improved infrastructure reduced deployment time. A full deployment on Ahold Delhaize’s previous platform took approximately 1.5 hours. On the new platform, an entirely new deployment takes about 20 minutes. In addition, cluster reuse has resulted in cost savings of more than 50%. This is on top of the significant savings the company had already achieved by using job clusters (instead of interactive clusters) and multitask jobs on a continuously running cluster. “All the cost savings we see from using job clusters and cluster reuse are passed down to our customers,” says Van de Grift. At the same time, using ML and AI for things like personalization and supply prediction, and BI to understand how products and sales are performing, leads to improved data-driven decision-making and further optimization across the organization.
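Cluster reuse in a multitask job generally means that tasks point at one shared job cluster instead of each starting its own. A rough sketch of that change to a job payload follows, assuming the Databricks Jobs API shape in which `job_clusters` declares a cluster under a `job_cluster_key` and each task references that key. The helper name, cluster key, and the cluster settings in the example are placeholders, not Ahold Delhaize’s configuration.

```python
def share_job_cluster(job_spec: dict, cluster_key: str, cluster_config: dict) -> dict:
    """Return a copy of job_spec in which every task runs on one shared job cluster."""
    tasks = []
    for task in job_spec["tasks"]:
        # Drop any per-task compute settings so the shared cluster is used instead.
        task = {k: v for k, v in task.items() if k not in ("new_cluster", "existing_cluster_id")}
        task["job_cluster_key"] = cluster_key
        tasks.append(task)
    return {
        **job_spec,
        "job_clusters": [{"job_cluster_key": cluster_key, "new_cluster": cluster_config}],
        "tasks": tasks,
    }


if __name__ == "__main__":
    # Placeholder cluster settings; real values depend on the workspace and cloud.
    placeholder_cluster = {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    }
    job = {
        "name": "store_sales_ingestion",
        "tasks": [
            {"task_key": "ingest", "new_cluster": placeholder_cluster},
            {"task_key": "clean", "new_cluster": placeholder_cluster},
        ],
    }
    print(share_job_cluster(job, "shared_cluster", placeholder_cluster))
```

With a shared cluster, later tasks avoid a separate cluster startup, which is one plausible source of the savings described above.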
Overall, Ahold Delhaize has been a front-runner when it comes to using new Databricks features and technologies, and that will continue, according to Van der Scheun. Going forward, she says Ahold Delhaize plans to build user and data access management on top of Unity Catalog. “Our current user access management is something we are looking to improve. In our platform, we had difficulty finding the right technology to help us build it,” she says. “Now that Unity Catalog is available, we are going to use it for user access management, and it will make our process much better and simpler.”