PRODUCT SPOTLIGHT: Spark Declarative Pipelines

Carvana turns to Databricks Spark Declarative Pipelines and streaming to uplevel CX

$500k saved in data warehousing costs


Carvana is an online used car retailer based in Arizona. Founded in 2012, the company’s mission is to change how people buy and sell cars by offering an intuitive and convenient online car buying, selling and financing experience. Data is key to helping Carvana achieve that mission. Carvana developed its Next Generation Communications Platform (NGCP) to help car buyers and sellers enjoy a seamless car shopping experience. NGCP engineers and product teams built the data platform from the ground up by researching and prototyping new technologies and working as a team to deploy new features and services to production. However, the team faced scale and data quality challenges as well as high data warehouse costs. Moving to the Databricks Data Intelligence Platform, which includes Spark Declarative Pipelines and Databricks SQL Serverless, enabled the NGCP team to overcome those challenges, lower costs, and achieve near real-time streaming and analytics.

Data quality, availability and cost impact ability to deliver value

Technology has changed the way people buy and sell cars. Carvana’s Next Generation Communications Platform (NGCP) team specializes in customer communications and AI, providing a platform where potential customers interact with Carvana via chatbot, human advocates and emails. The team uses AI for initial interactions and trains advanced natural language processing (NLP) models to continuously improve customer experience and speed to answer. At the time of writing, the team consists of five data and analytics engineers. The largest data set they manage has about 3 billion records and is still growing.

The team initially streamed its conversation and AI data into Google BigQuery, but that created several challenges. First, it limited how data engineers could partition and optimize query tables. Data quality was another challenge. Engineers needed to deduplicate records in the pipeline, but distinct calls on large data frames were slow and forced recomputation over the entire data set. Engineers often became aware of such issues only when they surfaced much later in downstream analytics dashboards. The team wanted to identify data problems earlier in the pipeline and define rules to handle errors.

The NGCP team also faced data availability challenges. There was no process to automatically pick up experiment data as campaigns were configured and run and as new data was generated. Data availability depended solely on data engineers writing and releasing Spark jobs. Maintenance and transparency were another challenge, as a single repo contained both the ETL and business logic. Finally, the data sets produced often contained too many files to be shipped to data warehouses via the Spark Connector, creating a data export bottleneck.

Carvana needed a cost-effective solution that would allow it to avoid reprocessing of historical data and use Spark Structured Streaming to intelligently and incrementally ingest data. It also needed a way to separate data ingestion from business logic, automate data lineage, and incrementally or selectively write refined analytics events to a data warehouse for consumption.
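The difference between a full distinct over all history and incremental deduplication can be illustrated with a minimal sketch. This is plain Python, not Spark or Databricks code, and it is not Carvana's implementation; the class name, the `message_id` key, and the sample records are hypothetical and chosen only for illustration. The idea is that the pipeline remembers which keys it has already emitted, so each new batch is checked against that state instead of the whole data set being recomputed.

```python
from typing import Dict, Iterable, List, Set

Record = Dict[str, str]


class IncrementalDeduper:
    """Keeps the set of keys already emitted so each batch is scanned only once."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field          # hypothetical unique-key column
        self.seen_keys: Set[str] = set()    # state carried across batches

    def process_batch(self, batch: Iterable[Record]) -> List[Record]:
        """Return only records whose key has not been seen in any earlier batch."""
        fresh: List[Record] = []
        for record in batch:
            key = record[self.key_field]
            if key in self.seen_keys:
                continue                    # duplicate: skip without touching history
            self.seen_keys.add(key)
            fresh.append(record)
        return fresh


deduper = IncrementalDeduper(key_field="message_id")
first = deduper.process_batch(
    [{"message_id": "m1"}, {"message_id": "m2"}, {"message_id": "m1"}]
)
second = deduper.process_batch([{"message_id": "m2"}, {"message_id": "m3"}])
```

In this sketch the second batch is compared only against the remembered keys, so the cost of each run tracks the size of the new data rather than the size of the historical table. A production streaming engine would also need to bound and persist that state, which the sketch leaves out.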

Building scalable and testable comms data pipelines

By moving to the Databricks Data Intelligence Platform, Carvana’s NGCP team can now manage the complex dependencies and scale of its data pipelines. According to Christina Taylor, Special Projects Team Lead, the Databricks Data Intelligence Platform is faster than Amazon Elastic MapReduce (Amazon EMR) and makes it easier for her team to collaborate. “There is no added value a separate data warehouse could provide that the Databricks lakehouse couldn’t,” says Taylor. “As a developer coming from EMR, Databricks was a massive upgrade. It was easy to access Spark UI to debug the bottlenecks and share notebooks. I didn’t have to write so much code to set up Spark clusters or deploy jobs. It democratized Spark Structured Streaming and made it much more accessible for everyone.”

Taylor’s team now uses Spark Declarative Pipelines as a single entry point for streaming and batch jobs, dependency orchestration, data quality, and error handling, enabling them to build scalable and testable pipelines under a medallion architecture with simple, declarative syntax. “You just focus on new data coming in and Auto Loader takes care of that for you, preventing reprocessing of old data,” says Taylor. “It’s much more efficient.”
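The idea of processing only newly arrived files can be sketched in a few lines. The following is a conceptual illustration in plain Python, not the Auto Loader API and not a description of how Auto Loader is implemented; the `FileCheckpoint` class and the file names are hypothetical. It assumes that a record of already-processed files is kept between runs, so each run picks up only what is new.

```python
from typing import Iterable, List, Set


class FileCheckpoint:
    """Remembers which input files were already handed to the pipeline."""

    def __init__(self) -> None:
        # In a real system this state would be persisted; here it lives in memory.
        self.processed: Set[str] = set()

    def new_files(self, listing: Iterable[str]) -> List[str]:
        """Return files from the listing not seen before, and mark them processed."""
        fresh = sorted(path for path in set(listing) if path not in self.processed)
        self.processed.update(fresh)
        return fresh


checkpoint = FileCheckpoint()
run_1 = checkpoint.new_files(["events-001.json", "events-002.json"])
# The second listing repeats the old files and adds one new file.
run_2 = checkpoint.new_files(
    ["events-001.json", "events-002.json", "events-003.json"]
)
```

Here the second run returns only the one file that arrived after the first run, which is the behavior Taylor describes as preventing reprocessing of old data.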

Overall, using Spark Declarative Pipelines helps Carvana’s NGCP team test and develop fast. “All of the things we build are testable, so we have scalability, data quality and lineage in one interface. We really like that,” says Taylor.

Spark Declarative Pipelines also provided several technical improvements. “With a traditional multitask job, you don’t see data lineage in the same way you do with Spark Declarative Pipelines, which is one of our favorite features,” adds Taylor. “The data quality check is another favorite feature. You don’t have to deploy a separate Great Expectations stack on Airflow to do your quality check. If an event is delayed, we get real-time alerts.”
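To make the idea of in-pipeline quality rules concrete, here is a minimal sketch of rule-based checks that drop invalid records and count violations per rule. It is plain Python written for illustration, not the Spark Declarative Pipelines expectations syntax and not Carvana's rules; the rule names, the `conversation_id` and `latency_ms` fields, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules: each maps a descriptive name to a predicate on one record.
RULES: Dict[str, Rule] = {
    "has_conversation_id": lambda r: bool(r.get("conversation_id")),
    "non_negative_latency": lambda r: isinstance(r.get("latency_ms"), (int, float))
    and r["latency_ms"] >= 0,
}


def apply_rules(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], Dict[str, int]]:
    """Keep records that pass every rule; count failures per rule for monitoring."""
    passed: List[Record] = []
    violations: Dict[str, int] = {name: 0 for name in rules}
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        for name in failed:
            violations[name] += 1
        if not failed:
            passed.append(record)
    return passed, violations


sample = [
    {"conversation_id": "c1", "latency_ms": 120},
    {"conversation_id": "", "latency_ms": 80},      # fails has_conversation_id
    {"conversation_id": "c3", "latency_ms": -5},    # fails non_negative_latency
]
clean, counts = apply_rules(sample, RULES)
```

Declaring rules next to the table they guard means bad records are caught at ingestion, and the per-rule counts give a signal that can feed alerts, rather than problems surfacing later in a dashboard.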

At the data warehouse level, Carvana uses Databricks SQL Serverless and Delta Lake, which improve speed for real-time analytics use cases where Carvana data engineers need accurate data with very low latency. “Going serverless certainly upped my usage,” says Cammeron Linneen, Senior Analyst at Carvana. “The serverless warehouse is a big Databricks SQL selling point because you don’t have to manage the compute behind it. Databricks was a massive upgrade as soon as I started working with it.”

Lakehouse architecture empowers data scientists to do more to improve CX

Reducing full table scans and data extraction from BigQuery has saved Carvana approximately $500,000 a year in data warehouse costs. Meanwhile, the lakehouse provides a foundation for Carvana’s data scientists and analysts to store and unify near real-time data in one place, while Databricks SQL allows the team to tap into the freshest and latest data directly from the lake. “With all NGCP data now in Delta Lake, I can perform an ad hoc deep dive analysis, run a regression model, or run any arbitrary code I want on that data directly,” says Linneen. “If I’m working on building out analytics tooling, or even running jobs and ensuring I’m getting the output I expect, it’s just nice. I don’t have to wait for a cluster to spin up because I wrote something to Delta Lake. I can just go to the serverless warehouse and run a quick query to check things out.”

Ultimately, moving to Databricks allows the Carvana Communications team to work more effectively and create more accessible data sets.

“There’s a lot of potential for us to empower data scientists here at Carvana and to grow a team of highly competent analytics engineers and data analysts who want a career beyond SQL,” says Taylor. “Databricks tools can help get us there.”
