Powering the AI revolution


45 minutes to start ingesting streaming data


3x faster development on Databricks vs. previous tech stack


95% reduction in time spent by the data team on streaming data ingestion

CLOUD: Google Cloud

“I love Delta Live Tables because it goes beyond the capabilities of Auto Loader to make it even easier to read files. My jaw dropped when we were able to set up a streaming pipeline in 45 minutes.”

— Kahveh Saramout, Senior Data Engineer, Labelbox

Labelbox believes artificial intelligence (AI) can enhance every aspect of our lives — and strives to empower the quick and efficient development of transformative technologies. The company’s mission is to build the best products that align with AI. When Labelbox needed to build a consumption-based billing system, the company didn’t want data ingestion to be a bottleneck. That’s why Labelbox implemented the Databricks Data Intelligence Platform. With Databricks, Labelbox ingests streaming data in real time. The company follows the principles of the medallion architecture to clean up data for their Silver and Gold layer tables, using automation within the Databricks platform to avoid moving data manually. As Labelbox adds functionality to their platform, the company will rely on Databricks to deliver the data ingestion and aggregation power they need to keep innovating.

Consumption-based billing system taxes capabilities of lean data team

AI enables computers to draw insights from words, images and movement — just as humans do. When computers use AI to process vast amounts of data and deliver valuable information to us, they can power breakthroughs in healthcare, agriculture, robotics and more. Labelbox is at the heart of this AI revolution. The company provides a customizable platform that helps AI teams label the vast amounts of data they need to build world-class machine learning (ML) models.

AI teams use the Labelbox platform to save time and effort for their data scientists and engineers. Labelbox introduced Labelbox Units (LBUs) to offer a flexible pricing system for both pay-as-you-go and contract-based clients, with costs decreasing as platform usage escalates. Before the company could adopt this billing model, they needed to build a consumption-based billing system. And it needed to be maintainable by Labelbox’s lean data team.

“We considered building our billing system by creating models in our cloud data warehouse using dbt,” recalled Kahveh Saramout, Senior Data Engineer at Labelbox. “But then we would have needed another tool, to orchestrate those jobs. And that would mean setting up and maintaining our own clusters. This setup would have given us a lot of tools to configure, tie together and manage for the long term. That would have been beyond the bandwidth of our data team — it’s a responsibility we would have needed to push to our engineers.”

Delta Live Tables reduces development time by 95% for the data team

Saramout and his team didn’t have to look far for a platform that would offer more than the outstanding data warehousing capabilities available. Because Labelbox is a partner of Databricks, it made sense for the two companies to work together on the billing solution.

“We discovered that the Databricks Data Intelligence Platform is much more than a data warehouse — it speaks to the soul of data engineering,” said Saramout. “It allows us to not only process and ingest data, but also start up and manage jobs and orchestrate them easily on one platform. Considering we have a lean data team, Databricks offered us the clearest path to achieving value.”

While Labelbox’s previous tech stack ingested batch data well, it wasn’t designed to ingest the streaming data the company would be working with to calculate billing. Saramout and his team use Delta Live Tables to ingest data from the Labelbox application in real time.

“If we had tried setting up streaming data ingestion on our old tech stack, we probably would have used a range of cloud provider products,” Saramout explained. “That would mean learning those products and creating a proof of concept before diving in. It probably would have taken me at least a week to get something working — and if jobs failed, I would have had to write scripts to restart the application. With Delta Live Tables, we just pointed the solution at a bucket of data and were up and running in 45 minutes.”

Labelbox’s billing solution calculates usage within Databricks and pushes the data to a Stripe billing system. The solution also displays usage data in a dashboard within the Labelbox application. All billing data comes directly from a Databricks SQL warehouse, bypassing the production database to prevent any drag on performance.

“Our product team built an API to access data from the SQL warehouse with no help from me,” said Saramout. “All I did was add users to our Databricks workspace. From there, Databricks made it incredibly easy, providing example code for running queries and a button for them to generate tokens. They had the API built within a couple days.”
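As a rough illustration of the kind of query such an API might run, the sketch below reads per-product usage for one customer through a standard Python DB-API cursor. The Databricks SQL Connector for Python exposes a DB-API 2.0 interface, so the same function shape should carry over, but everything specific here is an assumption: the table name usage_gold, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the SQL warehouse so the example runs anywhere. Parameter placeholder syntax can differ between drivers and connector versions, so check the connector documentation before reusing the query as written.

```python
import sqlite3


def fetch_usage(connection, customer_id):
    """Return (product, total_lbu) rows for one customer via a DB-API cursor."""
    cursor = connection.cursor()
    try:
        cursor.execute(
            "SELECT product, SUM(lbu) AS total_lbu "
            "FROM usage_gold WHERE customer_id = :customer_id "
            "GROUP BY product ORDER BY product",
            {"customer_id": customer_id},
        )
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in for the SQL warehouse (assumption): an in-memory SQLite database
# holding a hypothetical Gold table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_gold (customer_id TEXT, product TEXT, lbu REAL)")
conn.executemany(
    "INSERT INTO usage_gold VALUES (?, ?, ?)",
    [
        ("c1", "annotate", 1.5),
        ("c1", "annotate", 2.5),
        ("c1", "catalog", 1.0),
        ("c2", "annotate", 7.0),
    ],
)

usage_for_c1 = fetch_usage(conn, "c1")
```

Against a real warehouse, the connection would instead come from the connector using the workspace hostname, the warehouse HTTP path and a personal access token like the ones the product team generated.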

Following the principles of the medallion architecture, Labelbox builds their Silver layer tables by duplicating data from Delta Live Tables and cleaning it in micro-batches. Saramout set up a continuous job that uses all-purpose compute to find the latest records in Delta Live Tables and merge them into the Silver layer table.
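The merge step described above can be sketched in plain Python. This is a minimal illustration of the micro-batch pattern, not Labelbox's actual job: the UsageEvent fields, the event_id merge key, the watermark approach and the cleaning rule are all hypothetical, and in-memory collections stand in for the Delta tables.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageEvent:
    event_id: str          # hypothetical unique key used for the merge
    customer_id: str
    lbu: float             # Labelbox Units consumed by this event
    ingested_at: datetime  # when the raw record arrived


def merge_micro_batch(silver, bronze, watermark):
    """Upsert raw rows newer than `watermark` into `silver`, keyed by event_id.

    Returns the new watermark: the latest ingested_at seen in this batch,
    or the old watermark if the batch was empty.
    """
    new_rows = [row for row in bronze if row.ingested_at > watermark]
    for row in sorted(new_rows, key=lambda r: r.ingested_at):
        if row.lbu < 0:
            # Example cleaning rule (assumption): skip invalid usage values.
            continue
        # Insert a new key, or overwrite an older version of the same event.
        silver[row.event_id] = row
    return max((r.ingested_at for r in new_rows), default=watermark)


def at(minute):
    """Helper for building sample timestamps."""
    return datetime(2024, 1, 1, 0, minute, tzinfo=timezone.utc)


raw_events = [
    UsageEvent("e1", "c1", 2.0, at(1)),
    UsageEvent("e2", "c1", -1.0, at(2)),  # dropped by the cleaning rule
    UsageEvent("e1", "c1", 3.0, at(3)),   # later version of e1 wins
]
silver_table = {}
start = datetime.min.replace(tzinfo=timezone.utc)
latest = merge_micro_batch(silver_table, raw_events, start)
```

Running the function again with the returned watermark picks up only rows that arrived since the previous batch, which is what lets a continuously running job repeat the step without reprocessing old data.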

“This is another example of how much time and effort we’re saving with Databricks,” Saramout reported. “If we were to run this job using Airflow, I would have had to schedule it and would have been limited to one run per minute. I also would have needed to build logic to check if there was a job already running. Instead of all that, I just turned on Continuous Mode in the Databricks platform.”

Labelbox’s Gold layer tables are production tables that show how many LBUs each customer has used in each product. The company uses these tables to update their billing system, update entitlements in contracts and display usage data to customers within the Labelbox application. The Databricks SQL Connector makes it easy for the engineering team to access this data for dashboards, and Saramout can make sure all systems have the latest, cleanest usage data without moving data from one system to another.
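The aggregation behind a usage table of this kind can be sketched as a grouped sum over cleaned rows. The example below is illustrative only: the row fields (customer_id, product, lbu) and the sample values are hypothetical, and a list of dictionaries stands in for the Silver layer table.

```python
from collections import defaultdict


def lbu_by_customer_and_product(silver_rows):
    """Sum LBU usage for each (customer_id, product) pair."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[(row["customer_id"], row["product"])] += row["lbu"]
    return dict(totals)


# Made-up cleaned rows standing in for the Silver layer.
sample_rows = [
    {"customer_id": "c1", "product": "annotate", "lbu": 1.5},
    {"customer_id": "c1", "product": "annotate", "lbu": 2.5},
    {"customer_id": "c1", "product": "catalog", "lbu": 1.0},
    {"customer_id": "c2", "product": "annotate", "lbu": 7.0},
]

gold_usage = lbu_by_customer_and_product(sample_rows)
```

Each key in the result corresponds to one row of the usage table: one customer, one product and the total LBUs consumed.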

“Moving data isn’t technically difficult, but it’s a huge pain,” Saramout explained. “It adds needless complexity and adds on days of development time. And our data warehouse billed us based on the data we scanned, so every time we moved data into the platform, our costs went up. With Databricks, we can update data as often as we need to without paying extra or overworking our very lean teams.”

Databricks Data Intelligence Platform speeds innovation for growing AI firm

For a fast-growing AI firm like Labelbox, it’s essential to get new capabilities up and running quickly and cost-effectively. Saramout has no regrets about launching Labelbox’s billing solution on Databricks rather than on the company’s previous tech stack.

“It’s no exaggeration to say it would have taken us three times as long to build this app on our old stack, due to all the DevOps work it would have entailed,” reported Saramout. “And in the end, unless we had significantly optimized that solution, it wouldn’t have ended up being much cheaper. Plus, the payback would have taken a long time.”

Labelbox continues to innovate on the Databricks platform. Their Model Assisted Labeling feature allows customers to train a model to label data for them, reducing the amount of manual work. Labelbox has introduced the Automation Efficiency Score to help users better understand the effectiveness of their Model Assisted Labeling usage in terms of value and performance. The company makes these calculations within Databricks.

Labelbox also plans to release an enhanced version of the Performance Dashboard that gives customers an overview of metrics such as labels created, data rows uploaded and average time to label. The company will use Delta Live Tables to ingest production data in real time.

“We work in a space where there’s a ton of big data,” Saramout concluded. “The Databricks platform and Apache Spark™ enable much more than just routine business intelligence and data analysis. As we develop new products and services that run on Databricks, our engineering team keeps finding new use cases and deriving new value from the platform. We see Databricks playing a huge part in our successful growth for years to come.”