Kythera Labs builds its business on Delta Live Tables and the Databricks Data Intelligence Platform

3+ petabytes of raw healthcare data efficiently stored and processed into actionable insights


Reduction in query pipeline processing time


Faster time to market


“Before DLT (on Databricks), we could design, test and run a query pipeline to find a disease cohort in 2 days. With DLT, we can do it in 2 to 4 minutes. It can take about 2 weeks for our competitors that don’t use Databricks to control the complexity of healthcare data.”

– Jeff McDonald, CEO/Co-founder, Kythera Labs

More healthcare data is available today than ever before. In 2019, the founders of Kythera Labs recognized that the sheer volume of structured and unstructured healthcare data available offered substantial opportunities to develop breakthrough clinical and commercial solutions. Yet healthcare and life sciences organizations often struggle to extract actionable insights and value from the massive and continuously increasing amount of complex data available. Kythera Labs is on a mission to change that. The organization is using the Databricks Data Intelligence Platform, including Delta Live Tables, Delta Sharing and Unity Catalog, to deliver enormous remastered healthcare data sets and analysis-ready derived data products that lead the healthcare industry to more precise insights with higher confidence.

Big data storage and processing challenges restrict long-term vision

Kythera Labs integrates its data science, statistical modeling and advanced methodologies into a healthcare-focused big data platform. By handling the “heavy lifting” — sourcing, collecting, organizing, standardizing, improving and transforming data — Kythera provides its healthcare and life sciences clients with data ready for analysis. That allows its clients to make sense of some of the messiest data in the world and uncover insights that can improve patient outcomes — “a perfect big data problem,” says Jeff McDonald, CEO and Co-founder at Kythera Labs.

Achieving its mission means Kythera must manage an enormous data pipeline. The company collects and manages data from about 2 billion medical transactions and approximately 3 billion prescription transactions annually.

When Kythera was founded in 2019, the company struggled to store and process its big healthcare data efficiently and cost-effectively. To realize its vision to help drive improvements in the healthcare ecosystem, Kythera had to solve its big data pipeline challenges. That’s where Databricks came in.

Transforming historical healthcare data sets into insights

With the Databricks Data Intelligence Platform as the central hub for all its remastered data and data-derived products, Kythera Labs can now deliver terabytes of remastered healthcare data and derived data products to its healthcare and life sciences customers instantly. “As soon as we started leveraging Databricks, everything changed,” says McDonald.

Wayfinder, Kythera’s proprietary Databricks Data Intelligence Platform OEM offering, unifies healthcare data, analytics and AI workloads into one ecosystem and makes it possible to identify actionable insights faster. “When you have a huge data set representing a few hundred million patients transiting through healthcare in the United States, few recognize the transformations required just to unify it, never mind harmonizing it into something usable,” says McDonald. “It’s just data prior to us touching it, and it’s pretty messy data. The lakehouse and Wayfinder enable our clients to turn that data into insights without taking on the effort required to engineer and manage the data themselves.” McDonald adds that Delta Sharing makes it easy to continuously improve source data while keeping storage costs low, because live data can be shared without copying or replication.

Wayfinder now delivers access to over 45 terabytes of de-identified, remastered claims data and meets the unique data and processing needs of Kythera’s healthcare and life sciences customers at scale so they can stop preparing and managing data and start taking action on it.

Streaming is easy on the Databricks Data Intelligence Platform because of expandable storage and processing. Kythera uses Delta Live Tables (DLT) to automate and simplify the orchestration of ETL pipelines, which helps deliver better insights because the data is more reliable and higher quality. Kythera primarily uses DLT for its patient event assets — derived data products that deliver analysis-ready encounter-level data. One way the healthcare and life sciences industries use this data is in identifying best-fit providers for rare disease patients — an audience for whom query requirements can be incredibly complex. DLT keeps data fresh and reduces its complexity. “To build a cohort of 5,000 or 10,000 rare disease patients, we have to sift through about 700 billion records,” says McDonald. “We’ve completely stripped away the complexity with DLT. Using DLT, our clients can interrogate that massive data set and not have to worry about all the updates that occur because they are happening as the pipelines run for the core products.”
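The cohort-building step described above can be illustrated with a minimal plain-Python sketch. It is not DLT code and not Kythera's pipeline: the `Encounter` fields, the `drop_invalid` quality rule, the `build_cohort` helper and the sample diagnosis codes are all hypothetical, chosen only for illustration. In DLT, comparable quality rules are declared as expectations on a table definition; the version here is just an analogy that runs anywhere.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Encounter:
    """One encounter-level record (all field names are hypothetical)."""
    patient_id: str      # de-identified patient token
    diagnosis_code: str  # diagnosis code, illustrative values only
    service_date: str    # ISO date string; empty when missing


def drop_invalid(records: Iterable[Encounter]) -> Iterator[Encounter]:
    """Quality rule: keep only records with a patient id and a service date."""
    for record in records:
        if record.patient_id and record.service_date:
            yield record


def build_cohort(records: Iterable[Encounter], codes: Set[str]) -> Set[str]:
    """Return the distinct patients with at least one valid matching encounter."""
    return {
        record.patient_id
        for record in drop_invalid(records)
        if record.diagnosis_code in codes
    }


sample = [
    Encounter("p1", "E75.22", "2023-01-05"),
    Encounter("p2", "J45.909", "2023-02-11"),   # valid, but a different code
    Encounter("p3", "E75.22", ""),              # dropped: missing service date
    Encounter("p1", "E75.22", "2023-03-09"),    # same patient, counted once
    Encounter("", "E75.22", "2023-04-01"),      # dropped: missing patient id
]

cohort = build_cohort(sample, {"E75.22"})
print(sorted(cohort))
```

Because the quality rule runs before the cohort filter, a query over the output never has to account for malformed records, which is the kind of complexity the paragraph above says is removed for clients.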

Additionally, Kythera uses Unity Catalog to simplify its ability to share data with others securely. “Unity Catalog has significantly cut down our storage and processing spend while assuring clients have access to only the views of data they purchased,” says McDonald.

Improving data quality and speed while lowering costs

Today, Kythera Labs uses Databricks to refine, remaster and build its high-value data assets more cost-effectively than it could with its own infrastructure and to deliver those assets to its clients through Wayfinder. Streaming and processing big data in a single lakehouse has simplified permissions and lowered ETL needs, which translates to speed. “We have gained speed in a variety of ways, primarily by being able to set up more jobs in parallel with the scalable infrastructure,” says Matt Ryan, Co-founder and Director of Engineering at Kythera Labs.

As a result, Kythera can now easily manage more than 45 terabytes of remastered healthcare data.

For Kythera and its clients, the use of DLT translates into time and cost savings. “Before DLT (on Databricks), we could design, test and run a query pipeline in 2 days. With DLT, we can do it in 2 to 4 minutes. It can take about 2 weeks for our competitors that don’t use Databricks,” says McDonald.
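As a rough check on the figures McDonald quotes, the arithmetic can be sketched as follows. It assumes that "2 days" and "2 weeks" mean continuous calendar time rather than working hours, which the source does not specify, so the results are an order-of-magnitude illustration only.

```python
MINUTES_PER_DAY = 24 * 60

# Figures from the quote, converted to minutes (calendar time is an assumption).
before_min = 2 * MINUTES_PER_DAY        # "2 days" before DLT
after_fast_min, after_slow_min = 2, 4   # "2 to 4 minutes" with DLT
competitor_min = 14 * MINUTES_PER_DAY   # "about 2 weeks" for competitors


def speedup(old_minutes: float, new_minutes: float) -> float:
    """How many times faster the new duration is than the old one."""
    return old_minutes / new_minutes


def reduction_pct(old_minutes: float, new_minutes: float) -> float:
    """Percentage of the old duration that was eliminated."""
    return 100 * (1 - new_minutes / old_minutes)


speedup_low = speedup(before_min, after_slow_min)
speedup_high = speedup(before_min, after_fast_min)
reduction_low = reduction_pct(before_min, after_slow_min)
reduction_high = reduction_pct(before_min, after_fast_min)

print(f"Speedup vs. before DLT: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"Time reduction: {reduction_low:.2f}% to {reduction_high:.2f}%")
print(
    "Speedup vs. competitors: "
    f"{speedup(competitor_min, after_slow_min):.0f}x to "
    f"{speedup(competitor_min, after_fast_min):.0f}x"
)
```

Under that assumption the quoted change works out to roughly a 720x to 1,440x speedup, or a reduction of more than 99.8% in pipeline turnaround time.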

Finally, the lakehouse architecture unifies data so Kythera can use it for both analysis and machine learning while reducing the risk of data egress and helping customers lower costs, especially compared to Snowflake. “Customers tell us their Snowflake costs are too high, almost without exception. And we know from experience,” says McDonald. “We tried this with Snowflake; the ETL and egress costs were nearly 5x what we spend with the Databricks Data Intelligence Platform. When our customers want to deconstruct the geographic distribution of 10 million cancer patients, the cost adds up quickly if your data isn’t ready for analysis. When we start someone in a de-normalized model [on Databricks], they instantly see the data they want without having to be an engineer. The data is prepped and ready for analysis.”

Overall, simplified architecture, reduced engineering requirements and automated cluster management have enabled Kythera to reduce IT operational costs. “To develop this same infrastructure without Databricks would be extremely expensive and time-consuming, taking focus away from building our business,” says McDonald. “Databricks is the only way we make our business grow. We’ve tried Snowflake, we’ve tried others, and we’re sticking with Databricks.”

While it’s difficult to put a number on how Databricks has improved Kythera’s business, McDonald says it’s improved by “orders of magnitude” when it comes to opportunity. “We’ve won at least three new clients because of how we’re able to fast-track their analysis with data they can use immediately,” he says. “We just signed a large pharma company that told us use of our products will save them 2 years in development.”

In the future, Kythera Labs also plans to use Delta Sharing to support sales with fully accessible “preview” data sets for engaged prospects and to integrate remastered claims data with client-sourced data. This “bring your own data” paradigm will enable the analysis of more accurate, representative data, while Delta Lake’s optimized file partitioning keeps storage efficient and makes Kythera’s pipelines more efficient.

“No one wants to buy healthcare claims data. They want answers,” says McDonald. “Our customers are trying to determine if they should invest in developing a new treatment for a rare disease, expand their service lines, or open a new location. In the lakehouse, we accelerate their ability to access data and discover the answers they need.”