
This blog was written in collaboration with Sukh Sekhon, Software Engineer, Cloud Infrastructure and Helen Li, Sr. Director of Engineering at Exai Bio.


Exai Bio is a next-generation liquid biopsy company with a mission to enable a world where cancer can be detected early, diagnosed accurately, treated in a personalized and targeted way, and ultimately cured. In this blog post, we describe how our engineering team at Exai Bio is leveraging Databricks to bring software engineering best practices to life sciences research and development (R&D).

Highly Differentiated Approach to Liquid Biopsy

Exai’s platform uses RNA sequencing to identify a novel category of cancer-associated, small non-coding RNAs, termed orphan non-coding RNAs (oncRNAs). This technology is based on research first published by Dr. Hani Goodarzi’s lab at the University of California, San Francisco, in Nature Medicine in 2018. oncRNAs are actively secreted from living cells and are stable and abundant in the blood of cancer patients, making them a novel type of cancer biomarker that is accessible through a standard blood draw. In the 18 months since its founding, Exai Bio has performed analyses spanning 12 cancers and over 10,000 subjects, building one of the largest collections of smRNA sequencing datasets and oncRNA profiles in cancer and general populations.

Challenges in Life Sciences R&D

What Life Sciences research is and why it is important

Life Sciences R&D refers to the systematic investigation and development efforts with the aim of producing new or significantly improved products, processes, or knowledge to ultimately benefit patients. Exai is focused on developing tests that will assist with early detection and actionable insights into cancer.

How this is an engineering problem

The fast-paced and evolving nature of research pushes researchers to adopt or develop new tools and techniques; unfortunately, this often leaves performance testing, vulnerability scanning, and code provenance as afterthoughts. Some researchers are familiar with running commands like awk and parallel on sequencing data stored on their local computer; others are familiar with submitting jobs to high-performance computing (HPC) machines operated by an academic institution. Some researchers have scripts that run steps serially, while others leverage workflow orchestration frameworks. Software engineers are challenged not only to create a standardized environment for researchers to use, but to make it flexible enough to cater to these diverse backgrounds.

How some solutions leave something to be desired

Researchers are increasingly adopting infrastructure tools. However, relying on these tools alone leaves certain gaps. For smaller companies, addressing these gaps often means resorting to makeshift solutions or 'duct taping', which requires long-term investment and lacks future adaptability. As a startup, we are able to innovate more quickly using Databricks’ robust set of features, while having the flexibility to bring our own tools as needed.

Accelerated R&D with Databricks

Exai's founding team, which includes pioneers in genomics and oncology, knows well the engineering challenges in life sciences research. On day 1, our engineering team set out to discover a solution allowing for reproducible research, accelerated science, and a secure data platform. We found a compelling offering in Databricks. As we continue our journey, we are consistently impressed by the ongoing improvements and innovations that Databricks brings to our data-driven workflows.

1. Reproducible research

In our pursuit of reproducibility, Databricks Container Services and Repos have been invaluable features.

Through Databricks Container Services, we are able to control software dependencies and runtime requirements by standardizing the compute environments researchers depend on. As containerization became increasingly valuable for researchers, we established a streamlined CI/CD pipeline to efficiently address the increasing demand for custom Docker images. This pipeline significantly increased our engineering team's ability to rapidly create, test, and deploy these images.

Through Databricks Repos, we brought git-based workflows to data analysis, allowing notebooks to be code reviewed. Our researchers embraced this code review process, a common software engineering practice. We find that it facilitates collaboration, ensures reproducible analyses, and identifies bugs early.

2. Accelerated science

Databricks allows us to move fast on our research timeline. Since Exai’s founding, we have presented over seven research datasets at top oncology conferences. Databricks's intuitive cluster management lets our researchers harness cloud resources without needing specialized cloud knowledge. Even though compute is readily available to all researchers, we are able to keep our costs within budget with features like cluster termination policies and cost breakdowns.
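To make the idea of a cluster termination policy concrete, here is a minimal sketch in Python. The rule shapes (`range` with `minValue`/`maxValue`/`defaultValue`, and `allowlist` with `values`) follow the Databricks cluster policy definition format, but the specific limits, the node types, and the `check_cluster` helper are assumptions made for illustration. They are not Exai Bio's actual configuration, and the helper is only a simplified local stand-in for the enforcement Databricks performs.

```python
# Hypothetical policy: the limits and node types below are illustrative only.
POLICY = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    "node_type_id": {
        "type": "allowlist",
        "values": ["m5.xlarge", "m5.2xlarge"],
    },
}


def check_cluster(conf, policy):
    """Fill in policy defaults and list every rule the cluster config violates.

    This is a simplified local check for illustration, not Databricks's
    own policy enforcement.
    """
    resolved = dict(conf)
    violations = []
    for attr, rule in policy.items():
        # Apply the policy default when the researcher did not set a value.
        if attr not in resolved and "defaultValue" in rule:
            resolved[attr] = rule["defaultValue"]
        value = resolved.get(attr)
        if value is None:
            continue
        if rule["type"] == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            if not low <= value <= high:
                violations.append(f"{attr}={value} outside [{low}, {high}]")
        elif rule["type"] == "allowlist":
            if value not in rule["values"]:
                violations.append(f"{attr}={value!r} not in allowlist")
    return resolved, violations


ok_conf, ok_violations = check_cluster({"node_type_id": "m5.xlarge"}, POLICY)
bad_conf, bad_violations = check_cluster(
    {"node_type_id": "p4d.24xlarge", "autotermination_minutes": 600}, POLICY
)
print(ok_conf, ok_violations)
print(bad_violations)
```

In this sketch, a researcher who omits the idle timeout gets the policy default, while a request for a long-lived cluster on an unlisted instance type is flagged before any compute is provisioned.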

Having a full suite of Lakehouse capabilities (ETL, visualization, governance) in an integrated environment mitigates data silos. Databricks Workflows allows researchers without experience in workflow orchestration tools to specify job dependencies and parallelize their analyses with ease.
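As a rough illustration of how job dependencies translate into parallelism, the Python sketch below builds a job description in the shape used by the Databricks Jobs API (`task_key`, `depends_on`, `notebook_task`) and groups tasks into waves that could run concurrently. The task names, notebook paths, and both helper functions are hypothetical, and the wave grouping is a simplification: a real scheduler can start a task as soon as its own upstream tasks finish.

```python
import json


def build_job_spec(name, tasks):
    """Build a Jobs-API-style payload from {task_key: (notebook_path, [upstream keys])}."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": key,
                "notebook_task": {"notebook_path": path},
                "depends_on": [{"task_key": up} for up in upstream],
            }
            for key, (path, upstream) in tasks.items()
        ],
    }


def parallel_waves(job_spec):
    """Group tasks into waves; all tasks in a wave have their upstreams already done."""
    remaining = {
        task["task_key"]: {dep["task_key"] for dep in task["depends_on"]}
        for task in job_spec["tasks"]
    }
    waves, done = [], set()
    while remaining:
        ready = sorted(key for key, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown upstream task")
        waves.append(ready)
        done.update(ready)
        for key in ready:
            del remaining[key]
    return waves


# Hypothetical sequencing pipeline; names and paths are placeholders.
TASKS = {
    "trim_reads": ("/Repos/example/pipeline/trim_reads", []),
    "align_reads": ("/Repos/example/pipeline/align_reads", ["trim_reads"]),
    "qc_report": ("/Repos/example/pipeline/qc_report", ["trim_reads"]),
    "quantify": ("/Repos/example/pipeline/quantify", ["align_reads"]),
    "summarize": ("/Repos/example/pipeline/summarize", ["quantify", "qc_report"]),
}

spec = build_job_spec("example-smrna-pipeline", TASKS)
waves = parallel_waves(spec)
print(json.dumps(spec["tasks"][-1], indent=2))
print(waves)
```

Here `align_reads` and `qc_report` share a single upstream task, so they land in the same wave and can run side by side without the researcher writing any orchestration code.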

Given Exai’s mission, it’s critical to have a fast feedback loop in our method development. The ease of provisioning compute and organizing data in Databricks makes this possible.

3. Secure data platform

Exai generates and processes terabytes of sequencing data weekly. We require data security, privacy and confidentiality. Databricks's Security and Trust Center makes it easy for us to satisfy compliance frameworks.

With practical and readily available infrastructure-as-code documentation from Databricks, we are able to manage and scale our infrastructure with ease. Using Databricks’ Terraform provider, we followed the Databricks Security Reference Architecture; in addition, we implemented a centralized network architecture following this guide to channel the flow of traffic through a single network firewall.

Data is only as secure as the code that runs on it. Understanding where code comes from, who has modified it, and how it has evolved is important. We did not want this aspect to be an afterthought at Exai. Databricks allows us to use our own network security controls and make software available through package distribution mechanisms. We therefore have confidence that packages have been vetted before they are used by our researchers.

Below is the architecture diagram that illustrates key building blocks of our software engineering infrastructure built with Databricks running on AWS for our R&D.

[Architecture diagram: Bringing Software Engineering Best Practices to Life Sciences R&D at Exai Bio]

We are excited to share what we have learned building on Databricks. If you have questions, please reach out to our engineering team at [email protected].
