Liquid biopsies unlock noninvasive cancer screening and monitoring by analyzing cancer biomarkers in blood, but the signals can be sparse and noisy. Exai Bio has pioneered AI-driven liquid biopsy using novel small RNA biomarkers. In recent work, Exai-1 and Orion – two new generative AI models for cell-free RNA – achieve breakthroughs in signal denoising and early cancer detection. These advances were made possible by Databricks’ lakehouse architecture and cloud AI infrastructure. By unifying large genomic datasets and providing managed ML tools (MLflow, Workflows, scalable clusters), Databricks enables Exai’s researchers to train large multimodal models on thousands of patient samples. In this joint post, we highlight Exai Bio’s technical breakthroughs and show how Databricks’ lakehouse and MLOps ecosystem accelerate cutting-edge biomedical AI.
Multimodal Foundation Models for Liquid Biopsy
Exai Bio’s latest research introduces large generative models tailored to liquid biopsy data. These models integrate sequence information, molecular abundance, and rich metadata to learn high-quality representations of cancer-associated RNAs.
- Exai-1 (cfRNA Foundation Model): A transformer-based variational autoencoder that unites RNA sequence embeddings with cell-free RNA (cfRNA) abundance profiles. Exai-1 is pretrained on massive datasets – over 306 billion sequence tokens from 13,014 blood samples – learning a biologically meaningful latent structure of cfRNA expression. By leveraging both sequence (via embeddings from the RNA-FM language model) and expression data, Exai-1 “enhances signal fidelity, reduces technical noise, and improves disease detection by generating synthetic cfRNA profiles”. In practice, Exai-1 can denoise sparse cfRNA measurements and even augment datasets: classifiers trained on Exai-1’s reconstructed profiles consistently outperform those trained on raw data. This generative transfer-learning approach effectively creates a foundation model for any cfRNA-based diagnostic task – e.g. using the same pretrained embeddings to detect other cancers or new biomarkers.
- Orion (OncRNA Generative Classifier): A specialized variational autoencoder (VAE) for circulating orphan non-coding RNAs (oncRNAs), which are small RNAs secreted by tumors. Orion has a twin-VAE architecture: it takes as input a count vector of cancer-associated oncRNAs and a vector of control RNAs (e.g. endogenous housekeeping RNAs). Each input feeds a separate encoder; together, their outputs are used to train a robust classifier and to reconstruct the underlying oncRNA distribution. Importantly, Orion’s training includes contrastive and classification losses: a triplet margin loss pulls together samples with the same phenotype (cancer vs. control) and pushes apart samples with different phenotypes, which helps remove batch effects and technical variation. The learned embedding is then used by a downstream classifier to predict cancer presence. On a cohort of 1,050 lung-cancer patients and controls, Orion achieved 94% sensitivity at 87% specificity for non-small cell lung cancer (NSCLC) detection across all stages, outperforming standard methods by ~30% on held-out data. This generative, semi-supervised model automatically denoises cfRNA signals and produces a compact cancer-specific fingerprint, enabling more accurate early detection than previous assays.

Figure 1: Architecture of Exai Bio’s Orion model for liquid biopsy. Image from Karimzadeh et al., Nat Commun.
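To make the contrastive objective concrete, here is a minimal sketch of a triplet margin loss in plain Python. It is not Orion's implementation: the 2-D embeddings, the margin of 1.0, and the function name `triplet_margin_loss` are illustrative assumptions. The loss falls to zero once a sample sits closer to a same-phenotype sample (the positive) than to a different-phenotype sample (the negative) by at least the margin.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on the gap between anchor-positive and anchor-negative distances."""
    d_pos = math.dist(anchor, positive)  # Euclidean distance to same-phenotype sample
    d_neg = math.dist(anchor, negative)  # Euclidean distance to other-phenotype sample
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical embeddings before training: a cancer sample from another batch
# sits farther from the anchor than a control sample does.
cancer_anchor = [0.0, 0.0]
cancer_other_batch = [3.0, 4.0]  # same phenotype, distance 5
control = [1.0, 0.0]             # different phenotype, distance 1
loss_before = triplet_margin_loss(cancer_anchor, cancer_other_batch, control)
# 5 - 1 + 1 = 5.0, so the embedding is penalized

# Hypothetical embeddings after training: the two cancer samples are close
# and the control has been pushed away.
cancer_other_batch_trained = [0.0, 1.0]  # distance 1
control_trained = [3.0, 4.0]             # distance 5
loss_after = triplet_margin_loss(
    cancer_anchor, cancer_other_batch_trained, control_trained
)
# 1 - 5 + 1 = -3, clipped to 0.0

print(loss_before, loss_after)
```

In a real model the same quantity would be computed over mini-batches of learned embeddings and minimized together with the reconstruction and classification terms.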
Together, these models form a scalable AI framework for liquid biopsy. Exai-1 provides a general-purpose cfRNA “language model” that can generate realistic RNA profiles and boost downstream classifiers. Orion applies the same generative approach to the specific problem of lung cancer screening. In both cases, the models generalize across different conditions – Exai-1 “facilitates cross-biofluid translation and assay compatibility” by disentangling true biological signals from confounders. The result is a new generation of AI tools that can mine subtle cfRNA biomarker patterns for early cancer detection and biomarker discovery.
Databricks Data Intelligence and AI Platform: The Enabling Infrastructure
These AI breakthroughs are powered by Databricks’ unified data analytics platform. Key capabilities include:
- Unified Lakehouse (Delta) Storage: We store all metadata (sample information, lab and experiment data) in Databricks Delta tables. This single lakehouse prevents data silos and enables real-time analytics. As the Databricks healthcare solution notes, the lakehouse “brings patient, research, and operational data together at scale” and eliminates legacy silos, making genomic and clinical data instantly queryable. For example, Exai’s 13,000+ blood samples (in serum and plasma) and over 10,000 prior small-RNA-seq datasets are all registered in Delta tables, which can be rapidly filtered and joined for model training.
- Scalable Compute & Clusters: Databricks’ cloud-native clusters let researchers spin up GPU or high-memory instances without deep DevOps effort, which lets the team move fast. Cluster management is intuitive, and features like auto-termination and cost dashboards keep budgets in check. This on-demand scaling enabled optimization and training of Exai-1 and Orion on hundreds of CPU cores and GPUs. Databricks Workflows (formerly Jobs) orchestrates that compute: researchers can launch multi-stage ETL and training pipelines with defined dependencies, parallelizing tasks without writing complex orchestration code.
- MLflow for MLOps: Every experiment run (hyperparameters, datasets, metrics, artifacts) is tracked in MLflow, which is tightly integrated into Databricks. Databricks manages the MLflow environment, including the tracking server, so it is available with no setup. MLflow’s experiment tracking and model registry ensure reproducibility and collaboration. With managed MLflow, logging metrics and artifacts from tens of models made it practical to perform ablation studies and optimize features that improve different aspects of model performance.
- Reproducible Environments: Databricks Container Services and Git-based Repos (with CI/CD) lock down software dependencies for each pipeline. This has been crucial for Exai Bio’s research stack (including custom bioinformatics tools), ensuring that every team member runs models in identical environments. In short, Databricks provides a turnkey MLOps platform: data ingestion with Spark, experiment tracking with MLflow, orchestration with Jobs/Workflows, and elastic compute with auto-scaling.
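As a rough illustration of the experiment tracking described in the MLflow bullet, the sketch below logs one MLflow run per ablation configuration. The feature-group names, the `train_and_evaluate` callable, and the experiment path are hypothetical placeholders rather than Exai Bio's pipeline; `mlflow.set_experiment`, `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metrics` are the standard MLflow tracking calls. The MLflow import is deferred so the configuration helper can be inspected without MLflow installed.

```python
# Hypothetical feature groups for a leave-one-out ablation study.
FEATURE_GROUPS = ("sequence_embedding", "abundance", "metadata")


def ablation_configs(groups=FEATURE_GROUPS):
    """Return the full model plus one variant per dropped feature group."""
    configs = [{"name": "full", "features": list(groups)}]
    for dropped in groups:
        kept = [g for g in groups if g != dropped]
        configs.append({"name": f"no_{dropped}", "features": kept})
    return configs


def log_ablation_runs(train_and_evaluate, experiment="/Shared/cfrna-ablation"):
    """Log one MLflow run per ablation configuration.

    train_and_evaluate is a user-supplied callable (an assumption of this
    sketch) that takes a list of feature names and returns a dict mapping
    metric names to floats. The experiment path is also hypothetical.
    """
    import mlflow  # deferred import; preinstalled on Databricks ML runtimes

    mlflow.set_experiment(experiment)
    for cfg in ablation_configs():
        with mlflow.start_run(run_name=cfg["name"]):
            mlflow.log_params({"features": ",".join(cfg["features"])})
            mlflow.log_metrics(train_and_evaluate(cfg["features"]))


for cfg in ablation_configs():
    print(cfg["name"], cfg["features"])
```

Calling `log_ablation_runs` with a real training function would produce four runs that can be compared side by side in the MLflow experiment UI.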
Impact on Cancer Detection and Biomarker Discovery
The combined scientific and engineering advances have major implications:
- Enhanced Early Detection – By amplifying the cfRNA cancer signal against the background of blood RNA molecules, our AI models can detect cancer at early stages. Exai-1’s denoising yields clearer signals even in small-volume blood samples, while Orion’s generative embedding achieves high sensitivity for lung cancer (94% at 87% specificity across all stages of NSCLC). Such improvements could translate into more reliable screening tests (e.g. annual blood tests) that catch tumors at curable stages.
- New Biomarker Insights – The models learn from raw RNA data, reducing the biases of targeted panels. For instance, Orion identified hundreds of novel oncRNAs from TCGA and tissue data, then validated their importance in blood. Exai-1’s latent space combines RNA sequence, structure, and abundance information, which could highlight previously overlooked biomarkers. Importantly, the transfer-learning paradigm enables us to incorporate new discoveries quickly (e.g., swapping in new sequence tokens) and fine-tune on the unified platform.
- Generative Data Augmentation – Exai-1 can simulate realistic cfRNA profiles by sampling from its decoder. This synthetic data boosts classifier training, as shown by higher AUCs when using Exai-1 reconstructions. In practice, this means rare cancer signatures can be learned more robustly despite limited real samples. In other words, the foundation model mitigates data scarcity – a critical factor since “detecting rare cancers… necessitates foundational models and substantial training data”.
- Scalable Research Collaboration – By building on Databricks, Exai’s multidisciplinary team (biologists, bioinformaticians, biostatisticians, ML scientists, and data engineers) can collaborate seamlessly. Data scientists run PyTorch and Spark side by side; biostatisticians query cohorts with R; biologists log new processed samples, and reports/dashboards refresh automatically. This rapid feedback loop has allowed the Exai team to showcase the applications of their liquid biopsy and AI system in multiple cancer types, resulting in seven conference publications in 18 months. It exemplifies how enterprise-grade AI infrastructure accelerates life-science R&D.
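Headline figures such as "94% sensitivity at 87% specificity" depend on where the decision threshold is set, so it can help to see how such a number is computed. The following is a minimal sketch in plain Python, assuming higher scores mean "more likely cancer"; the scores and the function name `sensitivity_at_specificity` are made up for illustration and are not Exai Bio data or code.

```python
def sensitivity_at_specificity(control_scores, cancer_scores, target_specificity):
    """Return (threshold, specificity, sensitivity) for the lowest threshold
    that reaches the target specificity on the control scores.

    A sample is called positive when its score is strictly above the threshold.
    """
    for threshold in sorted(set(control_scores)):
        # Specificity: fraction of controls correctly called negative.
        specificity = sum(s <= threshold for s in control_scores) / len(control_scores)
        if specificity >= target_specificity:
            # Sensitivity: fraction of cancers correctly called positive.
            sensitivity = sum(s > threshold for s in cancer_scores) / len(cancer_scores)
            return threshold, specificity, sensitivity
    raise ValueError("no threshold reaches the target specificity")


# Hypothetical classifier scores for 10 controls and 8 cancer samples.
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70]
cancers = [0.30, 0.45, 0.55, 0.65, 0.80, 0.90, 0.95, 0.99]

threshold, specificity, sensitivity = sensitivity_at_specificity(
    controls, cancers, target_specificity=0.8
)
print(threshold, specificity, sensitivity)  # 0.4 0.8 0.875
```

In practice the threshold would be fixed on training or validation samples and then applied unchanged to held-out data, so that the reported sensitivity is not tuned on the samples used to measure it.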
Looking Ahead
The collaboration between Exai Bio and Databricks showcases how cutting-edge AI models and modern cloud architecture together push the frontiers of cancer diagnostics. Exai Bio’s foundation and generative AI models (Exai-1 and Orion) demonstrate that deep generative learning can extract powerful signals from liquid biopsies. Underlying these advances is Databricks’ Lakehouse – unifying heterogeneous biomedical data – and its managed ML tools (MLflow, Workflows, Pipelines) that make large-scale experimentation practical and reproducible. Looking ahead, we will continue refining our models and pipelines. Together, Exai Bio and Databricks are laying the groundwork for AI-powered precision oncology that is both scalable and clinically impactful.
Sources: Exai Bio et al., “A multi-modal cfRNA language model for liquid biopsy” (Nature Machine Intelligence, 2025); Exai Bio et al., “Deep generative AI models analyzing circulating orphan non-coding RNAs…” (Nature Communications, 2024); Databricks documentation and blogs.