Drug development is notoriously slow and expensive. The average Research and Development (R&D) lifecycle spans 10-15 years, and a significant portion of candidates fail during clinical trials. A major bottleneck is identifying the right protein targets early in the process.
Proteins are the "working molecules" of living organisms—they catalyze reactions, transport molecules, and act as the targets for most modern drugs. The ability to rapidly classify proteins, understand their properties, and identify under-researched candidates could dramatically accelerate the discovery process (e.g. Wozniak et al., 2024, Nature Chemical Biology).
This is where the convergence of data engineering, machine learning (ML), and generative AI becomes transformative. In fact, you can build this entire pipeline on a single platform – the Databricks Data Intelligence Platform.
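Our AI-Driven Drug Discovery Solution Accelerator demonstrates an end-to-end workflow through four key processes: ingesting and refining raw protein data, classifying proteins with a protein language model, enriching records with LLMs, and exploring the results in natural language.
Let's walk through each stage.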
Raw biological data rarely arrives in a clean, analysis-ready format. Our source data comes as FASTA files, a standard text format for protein sequences in which each record is a header line beginning with `>` followed by one or more lines of single-letter amino acid codes.
To the untrained eye, this sequence data is nearly impossible to interpret—a dense string of single-letter amino acid codes. Yet, by the end of this pipeline, researchers can query this same data in natural language, asking questions like "Show me under-researched membrane proteins in humans with high classification confidence" and receiving actionable insights in return.
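To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. The record is made up for illustration (it is not a real protein), the header follows a UniProt-like layout that we are assuming, and `parse_fasta` is a hypothetical helper rather than code from the accelerator.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may wrap; join them back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up record for illustration only; not a real protein entry.
EXAMPLE_FASTA = """>sp|X00001|DEMO_HUMAN Hypothetical demo protein OS=Homo sapiens OX=9606
MKTLLVAGAL
LSPAWA
"""

records = list(parse_fasta(EXAMPLE_FASTA))
for header, sequence in records:
    print(header, len(sequence))
```

In practice a library such as Biopython would handle edge cases better; the point here is only that each record is a header plus a wrapped sequence.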
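Using Lakeflow Declarative Pipelines, we build a medallion architecture that progressively refines this data, from raw ingested records (bronze) through parsed and cleaned tables (silver) to analysis-ready tables (gold).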
The result: clean, governed protein data in Unity Catalog, ready for downstream ML and analytics. Critically, data lineage extends beyond this stage into the later stages, which is valuable for scientific reproducibility.
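The layering idea can be sketched in plain Python, outside of Lakeflow, to show what each tier might be responsible for. This is an illustration under assumptions, not the accelerator's pipeline: it assumes UniProt-style headers, and the function names, field names, and filtering rules are all hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed UniProt-style header:
#   db|accession|entry_name description OS=organism OX=taxonomy_id ...
HEADER_RE = re.compile(
    r"^(?P<db>\w+)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<tax_id>\d+)"
)
# Standard amino acid codes plus the ambiguous/rare codes U, O, B, Z, X.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYUOBZX")


def bronze(raw_records: Iterable[Tuple[str, str]]) -> List[Dict]:
    """Bronze: keep header and sequence exactly as ingested."""
    return [{"raw_header": h, "raw_sequence": s} for h, s in raw_records]


def silver(bronze_rows: Iterable[Dict]) -> List[Dict]:
    """Silver: parse headers into columns and drop malformed rows."""
    rows = []
    for row in bronze_rows:
        match = HEADER_RE.match(row["raw_header"])
        if match is None:
            continue  # header does not follow the assumed layout
        sequence = row["raw_sequence"].upper()
        if not sequence or not set(sequence) <= VALID_RESIDUES:
            continue  # empty sequence or unexpected characters
        fields = match.groupdict()
        fields["tax_id"] = int(fields["tax_id"])
        fields["sequence"] = sequence
        fields["length"] = len(sequence)
        rows.append(fields)
    return rows


def gold(silver_rows: Iterable[Dict], organism: str = "Homo sapiens") -> List[Dict]:
    """Gold: an analysis-ready subset, here filtered to one organism."""
    return [r for r in silver_rows if r["organism"] == organism]


# Made-up records for illustration only.
raw = [
    ("sp|X00001|DEMO_HUMAN Hypothetical demo protein OS=Homo sapiens OX=9606", "mktllvagal"),
    ("malformed header", "MKT"),
    ("sp|X00002|DEMO_MOUSE Another demo protein OS=Mus musculus OX=10090", "MKV"),
]
silver_rows = silver(bronze(raw))
gold_rows = gold(silver_rows)
```

In a declarative pipeline each of these functions would instead be a table definition, so the platform can track lineage between tiers.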
Not all proteins are created equal when it comes to drug discovery. Membrane transport proteins—those embedded in cell membranes—are particularly important drug targets because they control what enters and exits cells.
We leverage ProtBERT-BFD, a BERT-based protein language model from the Rostlab, fine-tuned specifically for membrane protein classification. This model treats amino acid sequences like language, learning contextual relationships between residues to predict protein function.
The model outputs a classification (as Membrane or Soluble) along with a confidence score, which we write back to Unity Catalog for downstream filtering and analysis.
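As a rough sketch of the steps around the model call, the snippet below prepares a sequence the way ProtBERT-style tokenizers generally expect (uppercase, space-separated residues, with the rare codes U, Z, O, B mapped to X; check the model card of the checkpoint you use) and turns a pair of output logits into a label with a confidence score. The label ordering and the helper names are assumptions for illustration, and the model call itself is left out.

```python
import math
import re
from typing import List, Sequence, Tuple

# Assumed label ordering; verify against the fine-tuned checkpoint's config.
LABELS = ("Soluble", "Membrane")


def prepare_for_protbert(sequence: str) -> str:
    """Uppercase, map rare residues to X, and space-separate the residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(
    logits: Sequence[float], labels: Sequence[str] = LABELS
) -> Tuple[str, float]:
    """Return the most likely label and its probability as the confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


model_input = prepare_for_protbert("mkuz")
label, confidence = to_prediction([0.0, 2.0])
print(model_input, label, round(confidence, 3))
```

The confidence here is simply the softmax probability of the winning class, which is a convenient filter downstream but is not a calibrated probability.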
Classification tells us what a protein is. But researchers need to know why it matters—what is the recent research? Where are the gaps? Is this an under-explored drug target?
This is where we bring in LLMs. Leveraging both the Databricks Foundation Model APIs and External Model endpoints, we create registered AI Functions that enrich protein records with research context.
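One way such a function could be registered is as a SQL function that wraps the Databricks `ai_query` AI Function. The sketch below only builds the statement as a string; the function name, the endpoint name, and the prompt wording are hypothetical placeholders, not the accelerator's actual definitions.

```python
def build_enrichment_function_sql(function_name: str, endpoint: str) -> str:
    """Return a CREATE FUNCTION statement that wraps ai_query for one protein.

    Both arguments are interpolated directly into SQL, so pass only trusted,
    hard-coded values here, never user input.
    """
    # Hypothetical prompt; kept free of single quotes so it embeds cleanly.
    prompt = (
        "Summarize recent research on the following protein and note "
        "whether it appears under-researched as a drug target: "
    )
    return (
        f"CREATE OR REPLACE FUNCTION {function_name}(protein_name STRING)\n"
        "RETURNS STRING\n"
        f"RETURN ai_query('{endpoint}', CONCAT('{prompt}', protein_name))"
    )


# Hypothetical three-level Unity Catalog name and serving endpoint name.
sql = build_enrichment_function_sql(
    "main.drug_discovery.protein_research_summary",
    "my-llm-endpoint",
)
print(sql)
# In a Databricks notebook this could then be run with spark.sql(sql).
```

Once registered, the function can be called from any SQL query, which is what lets enrichment run either on demand or as a batch step.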
We bring everything together in an AI/BI Dashboard with Genie Space enabled.
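Researchers can now explore the classified and enriched proteins visually and ask questions of the data in natural language.
The dashboard queries the same governed tables in Unity Catalog, with AI Functions providing on-demand (or batch-processed) enrichment.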
What makes this solution compelling is not any single component; it is that everything runs on one platform:
| Capability | Databricks Feature |
|---|---|
| Data Ingestion & ETL | Lakeflow Declarative Pipelines |
| Data Governance | Unity Catalog |
| ML Inference | GPU Compute |
| LLM Integration | Foundation Model APIs + External Models + AI Functions |
| Analytics | Databricks SQL |
| Exploration | AI/BI Dashboards + AI/BI Genie Space |
Critically, there is no data movement between systems. No separate MLOps infrastructure. No disconnected BI tools. The protein sequence that enters the pipeline flows through transformation, classification, enrichment, and ends up queryable in natural language—all within the same governed environment.
The complete solution accelerator is available on GitHub:
github.com/databricks-industry-solutions/ai-driven-drug-discovery
This accelerator demonstrates the art of the possible. In production, you might extend it to:
The foundation is there. The platform is unified. The only limit is the science you want to accelerate. Get started today.