March 25, 2026

Tevogen Bio’s Journey to Streamlining Life-Saving Therapies

Accelerating drug discovery with data and AI

The Innovation: Tevogen Bio is leveraging its ExacTcell platform and proprietary PredicTcell AI models to modernize and automate the traditionally slow, $3 billion drug discovery process.
The Challenge: To overcome the "bottleneck" of manual wet-lab testing and multi-terabyte data silos, Tevogen partnered with Microsoft and Databricks to build a massive, governed data platform on a lakehouse architecture.
The Results: By processing 16 billion datapoints, Tevogen compressed a 50-day research cycle into just 24 hours, achieving 93–97% recall in its alpha model to deliver faster, more affordable therapies.

Accelerating the Decade-Long Process of Drug Discovery

Drug development costs upwards of $3 billion and takes 10-12 years to bring a product to market. Both the cost and the timeline directly limit how accessible and affordable a given product can be.

To address these issues, Tevogen Bio created the patented ExacTcell platform, which determines targets against any given viral, oncological or neurological disease for a single HLA restriction. The initial target selection for its proof-of-concept trial on a single viral candidate, SARS-CoV-2, was performed manually. The single HLA-restricted product, while capable of addressing a majority of the population, required a significant time and resource commitment, taking 18-24 months to test and confirm through wet-lab science.

To meet Tevogen’s mission of providing faster, cheaper and more accessible care, Tevogen.AI partnered with Microsoft and Databricks to optimize its core platform's scientific understanding while streamlining and accelerating its pipeline toward additional indications.

The challenge was to ingest protein sequences across a spectrum of diseases and build a library that lets scientists and researchers turn a process that once took months into one that takes days and, subsequently, hours.

This dataset will also be used to train Tevogen.AI's patented foundational algorithmic models, which are backed by Tevogen Bio's proprietary science. In addition, Tevogen’s executive team set the challenge of curating a dataset of known genetic proteins to train the algorithmic model to predict immunologically active peptides using machine learning methods.

The Bottleneck: Wrangling Multi-Terabyte Datasets

To curate this dataset, the team had to procure a multi-terabyte dataset and organize it with the features needed for algorithmic training. This presented two major problems:

Creating data pipelines to quickly procure and organize relevant information with multi-level cleansing and filtering, and
Converting a process designed to run serially into one that runs in parallel.
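
The second problem can be illustrated with a minimal Python sketch. It is not Tevogen's pipeline: the record format, the `clean_record` rule, the chunk size and the function names are all hypothetical, and a local thread pool merely stands in for a distributed engine. The point is only the shape of the change: a per-record step with no shared state can be split into chunks, processed independently, and merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cleansing rule: keep only sequences built from the
# 20 standard amino-acid letters.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_record(raw):
    """Normalize one raw sequence; return None if it fails the filter."""
    seq = raw.strip().upper()
    if not seq or not set(seq) <= VALID_RESIDUES:
        return None
    return seq


def process_serial(records):
    """Original shape: one record after another."""
    return [s for s in (clean_record(r) for r in records) if s is not None]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_parallel(records, chunk_size=2, max_workers=4):
    """Same result, but each chunk is handled independently, then merged."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the merge is deterministic.
        parts = pool.map(process_serial, chunked(records, chunk_size))
        return [seq for part in parts for seq in part]


# Toy input (illustrative only).
records = [" mkvl ", "ACDX", "", "GGHH", "pqrs\n"]
cleaned = process_parallel(records)
```

Threads in CPython do not speed up CPU-bound work, so this sketch shows the structure rather than the gain. The speedup in practice comes from running the same chunk-map-merge pattern across many processes or cluster nodes.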

This is where Databricks proved to be a critical partner.

Architecting a Modern Data Lakehouse with Databricks

We selected the Databricks Platform as the base of our modernization efforts. Using the medallion architecture and Unity Catalog, we built numerous pipelines that store data in bronze, silver and gold layers while maintaining strict governance and fine-grained access control.

By combining distributed computing with this cleaner structure, we brought processing time down from 50 days to 24 hours. The medallion architecture also served as the foundation for developing various machine learning (ML) models.

Thanks to the experts from the Databricks Professional Services team, with personal acknowledgement to Vibhor Nigam and Mohamad Abafoul, Tevogen.AI was able to process data at scale and amass a dataset comprising 24 million proteins. These were then refined and sorted to derive 16 billion datapoints and ~700 million unique peptides from the Bronze to the Silver layers of the medallion architecture. In addition, we have curated ~37 million cross-matched expert articles.
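
How a protein library can expand into a much larger set of unique peptides can be sketched with a sliding window. This is an illustrative assumption rather than the team's actual method: the peptide lengths (8 to 11 residues), the function names and the toy sequences are all hypothetical. The sketch turns a bronze-style list of raw protein records into a silver-style list of deduplicated peptides.

```python
def peptides_from_protein(sequence, min_len=8, max_len=11):
    """Yield every contiguous substring with length in [min_len, max_len]."""
    n = len(sequence)
    for k in range(min_len, max_len + 1):
        # range is empty when the protein is shorter than k
        for start in range(0, n - k + 1):
            yield sequence[start:start + k]


def build_silver_peptides(bronze_proteins, min_len=8, max_len=11):
    """Collapse raw protein records into a sorted list of unique peptides."""
    unique = set()
    for record in bronze_proteins:
        unique.update(
            peptides_from_protein(record["sequence"], min_len, max_len)
        )
    return sorted(unique)


# Toy bronze layer: two hypothetical records that share a region.
bronze = [
    {"id": "P1", "sequence": "ACDEFGHIKL"},
    {"id": "P2", "sequence": "CDEFGHIKLM"},
]
silver = build_silver_peptides(bronze)
```

Each toy protein yields 6 windows, and because the two sequences overlap, the 12 windows collapse to 9 unique peptides. Deduplication of this kind is one reason a refined layer can be far smaller than the raw window count would suggest.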

From Data to AI: Training the PredicTcell Model

Anyone who has worked in bioinformatics understands that this is no small feat to accomplish within a matter of months. While this process took place, the team worked in parallel to create an MLOps framework for automatic training, inference, monitoring and retention. On completing the initial phase of the engagement, the team delivered the alpha version of the PredicTcell model, trained using traditional XGBoost methods and ESM models, which achieved 93-97% recall and 38-43% accuracy.
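
Recall and accuracy measure different things, which is why the two reported ranges can sit far apart. The small worked example below uses purely hypothetical counts, chosen for illustration and not taken from Tevogen's results. It shows how a model that misses very few true positives can still score low on accuracy when negatives dominate the dataset and many of them are flagged.

```python
def recall(tp, fn):
    """Share of truly positive items that the model flagged."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions, positive or negative, that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts (not from the source).
tp, fn = 95, 5      # 100 true positives in the data, 95 caught
tn, fp = 305, 595   # 900 true negatives in the data, 595 wrongly flagged

model_recall = recall(tp, fn)
model_accuracy = accuracy(tp, tn, fp, fn)
print(f"recall={model_recall:.2f} accuracy={model_accuracy:.2f}")
# prints: recall=0.95 accuracy=0.40
```

In a screening setting, this trade-off can be deliberate: a missed candidate is lost for good, while a false positive only costs a follow-up test.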

Further, expanding the datasets allowed Tevogen’s scientific team to gain and share new insights into the model training cycle, refining the training methods with each iteration. We continue to add features to our training set, such as rapid assessment of expert articles through RAG integration using Agent Bricks, coupled with biochemical properties.

Looking Ahead: Unlocking the Holy Grail of Medicine

As training kicks off for the beta version of the PredicTcell model and we begin work on the alpha version of our AdapTcell model, Tevogen.AI is uniquely positioned to create state-of-the-art predictive models for peptide-to-protein binding affinity with increasing accuracy, a key to unlocking the holy grail of medicine.

With their proprietary models, Tevogen.AI is confident that they will be able to achieve their ultimate goal of predicting the binding peptide for any protein, novel or otherwise, with a very high degree of accuracy.

“Adding determinism to a probabilistic workflow is the key to unlocking success. Balancing the in-vivo/in-silico trial-and-error process is something that every biotech company should be focused on for drug development,” said Mittul Mehta, CIO – Tevogen and Head – Tevogen.AI.

“I am extremely pleased with our relationship with Databricks and Microsoft, as each brings the best capabilities to the table to allow us to innovate continuously and reach Tevogen’s goal of providing affordable and accessible therapies for large patient populations. I look forward to continuing to work with both of these excellent partners to innovate in AI for drug development.”
