Of data points processed from thousands of sources
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and greater than $5 billion in R&D investment only to see less than 5% of the drugs make it to market. Understanding that this pace of innovation is not sufficient, AstraZeneca moved to a data-driven approach in order to increase their success rate for drug discovery and safer management of clinical trials.
However, their scientists were still unable to quickly make informed decisions with all of the available scientific information at their fingertips. They struggled with data residing in disjointed sources both within the company as well as external public databases. Furthermore, as new scientific research continues to be released at a rapid pace, it became virtually impossible to keep up-to-date with the pace of scientific discovery.
Infrastructure complexity: Finding the infrastructure that allowed for flexibility but didn’t require constant maintenance.
Massive volumes of disjointed data: Tasked with ingesting, parsing, and analyzing millions of data points across 100s of data sources including internal data sources and public sources including technical literature, public databases, etc.
Struggled to scale operations to support data science efforts with open source Python notebooks.
AstraZeneca leverages the Databricks Lakehouse Platform to help build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AstraZeneca scientist to generate novel target hypotheses, for any disease, leveraging all of the data available to them.
Fully managed platform: Simplified cluster management and maintenance of analytic resources at scale.
Built scale, performant data pipelines: Able to leverage NLP across a huge library of scientific literature and data sources for downstream analysis.
Accelerating machine learning innovation: Data scientists are empowered to build and train models that provide ranking predictions that will help them make smarter decisions.
Since moving to Databricks, AstraZeneca is now able to more easily process millions of data points from thousands of sources. Removing the barriers of scale has allowed them to more reliably extract meaningful insights that can result in novel drugs designed to help people live healthier lives.
Improved operational efficiency: Features such as cluster management and auto-scaling of clusters has improved operations from data ingest to managing the entire machine learning lifecycle.
Better data science productivity: Shared notebook environment with support for multiple languages has improved team productivity.
Faster time-to-insight: The recommendation engine powered by Databricks has improved their ability to make more informed hypotheses, allowing them to accelerate time-to-market for novel drugs and medicines.