Clinical genomic analytics pipelines that use Databricks and Delta Lake to load individual reads from raw sequencing or base-call files have significant advantages over more traditional methods. Analysis pipelines that map reads to purpose-built reference data artifacts persisted to tables allow for performance that is orders of magnitude greater than previous mapping methods. These scalable, reproducible, and potentially open-sourced methods have the ability to transform bioinformatics and R&D data management and governance.
Andrew Brown: Hi everybody. My name is Andrew Brown. I’m here representing ZS. And today I’m going to be talking about managing research and development data on parallel computer infrastructures, in particular, the Databricks platform.
So the topics we’re going to discuss today: first, a quick introduction to ZS and what our firm does. We’re going to talk about one particular area, research and development data, and specifically next-generation sequencing data, DNA and RNA sequences. We’re going to talk about persistence strategies. And then how do we analyze that? How do we map that data to reference genomes and compute on it to answer scientific questions?
So I’d like to introduce our firm, ZS Associates. So ZS Associates is a larger, approximately 9,500-person global firm that works closely with pharma and smaller emerging pharma organizations to provide business and technology solutions. There’s about 1,200 clients that we’ve supported throughout the past 30 or so years, and ZS is a premier Databricks partner, which brings with it a significant set of expertise in the unified data analytics platform.
One of the subgroups of ZS, which is the group I’m here representing today, is our research and development excellence team. And a quick breakdown of the team. We’re approximately 750 professionals based primarily in the United States, with some global presence as well. We have a significant financial investment into developing business technology needs for research and development data, analytics, and developing technology assets. We’re very experienced in designing clinical trials, patient stratification. We have some products, commercial off-the-shelf like products. REVO Evidence is one example of that as well as a suite of accelerator modules to get your custom IT infrastructure up and running.
So we can break the R&D excellence team into approximately five discrete categories, one being clinical development. These are developing trials, patient stratification, automatically ingesting biometric data from wearables, internet of things type infrastructure. We have a medical affairs division. We have a global health economics and outcomes research. We have a real-world evidence team, which is significantly versed in taking clinical trial data, patient stratification data, and meshing that with the service line I represent, which is biomedical research. So in our biomedical research service line, we specialize in bioinformatics and artificial intelligence and machine learning methods that are applied to scientific datasets to answer questions such as biomarker identification, patient stratification, and target identification.
So this talk is a point of view discussing modern methods to manage and analyze and get more value out of research and development data. And I put this problem statement slide because there’s a significant problem currently in the research and development area. And that problem is there’s approximately 30-plus years of research and development data that has not followed the FAIR, FAIR being findable, accessible, interoperable, and reusable principles of data management.
Since this data is disparate, doesn’t follow schemas, and doesn’t have a unified persistence layer, it’s very difficult for other members of the organization to find the data and to access it, whether that means satisfying business administration policies for having the proper role or making IT requests just to be able to read and download the data. So we use the term data democratization, meaning that we want to make it simple. We want to lower the roadblocks for internal organizations in the biopharma industry so they can leverage their data and get it to the stakeholders and the user personas that actually need it. And this is all in an effort to drive drug development and artificial intelligence-based medicine, something we like to term programmable biology, where we can use years and years of previously curated R&D data to drive novel insights and predictions to answer questions in the scientific domain.
So as I said in the beginning, I specifically wanted to narrow in on the next-gen sequencing (NGS) data type. The reason I chose this for this particular talk is that NGS data are enormous in size. One individual flow cell from an Illumina instrument can produce on the order of 200 gigabytes to a full terabyte, depending on the platform, of compressed raw text files. This is extremely large and bulky, difficult to move, and very difficult to analyze in a time-efficient manner.
So I brought some specific steps and some specific checkpoints along the analytic process. And the first, most important piece, in my opinion, is the LIMS, lab information management system, and ELN, electronic lab notebook, integration. As those raw specimens are being sequenced, we need to understand what type of specimen each one is. There’s a significant amount of metadata that needs to accompany that particular biologic specimen. We need to know what the patient type is. We also need to be mindful of CLIA principles for clinical and regulatory dynamics.
Having the ability to multiplex, which is adding more than one specimen to a particular flow cell, allows us to become more economical. However, it adds more complexity in terms of demultiplexing that sample to determine which reads belong to which specific patient.
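As a rough illustration of the demultiplexing step, the sketch below groups reads by a barcode found at the start of each read. It is plain Python rather than Spark, and it assumes inline barcodes of a fixed length; on Illumina instruments the index is usually read separately, so treat the layout, the `SAMPLE_SHEET` mapping, and the `demultiplex` helper as hypothetical.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample sheet; in practice this would come from the LIMS.
SAMPLE_SHEET = {"ACGT": "patient_01", "TTAG": "patient_02"}


def demultiplex(reads, sample_sheet, barcode_len=4):
    """Group reads by the sample whose barcode prefixes the sequence.

    Assumes an inline barcode at the start of every read (an illustrative
    simplification). Reads with an unknown barcode go to "undetermined".
    """
    by_sample = defaultdict(list)
    for seq in reads:
        barcode, insert = seq[:barcode_len], seq[barcode_len:]
        sample = sample_sheet.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "NNNNACACAC"]
result = demultiplex(reads, SAMPLE_SHEET)
```

Each key in `result` is one sample's worth of reads, which is the unit of work that can later be handed to its own cluster.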
Base calls. So as the instrument reads the DNA, it produces what are called base calls, which are essentially electronic graph amplitudes of what particular base was at what particular position. And this is used to interpret the DNA or RNA base that was actually read off the instrument. This is a very computationally heavy step where we’re converting these electrostatic plots into basically a text or a DNA string readable sequence. This can be performed on the instruments. However, it’s again, like we said, computationally intensive and very, very slow.
ETL. We need to be able to take that FASTQ file, that raw text file that’s on the order of 200-plus gigabytes, and we need to be able to ingest it and curate it. Preparing the reads is the next step, where we’re cutting off the adapters and the other artifacts of sequencing preparation. Once those are cut off the reads, we’re able to map them to a reference genome. So we’re able to say an individual read belongs to a specific area of the chromosome or of the transcriptome, depending on what type of platform we’re using. These are all massive in terms of compute power, as well as in terms of data persistence.
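The read-preparation step can be pictured with a minimal adapter-trimming sketch. It assumes exact matching against a single adapter string, whereas production trimmers also tolerate mismatches and partial adapters at the read end. The `trim_adapter` helper is hypothetical and the adapter sequence is used purely for illustration.

```python
def trim_adapter(seq, qual, adapter):
    """Cut a read (and its quality string) at the first exact adapter match.

    Exact matching only: a simplification of what real trimming tools do.
    """
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual  # no adapter found, keep the read as-is
    return seq[:idx], qual[:idx]


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence

read_seq = "ACGTACGT" + ADAPTER + "TTTT"
read_qual = "I" * len(read_seq)

trimmed_seq, trimmed_qual = trim_adapter(read_seq, read_qual, ADAPTER)
untouched_seq, untouched_qual = trim_adapter("GGGGCCCC", "FFFFFFFF", ADAPTER)
```

Trimming the sequence and quality strings together keeps them the same length, which downstream tools generally rely on.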
And I broke this slide into a mature and an immature state of operation, and the mature includes things like having an automated LIMS and ELN integration, where the LIMS already has the metadata for the particular sample and specimen. It’s able to submit that to a CRO or to an internal sequencing operation, and it’s able to integrate that information as that sequencing comes to an end and the data is moved to the analytic pipeline.
Base calling. So this is another step where some clients prefer to do their own base calling using things like the GATK suite from the Broad Institute. Some clients do prefer to let it happen on the instrument. And this is a very intricate decision that needs to be made in terms of your pipeline. Do I want my instrument to convert and then move those files to the cloud? Be mindful that moving these types of files can become rather expensive in terms of networking, ingress and egress charges to the cloud. So we don’t want to be constantly moving data from an on-premises location into the cloud. We want there to be one seamless transition off premises into the cloud, and then all analytics happen within the cloud.
The ETL steps. So you’re going to see later in this presentation where we’re recommending the ingestion of these massive FASTQ files into data objects. We want them to be strongly schematized. We want them to be represented in individual, discrete tables. And we’re going to talk about the use of Databricks Delta Lake technology to do that.
Read preparation. Again, trimming and being able to map that to some sort of a database or reference genome is a critical step in virtually any NGS pipeline. And we’re going to talk about some strategies on how to pre-compute and make what we refer to as data products. Getting references into data frames, as well as having our reads in tabular format, really allows us to leverage the benefits of Apache Spark and the Databricks platform. And then we’ll talk about some downstream analytic use cases where we’re driving AI and machine learning methods to help predict and stratify patients and discover biomarkers.
So as we mentioned in the previous slide, the strategies for ingesting raw data. So first we need to get those FASTQ files into some sort of a data set. We’re a big proponent of the Databricks Delta Lake implementation, whether it be in Azure or AWS, where S3 persistence is used in AWS and Azure Data Lake Storage Gen2 is used on the Microsoft Azure platform. Defining these Delta tables allows us to get extremely large, in terms of trillions of rows with tens of thousands of columns. We’re able to then use datasets to model those FASTQ files as a data frame within Apache Spark. And you can kind of see an example function to the right in that small code block in this slide. This function essentially ingests the FASTQ file, curates it to a Delta table, persists it into storage, and allows us to go on and curate these reads where we’re trimming adapters. We’re then able to compare it to a pre-computed data product.
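The slide's code block is not reproduced in this transcript. As a stand-in, and not the speaker's function, here is a minimal plain-Python sketch of the first half of that idea: turning four-line FASTQ records into strongly typed rows. The `FastqRecord` schema and `parse_fastq` helper are assumptions made for illustration. In a Spark setting such rows would then become a data frame and be written out as a Delta table, which is omitted here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One row of a hypothetical reads table."""
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry (header, sequence, '+', quality)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n")
rows = list(parse_fastq(sample))
```

Keeping the header and quality string alongside the sequence means the quality-control information travels with each read into the table.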
So understanding the type of pipeline or the type of analysis that we’re aiming to answer within the pipeline is extremely critical in terms of developing these pre-computed data products. A data product, in our opinion, is a well-curated data table that’s used in place of a standard reference file. This could include things like pre-computed amplicon regions for the pipeline if it was an amplicon-based pipeline. It could also include things like transcriptome mappings. Basically, developing a reference of string-based values inside of a data frame allows us to parallelize on Apache Spark. Doing simple joins between the reference data products and the actual raw reads allows us to significantly parallelize and multiplex our processes. The object-oriented nature of Spark datasets also allows us to maintain different types of header information and different types of quality control information that were contained in the read.
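To make the join-against-a-data-product idea concrete, the sketch below treats the data product as a lookup table keyed by amplicon sequence and counts exact matches. It is plain Python with made-up sequences and identifiers. In Spark the same operation would be an equi-join between a reads table and the reference table followed by a grouped count. `AMPLICON_PRODUCT` and `count_amplicons` are hypothetical names.

```python
from collections import Counter

# Hypothetical data product: amplicon sequence -> amplicon identifier.
AMPLICON_PRODUCT = {
    "ACGTACGT": "chr21_amp_001",
    "GGCCGGCC": "chr18_amp_007",
}


def count_amplicons(reads, product):
    """Inner-join reads to the product on exact sequence and count hits per amplicon."""
    counts = Counter()
    for seq in reads:
        amplicon_id = product.get(seq)
        if amplicon_id is not None:  # unmatched reads drop out, as in an inner join
            counts[amplicon_id] += 1
    return counts


reads = ["ACGTACGT", "GGCCGGCC", "ACGTACGT", "TTTTTTTT"]
counts = count_amplicons(reads, AMPLICON_PRODUCT)
```

Because each read is matched independently of every other read, the work partitions cleanly across nodes, which is what makes the join approach scale.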
We’re also able to have streaming analytics placed on this platform. So if you’re familiar with other non-Illumina-based platforms like the Oxford Nanopore, you’ll know that it sequences extremely long DNA or RNA reads in real time. So as that instrument is collecting data, it is pulling files in real time. For time-sensitive and extremely critical applications, you’re able to develop Spark Structured Streaming jobs within the Databricks platform and, in real time, curate, read, and trim those incoming Oxford Nanopore DNA sequence reads. All of this, depending on the application, is possible with the Databricks platform.
In terms of automating the analysis, we’ve developed pipelines. We have infrastructure to transport data from the raw instruments into the cloud and kick off analytics pipelines. We can leverage things like the Databricks Jobs API. The Jobs API allows us to instantiate a cluster and execute a particular unit of work or a particular data pipeline defined in things like Airflow or Azure Data Factory.
So we’re able to dump a file to a particular location and trigger its ingestion, its demultiplexing, and its automated analytics. For more time-sensitive and more scalable applications, we can do things like spin up additional clusters as we demultiplex the entire flow cell. For instance, a flow cell on a NovaSeq might have on the order of 100 or so individual samples that were multiplexed. If it was a very time-sensitive application, we could spin up one cluster per individual sample and significantly enhance our throughput, all the while maintaining control over the expenses by having large, very fast, very performant resources that are able to compute in a very small period of time. We’re also able to implement and add Spark application artifacts. So for instance, if we had a specific JAR package or a specific library dependency, maybe a Python wheel or egg file, we’re able to easily deploy that and define it within the Jobs API. And as that job spins up, we’re able to distribute that dependency library to the individual nodes.
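One way to picture the one-cluster-per-sample pattern is as a list of run requests, one per demultiplexed sample, each declaring its own cluster and library dependency. The sketch below only builds the request bodies and sends nothing. The field names follow the shape of the Databricks Jobs API one-time run request as best understood here and should be checked against the current documentation. Every value (paths, node type, runtime version, wheel name) is a hypothetical placeholder.

```python
import json


def build_run_payload(sample_id, fastq_path):
    """Build a one-off run request for a single demultiplexed sample.

    All concrete values are placeholders; field names are assumed to match
    the Jobs API one-time run request and should be verified before use.
    """
    return {
        "run_name": f"ngs-map-{sample_id}",
        "tasks": [
            {
                "task_key": "map_reads",
                # A fresh cluster per sample, so samples are processed in parallel.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 8,
                },
                "spark_python_task": {
                    "python_file": "dbfs:/pipelines/map_reads.py",
                    "parameters": ["--sample", sample_id, "--fastq", fastq_path],
                },
                # Dependency distributed to the nodes when the job starts.
                "libraries": [{"whl": "dbfs:/libs/ngs_tools-0.1-py3-none-any.whl"}],
            }
        ],
    }


samples = {
    "patient_01": "dbfs:/landing/patient_01.fastq",
    "patient_02": "dbfs:/landing/patient_02.fastq",
}
payloads = [build_run_payload(sid, path) for sid, path in samples.items()]
serialized = [json.dumps(p) for p in payloads]
```

Keeping payload construction separate from submission makes the per-sample fan-out easy to inspect and test before any cluster is actually started.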
On the other side of the house, for data scientists and people that are more fluent and want a more intimate experience with the data, the Databricks platform provides notebooks for user interaction. So these notebooks allow multiple language implementations of Apache Spark with real-time visualization. Lots of bioinformaticists are very comfortable working with languages like Python and R, and we’re able to use those in real time, inline within a notebook, to develop plots, check concordance, check different expression values, and develop figures for scientific publications, as well as for benchmarking and performance enhancements.
So many organizations already have a particular pipeline in place. Maybe it’s an on-premises high-performance computing (HPC) infrastructure. Maybe they’re using vendor-made platforms like Seven Bridges or DNAnexus in order to compute and process their biologic data. What this slide depicts is where we see the industry going in terms of throughput and in terms of scalability. And there are two main components, one being, how fast are we able to map those reads to reference genomes? Traditional human genome reference mapping times are on the order of four-plus hours using traditional aligners on a single compute node. ZS has experimented with developing multiple curated data products that allow us to cut that time to below one minute using significantly large, scalable Databricks Apache Spark clusters. So in the bottom-left corner, we see traditional mappers like the Burrows-Wheeler Aligner and Bowtie, and these all run on single-node alignment infrastructure, things that aren’t able to parallelize. And we see that in the lower category as far as being able to aggregate data and being able to have a high amount of throughput.
To the right and higher up, the better, more modern approach is the implementation of the custom data products as well as developing different Spark user-defined functions, UDFs, that allow us to do custom analytics, like finding areas of overlap, differential expression analysis, things that require [non-regit] space matching. Having the ability to scale, as we mentioned in the case of a multiplexed flow cell being demultiplexed and each individual sample spawning an individual cluster, allows for a significant amount of throughput, where we can achieve times to map entire human chromosomes to a data reference in less than one minute.
And we see this trend where we’re going to see a lot of value in curating and developing a variety of data products that allow us to use these infrastructure components in these scalable methods, all in an effort to lower the amount of time to analyze, to make it more cost-effective for the clients, and to follow the FAIR principles of data, as well as enhancing the ability to do things such as train variational autoencoders that would suggest a particular genetic expression pattern correlates to a disease.
We have this mass of growing mapped patient data. So we’re able to grow that knowledge base with the FAIR principles, and we’re able to scale it out by the Delta Lake technology behind it in an effort to have well-curated datasets and data models that we can now train machine learning and artificial intelligence algorithms on.
So I wanted to quickly talk about a practical clinical use case that ZS had performed for a particular client. Basically, this was a liquid biopsy that was performed, and the cell-free DNA was collected from the blood and analyzed. So this is an amplicon-based pipeline where we already know the regions of the chromosome that we’re interested in. And in this particular use case, we had approximately 500,000 known sequences of amplicon-based DNA reads that we were looking for in these individual cell-free DNA extracts.
So this particular client had scalability concerns where they were in their clinical trials taking on over 100,000 patients in a very short period of time and the business proposals and the business needs called for that to scale to tens of millions of patients being onboarded per year. This was a very novel pipeline, a very novel clinical assay that they wanted to see get out into the entire world. So there was anticipation that it would be widely used and widely adopted, and there would be a significant amount of data.
There were about 12 different data tables that we were writing to in the analysis of this pipeline. And for each table, each patient would generate approximately 1 to 2 million rows of data. So you can imagine the scalability concerns of being able to persist this into some unified data lake. As we mentioned, this amplicon-based pipeline lent itself very nicely to the pre-computed data product, and we were able to develop user-defined functions to match and determine the number of particular reads in an effort to determine aneuploidy in the cell-free DNA. As we curated and mapped these massive patient populations to their corresponding amplicon identifications, we were able to cut the mapping time from about four hours to less than one minute per patient. And again, we went with the workflow where a demultiplexed flow cell spun up an individual data cluster per patient, and we were able to compute that in a very short period of time and add it to a curated database.
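The talk does not say how read counts were turned into an aneuploidy call. One approach commonly described for cell-free DNA screening, shown here only as an assumed illustration, compares a chromosome's share of a patient's reads against the same share in a set of reference samples and expresses the difference as a z-score. All numbers below are invented, and the helper names are hypothetical.

```python
from statistics import mean, stdev


def chromosome_fraction(counts, chromosome):
    """Share of all mapped reads that fall on the given chromosome."""
    total = sum(counts.values())
    return counts[chromosome] / total


def z_score(sample_fraction, reference_fractions):
    """Standard deviations between the sample and the reference mean."""
    return (sample_fraction - mean(reference_fractions)) / stdev(reference_fractions)


# Invented chr21 read fractions from reference (presumed euploid) samples.
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]

# Invented per-chromosome read counts for one patient.
sample_counts = {"chr21": 140, "other": 9860}

fraction = chromosome_fraction(sample_counts, "chr21")
z = z_score(fraction, reference)

# A cutoff of 3 is a common illustrative choice, not a clinical rule.
elevated = z > 3
```

A count-based statistic like this depends only on per-amplicon read tallies, which is why fast matching against a pre-computed data product is enough to feed it.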
So here’s that process flow in a little more detail. Again, the raw data was ingested. The adapters were trimmed. We were able to identify and map to pre-computed datasets. In the post-hoc analysis, these machine learning models were trained on the growing number of patients that were ingested into the system. So using things like MLflow and a variety of PyTorch, Python, machine learning, and artificial intelligence libraries, we were able to alter and change that machine learning model as more and more data were ingested. All this, in turn, was broken out into technical and scientific value, technical being, what is the performance? What are the compute metrics? How much does it cost to actually run and scale this pipeline in the cloud? And then scientific value, in consideration of growing democratized data lakes: coming up with modern amplification strategies, so we’re not destroying the quality of our reads by amplifying and reading the amplified reads, and coming up with novel methods to identify and map. So again, this is lowering the barrier to breakthrough and novel science.
Scalable applications. So as we have that growing data lake, we’re able to develop analytic and visualization applications on top of that. And again, we have the growing structure of a consolidated data lake that can be reused across multiple domains. These are all valuable pieces to have as a biopharma organization, and different models to look to for other types of data rather than just NGS, things like flow cytometry, things like CAR T therapy. Having the ability to take and curate those pipelines in a similar fashion, to curate the data in a data lake and drive downstream machine learning processes, is of significant value.
So a recap on what we talked about today. Again, we talked about NGS data persistence strategies as well as NGS mapping and alignment strategies, how we can scale the throughput and how we can scale the persistence to the point where we have one particular data lake that models all our R&D data and allows cross or translational studies, things where we’re taking different domains of data, NGS, flow cytometry, and we’re able to normalize and make more broad interpretations and more broad predictions, train better machine learning models to learn from our R&D data.
So I’m going to attach my contact information. If you have any questions or any particular projects that you’d like to talk about, I can be reached. My LinkedIn contact information is located down below. And as always, feel free to reach out to anyone at ZS if you should have a desire to talk about potential projects. And I thank you for your time today. It was a pleasure.
Andrew joins ZS with more than 12 years of experience in software development and biomedical research. His most recent engagements were providing consulting services through a smaller, Boston-based fi...