With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Karen Feng is a software engineer at Databricks. She works on Spark SQL and genomics applications on Spark, including Project Glow. Before Databricks, she developed statistical algorithms for genomics at Princeton University.