Karen Feng is a software engineer at Databricks. She works on Spark SQL and genomics applications on Spark, including Project Glow. Before Databricks, she developed statistical algorithms for genomics at Princeton University.
May 26, 2021 04:25 PM PT
Machine learning practitioners are most comfortable using high-level programming languages such as Python. This preference is a barrier to parallelizing algorithms with big data frameworks such as Apache Spark, which are written in lower-level languages. Databricks partnered with the Regeneron Genetics Center to create the Glow library for population-scale genomics data storage and analytics. Glow v1.0.0 includes PySpark-based implementations of both existing and novel machine learning algorithms. We will discuss how using tooling built for Python users, especially Pandas UDFs, accelerated our development velocity and affected our algorithms’ computational performance.
October 16, 2019 05:00 PM PT
With the size of genomic data doubling every seven months, existing genomics tools designed for the gigabyte scale break down when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while remaining flexible enough for ad hoc analysis, Databricks and the Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which describes how an individual’s genomic sequence differs from a reference human genome. We will cover two use cases in particular: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
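The four-step model described above (ingest, prepare, process per partition or row, save) can be illustrated with a minimal sketch. This is not Glow's API and does not use Spark: it is a plain-Python stand-in under stated assumptions. The record layout, the function names (`ingest`, `prepare`, `process`, `save`), and the sample rows are all hypothetical, and the per-row allele-frequency calculation merely stands in for an existing genomic analysis tool. In Spark, the same steps would map to a DataFrame reader, Spark SQL operations, a per-partition or per-row transformation, and a write to Delta or flat files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified variant records (not real data):
# chromosome, position, reference allele, alternate allele, genotype calls.
RAW = """\
chr1\t100\tA\tG\t0/1,1/1,0/0
chr1\t250\tC\tT\t0/0,0/0,0/1
chr2\t75\tG\tA\t1/1,1/1,0/1
"""


def ingest(text):
    """Step 1: parse a flat file into row dicts (stand-in for a DataFrame)."""
    rows = []
    for chrom, pos, ref, alt, gts in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({"chrom": chrom, "pos": int(pos), "ref": ref,
                     "alt": alt, "genotypes": gts.split(",")})
    return rows


def prepare(rows):
    """Step 2: group rows by chromosome (stand-in for repartitioning)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["chrom"]].append(row)
    return dict(partitions)


def alt_allele_frequency(row):
    """Stand-in for an existing analysis tool applied to one row."""
    alleles = [a for gt in row["genotypes"] for a in gt.split("/")]
    return alleles.count("1") / len(alleles)


def process(partitions):
    """Step 3: run the per-row tool independently within each partition."""
    results = []
    for chrom in sorted(partitions):
        for row in partitions[chrom]:
            freq = round(alt_allele_frequency(row), 3)
            results.append((row["chrom"], row["pos"], freq))
    return results


def save(results):
    """Step 4: serialize results as CSV text (stand-in for a flat-file write)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["chrom", "pos", "alt_freq"])
    writer.writerows(results)
    return buf.getvalue()


results = process(prepare(ingest(RAW)))
output = save(results)
```

Because each partition is processed independently in `process`, this step is the one that a framework such as Spark can distribute across workers.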
April 24, 2019 05:00 PM PT
With the exponential growth of genomic data sets, healthcare practitioners now have the opportunity to improve human outcomes at an unprecedented pace. These outcomes are difficult to realize in the existing ecosystem of genomic tools, where biostatisticians regularly chain together command-line tools on single-node, on-premises setups.
The Databricks Unified Analytics Platform for Genomics empowers users to perform end-to-end analysis on our massively scalable platform in the cloud: in only minutes, a data scientist can visualize an individual’s disease risk based on their raw genomic data. Built on Apache Spark, the platform provides click-button implementations of accepted best-practice workflows, as well as low-level Spark SQL optimizations for common genomics operations.