Frank Austin Nothaft

Technical Director of Healthcare and Life Sciences, Databricks

Frank is the Technical Director for the Healthcare and Life Sciences vertical at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM and Toil projects at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips. Frank holds a PhD and a Master of Science in Computer Science from UC Berkeley, and a Bachelor of Science with Honors in Electrical Engineering from Stanford University.

Past sessions

Summit Europe 2020 Panel Discussion: Improving Health Outcomes with Data + AI

November 17, 2020 04:00 PM PT

To drive better outcomes and reduce the cost of care, healthcare and life sciences organizations need to deliver the right interventions to the right patients through the right vehicle at the right time. To achieve this, health organizations need to blend and analyze diverse sets of data across large populations, including electronic health records, healthcare claims, SDoH/demographics data, and precision medicine technologies like genomic sequences. Integrating these diverse data sources under a common and reproducible framework is a key challenge healthcare and life sciences companies face in their journey towards powering data-driven outcomes. In this session, we explore the opportunities for optimization across the whole healthcare value chain through the unification of data and AI. Attendees will learn best practices for building data-driven organizations and hear real-world stories of how advanced analytics is improving patient outcomes.


  • Iyibo Jack, Director Engineering, Milliman MedInsight
  • Arek Kaczmarek, Exec Director, Data Engineering, Providence St. Joseph Health

Speaker: Frank Nothaft

Large-scale genomics datasets like the UK Biobank are revolutionizing how pharmaceutical companies identify targets for therapeutic development. However, turning petabytes of genomics data into actionable links between genotype and phenotype is out of reach for companies using legacy genomic data technologies. In this talk, Biogen will describe how they collaborated with DNAnexus and Databricks to move their on-premises data infrastructure into the AWS cloud. By combining the DNAnexus platform with the Databricks Genomics Runtime, Biogen was able to use the UK Biobank dataset to identify genes containing protein-truncating variants that impact human longevity and neurological status.

Summit Europe 2018 The Future of Healthcare with Big Data and AI

August 11, 2022 05:52 AM PT

Summit 2018 Saving Lives with Unified Analytics

August 11, 2022 05:52 AM PT

Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training.

Summit 2018 Genomics Demo

June 5, 2018 05:00 PM PT

Summit 2018 Scaling Genomics Pipelines in the Cloud

June 5, 2018 05:00 PM PT

Next-generation sequencing is becoming cheaper and more accessible. The volume of data sequenced is increasing faster than Moore’s Law. However, it is still expensive and slow to go from raw reads to variant calls, and to produce annotated variants that can then be analyzed downstream. In this talk, we will discuss the first state-of-the-art, scalable, and simple DNA sequencing workflow built on top of Apache Spark and the Databricks APIs. The pipeline is simple to set up, is easy to scale out, and can sequence a 30x coverage genome cost-efficiently on the cloud.

We'll introduce the problem of alignment and variant calling on whole genomes, discuss the challenges of building a simple yet scalable pipeline, and demonstrate our solution. This talk should be of interest to developers wishing to build ETL pipelines on top of Apache Spark, as well as biochemists and molecular biologists who wish to learn how to develop cheap and fast DNA sequencing pipelines.

Session hashtag: #DevSAIS10

Summit 2014 ADAM: Fast, Scalable Genomic Analysis

June 29, 2014 05:00 PM PT

ADAM is a high-performance distributed processing pipeline and API for DNA sequencing data. To allow computation to scale on clusters with more than a hundred nodes, ADAM uses Apache Spark as a computational engine and stores data using Apache Avro and the open-source Parquet columnar store. This scalability allows us to perform complex, computationally heavy tasks such as base quality score recalibration (BQSR) or duplicate marking on high-coverage human genomes (> 60×, 236GB) in under half an hour. In tests on the Amazon Elastic Compute platform, we achieve a 50% speedup over current processing pipelines, and a lower processing cost.

To achieve scalability in a distributed setting, we rephrased conventional sequential DNA processing algorithms as data-parallel algorithms. In this talk, we’ll discuss the general principles we used for making these algorithms scalable while achieving full concordance with the equivalent serial algorithms. Additionally, adapting genomic analysis to a commodity distributed analytics platform like Apache Spark makes it easier to perform ad hoc analysis and machine learning on genomic data. We will discuss how this impacts the clinical use of DNA analysis pipelines, as well as population genomics.

Summit 2016 Processing 70Tb Of Genomics Data With ADAM And Toil

June 7, 2016 05:00 PM PT

Modern genome sequencing projects capture hundreds of gigabytes of data per individual. In this talk, we discuss recent work where we used the Spark-based ADAM tool to recompute genomic variants from 70TB of reads from the Simons Genome Diversity dataset. ADAM presents a drop-in, Spark-based replacement for conventional genomics pipelines like the GATK. We ran this computation across hundreds of nodes on Amazon EC2 using Toil, a novel cluster orchestration tool. Toil was used to automatically scale the number of nodes used, and to seamlessly run large single node jobs and Spark clusters in a single workflow. By combining ADAM and Toil, we are able to improve end-to-end pipeline runtime while taking advantage of the EC2 Spot Instances market. Additionally, Toil is designed for scientific reproducibility, and our entire workflow was run using Docker containers to ensure that there is a static set of binaries that could be used to reproduce the pipeline at a later date. ADAM and Toil are both freely available Apache 2 licensed tools.

Summit East 2017 Processing Terabyte-Scale Genomics Datasets with ADAM

February 7, 2017 04:00 PM PT

The detection and analysis of rare genomic events requires integrative analysis across large cohorts with terabytes to petabytes of genomic data. Contemporary genomic analysis tools have not been designed for this scale of data-intensive computing. This talk presents ADAM, an Apache 2 licensed library built on top of the popular Apache Spark distributed computing framework. ADAM is designed to allow genomic analyses to be seamlessly distributed across large clusters, and presents a clean API for writing parallel genomic analysis algorithms. In this talk, we’ll look at how we’ve used ADAM to achieve a 3.5× improvement in end-to-end variant calling latency and a 66% cost improvement over current toolkits, without sacrificing accuracy. We will talk about a recent recompute effort where we used ADAM to re-call the Simons Genome Diversity Dataset against GRCh38. We will also talk about using ADAM alongside Apache HBase to interactively explore large variant datasets.