Discussions about the role of technology in genomics invariably focus on the massive growth in DNA sequencing since the beginning of the century, growth that has outpaced Moore’s law and led to the $1000 genome. Future growth is projected to be even more spectacular, however, and for it to become a reality we need more powerful tools for genome analysis. Apache Spark is providing the foundation for these new tools, including two that I will cover in this talk: GATK and Hail, both open source projects from the Broad Institute. GATK and Hail are complementary: GATK provides pipelines for transforming DNA sequence data into the raw material (variant call data) that Hail needs to run genetic analysis across thousands of individuals. GATK started out as a single-process program but has now been ported to run on Spark at scale, whereas Hail was written from the outset to run on Spark. In this talk I will look at how these frameworks take advantage of Spark to scale, some of the challenges in getting existing data formats to work with Spark, and some of the plans for the future.
Session hashtag: #EUres9
Tom White is a data scientist at Cloudera, specializing in big data and bioinformatics. Previously, Tom was a distributed systems engineer on Hadoop technologies at Cloudera, where he has worked since its foundation in 2008. Tom is an Apache Hadoop committer and the author of "Hadoop: The Definitive Guide", the bestselling book published by O'Reilly Media.