Combining whole genomes to improve patient outcomes

Joint calling of 10,000 whole genome sequences accelerated from 10 weeks to 3 days

INDUSTRY: Life sciences

"Databricks enabled us to scale to 10,000 whole genome sequencing using a state-of-the-art robust platform that allows our data teams to conduct effective joint calling for genetic variant discovery."

— Nicolas Bertin, Program Manager, Genome Institute of Singapore

As part of the government’s efforts to transform healthcare in Singapore and improve patient outcomes, the Agency for Science, Technology and Research (A*STAR)’s Genome Institute of Singapore (GIS) is partnering with Precision Health Research, Singapore (PRECISE) to implement the National Precision Medicine (NPM) strategy. With precision medicine as a key focus, their aim is to use genomic sciences to improve patient outcomes and to gain new insights into Asian genomes and data-driven healthcare solutions. However, conducting genomic variant discovery for 10,000 individuals was not efficient with the technology they had on hand. With the Databricks Data Intelligence Platform, GIS was able to aggregate and mine genetic variants at population scale, which in turn better enables Singapore to deliver precision medicine that improves patient outcomes.

Challenges in scaling to 10,000 whole genome sequences

Singapore has a rapidly aging population with an increased prevalence of chronic diseases. This shifts the focus of healthcare delivery from treatment to prevention. Precision medicine remains a key focus in addressing these demands and staying future-ready.

In Phase I of the NPM strategy, GIS worked with other partners in the Singapore ecosystem to establish a Singaporean reference database containing 10,000 whole genomes.

“In the past, we performed whole genome sequencing on smaller numbers of genomes. However, the results generated from these smaller cohorts were unsatisfactory and not comprehensive. What we required was a solution to complete 10,000 whole genomes via joint calling so that this data can be used to effectively predict which treatment and prevention strategies will work for different groups of people,” said Nicolas Bertin, Program Manager at the Genome Institute of Singapore.

The research institute needed a new way to scale and process the genomic data in order to reach a deeper understanding of how diseases develop and to discover better prevention and treatment plans.

15x improvement in processing of whole genome data

Databricks partnered with GIS to help scale joint calling of 10,000 genomes.

“Working with the Databricks team was a pleasant experience, the data engineers were always beside us to address any queries. The platform enabled us to run 10,000 genomes in less than 72 hours, which was a 15x improvement as compared to before,” said Nicolas.

Leveraging the Databricks Data Intelligence Platform, GIS accelerated processing of this data. The research institute also used the interactive notebooks as the primary interface to set up the analytics pipeline and to explore the data.

“It was crucial that the data generated by our Singaporean NPM had to be compatible with other national precision medicine programs around the world that used the Genome Analysis Toolkit (GATK),” said Nicolas.

Extracting life-saving insights from whole genome data

Databricks led the development of the platform and delivered a prototype, deployed on the Genomics Database, within six weeks, a feat that used to take them six months. This greatly enhanced the efficiency with which the GIS data teams accessed the data. It also provided timely and effective joint calling-based refinement of genetic variant discovery from whole genome sequencing across a population of over 10,000 individuals, a major milestone of the program.

The team is now moving on to the next phase of the NPM, collaborating with PRECISE to sequence the genomes of 100,000 healthy Singaporeans, some of whom have specific diseases. PRECISE is the central entity set up to coordinate this whole-of-government effort to implement Phase II of Singapore’s 10-year NPM strategy. It would take over two years to process data at this scale using the old GATK approach, and accelerating the algorithm with Apache Spark™ alone will not solve the problem. Continuous integration and analytics of new genomes can be achieved at this scale by incorporating Delta Lake and Glow, two other open source libraries developed at Databricks.
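For a rough sense of the "over two years" figure, the back-of-envelope sketch below extrapolates the earlier baseline of 10 weeks for 10,000 genomes to the Phase II cohort of 100,000. It assumes runtime grows linearly with cohort size. That linear model and the function name `extrapolate_runtime` are illustrative assumptions, not something the case study states.

```python
# Back-of-envelope extrapolation of the pre-Databricks joint-calling runtime.
# ASSUMPTION: runtime scales linearly with the number of genomes. This is a
# simplification for illustration; it is not a figure from the case study.

WEEKS_PER_YEAR = 52


def extrapolate_runtime(baseline: float, baseline_genomes: int, target_genomes: int) -> float:
    """Scale a baseline runtime linearly to a larger cohort (same time unit in and out)."""
    return baseline * target_genomes / baseline_genomes


# Baseline from the case study: 10 weeks for 10,000 genomes before the migration.
old_weeks = extrapolate_runtime(10, 10_000, 100_000)
old_years = old_weeks / WEEKS_PER_YEAR

print(f"Estimated runtime at 100,000 genomes: {old_weeks:.0f} weeks (~{old_years:.1f} years)")
```

Under this assumption the estimate is about 100 weeks, a little under two years. The quoted "over two years" is therefore consistent with a runtime that grows somewhat faster than linearly as the cohort expands.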

“Databricks provides us with the necessary platform to strengthen our data capabilities that support this massive national initiative to derive better treatment options for the Singapore population through data analytics,” said Nicolas.