Data and AI are ushering in a new era of precision medicine. The scale of the cloud, combined with advancements in machine learning, is enabling healthcare and life sciences organizations to use their mountains of data (electronic health records, genomics, real-world evidence, claims, and more) to drive innovation across the entire ecosystem, from accelerating drug discovery to preventing chronic disease.

Spark + AI Summit has become the central meeting ground for data scientists, engineers, and bioinformatics teams in the healthcare and life sciences industry to share real-world applications of big data, machine learning, and Apache Spark™.

This year’s Summit, scheduled for April 23-25 in San Francisco, continues this trend with a full set of technical talks led by industry leaders at McKesson, Amgen, Optum, Healthdirect Australia, Collective Health, and many others. Summit attendees can also participate in this year’s Healthcare and Life Sciences Networking Event to share best practices and discuss how best to drive innovation with Spark.

In this blog, we highlight a few of the esteemed presenters within the healthcare and life sciences industry, along with our thoughts on their talks.

Healthcare and Life Sciences Sessions at Spark + AI Summit

A Distributed Deep Learning Approach for the Mitosis Detection from Big Medical Images

The strongest indicator of a cancer patient's prognosis is the number of mitotic bodies identified in high-resolution whole-slide histopathology images. The problem is that these mitotic bodies must be counted manually. This talk introduces a large-scale deep learning approach that trains a two-stage, CNN-based model to detect mitosis locations with high accuracy, directly from the high-resolution whole-slide images.

Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Medicare and Medicaid Services

The Centers for Medicare and Medicaid Services (CMS) is the single largest payer for healthcare services in the United States, serving nearly 90 million Americans. CMS’s efforts with Apache Spark are allowing the organization to analyze clinical and claims data from various data sources to produce healthcare models that improve patient outcomes while reducing costs. However, strict data governance is critical due to HIPAA regulations governing access to personal information. This talk covers data governance best practices, including data security, stewardship, and quality management based on lessons learned at CMS.

Unleashing Data Science Nerds on Pharmacy Fraud at McKesson

McKesson's Specialty Health business helps pharmaceutical manufacturers make costly medications more affordable. However, well-organized networks of criminals routinely exploit the system to generate hundreds of millions of dollars in false claims. The losses are a huge cost to patients and the industry overall. Sifting through millions of pharmacy claims to identify fraudulent ones is no easy task. In this discussion, McKesson’s data scientists share how they’re combating the problem with a powerful fraud-detection model built on Apache Spark and Azure Databricks. This is one of the featured talks at the Summit Healthcare and Life Sciences networking event.

Be Patient: Building Advanced Analytics with the Patient at the Heart

Optum is a leading health services company empowering more than 126 million customers. By leveraging big data and machine learning, the company aspires to improve outcomes, while reducing the cost of care. In this talk, Optum will describe the impact it’s making using claims and clinical data to predict and prevent disease. The company’s data scientists will explain how they’ve unified data engineering and machine learning, using technologies such as MLflow, Apache Spark, and Databricks, to deliver powerful advanced analytics. This is one of the featured talks at the Summit Healthcare and Life Sciences networking event.

Assessing Drug Safety Using AI

Drug discovery and development is a lengthy and expensive process, with an estimated attrition rate of drug candidates of up to 96% and an average cost of developing a new drug of nearly $2.5 billion. Drug safety regulations account for 30% of drug failures. This talk provides a high-level overview of the rational drug design process that has been in place for many decades, and covers some of the major areas where AI, deep learning, and ML-based techniques have brought the greatest gains.

Building a Modern Data Platform for the Entire Drug Development Lifecycle at Amgen

At Amgen, data is at the crux of every step in the drug development lifecycle, from discovering new therapeutics to optimizing clinical trials. In this talk, Amgen outlines its vision for building a centralized and integrated data platform across the entire drug development lifecycle. Over the past five years, the company has made significant progress in delivering on its vision with a cloud-native, modular, elastic architecture built on AWS, Spark, and Databricks, and it will share key lessons learned along the way. This is one of the featured talks at the Summit Healthcare and Life Sciences networking event.

From Genomics to Medicine: Advancing Healthcare at Scale

Although healthcare practitioners have a wealth of opportunities to tap into massive volumes of genomic data, biostatisticians are still struggling to take full advantage of this opportunity. This talk highlights how the Databricks Unified Analytics Platform for Genomics empowers users to perform end-to-end analyses on our massively scalable platform in the cloud; in just minutes, a data scientist can visualize an individual’s disease risk based on their raw genomic data.

How Australia’s National Health Services Directory Improved Data Quality and Integrity with Delta and Structured Streaming

Healthdirect Australia delivers telehealth and digital health services in partnership with Australian federal, state, and territory governments. With more than 10 terabytes of data covering time-driven, activity-based health care transactions, Healthdirect Australia turned to Apache Spark for large-scale data processing and Delta’s fine-grained table features and data versioning to solve duplication and eliminate data redundancy. This talk will review Healthdirect Australia’s journey implementing Databricks and Spark and how it has enabled the agency to provide high-quality data for downstream analytics.

Accelerating Genomics SNPs Processing and Interpretation with Apache Spark

Genomics is a fast-growing practice that taps into the potential of big data and machine learning. This talk, featuring the head of data engineering at, focuses on how Apache Spark is helping the company significantly speed up analysis, from FASTQ to an annotated VCF file.

Building an Agile Development Environment for Healthcare Analytics Pipelines in Spark

Collective Health will share how it develops robust, reliable data pipelines while addressing stringent HIPAA compliance requirements. The rapidly changing nature of the company’s data and requirements means that it needs to be ready to update its pipelines with minimal lead time. Attend this talk to learn how Collective Health’s Data Engineering team has addressed data complexity, changing requirements, and compliance in a highly regulated industry by moving its pipelines to a Spark-based infrastructure.

What’s Next

Read All Our Guides to Spark + AI Summit