Data + AI Summit 2022

Lessons Learned from Deidentifying 700 Million Patient Notes

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Science, Machine Learning and MLOps

Industry

  • Healthcare and Life Sciences

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 156

Duration

  • 35 min

Overview

Providence embarked on an ambitious journey to de-identify all our clinical electronic medical record (EMR) data to support medical research and the development of novel treatments. This talk shares how this was done for patient notes and how you can achieve the same.

First, we built a de-identification pipeline using pre-trained deep learning models, fine-tuned on our own data. We then developed an innovative methodology to evaluate reidentification risk, because U.S. healthcare law (HIPAA) requires that de-identified data have a “very low” risk of reidentification but does not specify a standard. Our next challenge was to annotate a dataset large enough to produce meaningful statistics and improve the fine-tuning of our model. Finally, through experimentation and iteration, we achieved a level of performance that would safeguard patient privacy while minimizing information loss. Our technology partner provided the computing power to efficiently process hundreds of millions of records of historical data and incremental daily loads.

Through this endeavor, we have learned many lessons that we will share:

- Evaluating risk of reidentification to meet HIPAA requirements
- Annotating samples of data to create labeled datasets
- Performing experiments and evaluating performance
- Fine-tuning pre-trained models with your own data
- Augmenting models with rules and other tricks
- Optimizing clusters to process very large volumes of text data

We will also present speed and throughput metrics from running our pipeline, which you can use to benchmark similar projects.
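
As a rough illustration of why the annotated dataset needs to be "large enough to produce meaningful statistics", the sketch below computes a Wilson score confidence interval for recall on annotated identifiers. It is not the methodology presented in the session; the function name and the sample counts are hypothetical and chosen only to show how the interval narrows as the annotated sample grows.

```python
import math


def wilson_interval(detected: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion such as recall.

    detected: annotated identifiers the model removed (hypothetical count)
    total:    all annotated identifiers in the sample (hypothetical count)
    z:        normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= detected <= total:
        raise ValueError("detected must be between 0 and total")

    p = detected / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed recall (99%), two hypothetical annotation budgets.
small_sample = wilson_interval(detected=99, total=100)
large_sample = wilson_interval(detected=990, total=1000)

print("100 annotated identifiers: ", small_sample)
print("1000 annotated identifiers:", large_sample)
```

Both samples show the same observed recall, but the lower bound of the interval is noticeably higher for the larger sample, which is the kind of evidence a "very low" risk claim would need to rest on.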

Session Speakers

Nadaa Taiyab

Senior Data Scientist

Tegria

Lindsay Mico

Director of Data Science

Providence Health
