Lessons Learned from Deidentifying 700 Million Patient Notes
On Demand
Type
- Session
Format
- Hybrid
Track
- Data Science, Machine Learning and MLOps
Industry
- Healthcare and Life Sciences
Difficulty
- Intermediate
Room
- Moscone South | Upper Mezzanine | 156
Duration
- 35 min
Overview
Providence embarked on an ambitious journey to deidentify all our clinical electronic medical record (EMR) data to support medical research and the development of novel treatments. This talk shares how we did this for patient notes and how you can do the same.
First, we built a deidentification pipeline using pre-trained deep learning models, fine-tuned to our own data. We then developed an innovative methodology to evaluate reidentification risk, as American healthcare laws (HIPAA) require that deidentified data have a “very low” risk of reidentification, but do not specify a standard. Our next challenge was to annotate a dataset large enough to produce meaningful statistics and improve the fine-tuning of our model. Finally, through experimentation and iteration, we achieved a level of performance that would safeguard patient privacy while minimizing information loss. Our technology partner provided the computing power to efficiently process hundreds of millions of records of historical data and incremental daily loads.
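The balance between safeguarding privacy and minimizing information loss is often tracked as recall (identifiers caught) against precision (non-identifiers left intact). The following is a minimal sketch of that idea, assuming token-level gold and predicted labels; the function name and the label encoding are hypothetical and are not the evaluation methodology presented in the talk.

```python
from typing import Sequence, Tuple


def phi_precision_recall(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float]:
    """Token-level precision and recall.

    True marks a token labeled as an identifier (hypothetical encoding).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


if __name__ == "__main__":
    gold = [True, True, False, False, True]
    pred = [True, False, True, False, True]
    print(phi_precision_recall(gold, pred))
```

In this framing, low recall means identifiers leak through, while low precision means clinical content is redacted unnecessarily.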
Through this endeavor, we have learned many lessons that we will share:
- Evaluating risk of reidentification to meet HIPAA requirements
- Annotating samples of data to create labeled datasets
- Performing experiments and evaluating performance
- Fine-tuning pre-trained models with your own data
- Augmenting models with rules and other tricks
- Optimizing clusters to process very large volumes of text data
We will also present speed and throughput metrics from running our pipeline, which you can use to benchmark similar projects.
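As an illustration of augmenting a model with rules, the sketch below adds regular-expression matches to spans returned by a model before redaction. It is an assumption-laden example: the patterns, the `Span` layout, and the functions `rule_spans`, `merge_spans`, and `redact` are all hypothetical and do not describe the pipeline from the talk.

```python
import re
from typing import List, Tuple

# (start, end, label) character offsets; a hypothetical span layout.
Span = Tuple[int, int, str]

# Illustrative patterns only; real notes need far broader coverage.
RULES = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return spans found by the regular-expression rules."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model_spans: List[Span], extra_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add rule spans that do not overlap any kept span."""
    merged = list(model_spans)
    for start, end, label in extra_spans:
        if all(end <= s or start >= e for s, e, _ in merged):
            merged.append((start, end, label))
    return sorted(merged)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder, working right to left
    so earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


if __name__ == "__main__":
    note = "Call John Smith at 555-123-4567 on 3/14/2021."
    model_output = [(5, 15, "NAME")]  # stand-in for a model prediction
    print(redact(note, merge_spans(model_output, rule_spans(note))))
```

Giving model spans priority over rule spans is one possible design choice; the opposite priority, or a union of overlapping spans, may suit other data better.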