Healthcare and life sciences organizations deal with an extraordinary diversity of data formats that extend far beyond traditional structured data. Medical imaging standards like DICOM, proprietary laboratory instruments, genomic sequencing outputs, and specialized biomedical file formats represent a significant challenge for traditional data platforms. While Apache Spark™ provides robust support for approximately 10 standard data source types, the healthcare domain requires access to hundreds of specialized formats and protocols.
Medical images, encompassing modalities like CT, X-Ray, PET, Ultrasound, and MRI, are essential to many diagnostic and treatment processes in healthcare in specialties ranging from orthopedics to oncology to obstetrics. The challenge becomes even more complex when these medical images are compressed, archived, or stored in proprietary formats that require specialized Python libraries for processing.
DICOM files contain a header section of rich metadata. There are over 4,200 standard-defined DICOM tags, and some customers implement custom metadata tags as well. The "zipdcm" data source was built to speed the extraction of these metadata tags.
Healthcare organizations often store medical images in compressed ZIP archives containing thousands of DICOM files. Processing these archives at scale has typically required multiple steps: unzipping each archive to temporary storage, processing the extracted DICOM files (for example, with UDFs), and cleaning up the temporary files afterward.
Databricks has released a Solution Accelerator, dbx.pixels, which makes integrating hundreds of imaging formats easy at scale. However, the process can still be slow due to disk I/O operations and temporary file handling.
The new Python Data Source API solves this by enabling direct integration of healthcare-specific Python libraries into Spark's distributed processing framework. Instead of building complex ETL pipelines to first unzip files and then process them with User Defined Functions (UDFs), you can process compressed medical images in a single step.
A custom data source built with the Python Data Source API, combining ZIP file extraction with DICOM processing, delivers impressive results: 7x faster processing compared to the traditional approach.
The "zipdcm" reader processed 1,416 zip archives containing 107,000+ DICOM files at 2.43 core-seconds per DICOM file; independent testers reported 10x faster performance. On a cluster with two worker nodes of 8 vCores each, the wall-clock time to run the "zipdcm" reader was only 3.5 minutes.
By leaving the source data zipped rather than expanding the archives, we also realized a remarkable 57x reduction in cloud storage costs (70 GB zipped vs. 4 TB unzipped).
Here's how to build a custom data source that processes ZIP files containing DICOM images; the full implementation is available on GitHub.
The crux of reading DICOM files in a zip file (original source) is a loop over the archive's members.
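The snippet below is a minimal sketch of that loop, assuming pydicom is available on the cluster; the helper name read_dicom_members is illustrative, and the published source differs in its details:

```python
import zipfile
import pydicom

def read_dicom_members(zip_path: str):
    """Yield (member_name, header_metadata) for each DICOM file in a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        for member in zf.namelist():
            if member.endswith("/"):  # skip directory entries
                continue
            # zip_fp is the file handle of the file inside the zip archive
            with zf.open(member) as zip_fp:
                # stop_before_pixels=True parses only the few-kilobyte header,
                # never the bulky PixelData element
                ds = pydicom.dcmread(zip_fp, stop_before_pixels=True, force=True)
                yield member, {str(elem.tag): str(elem.value) for elem in ds}
```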
Alter this loop to process other types of files nested inside a zip archive; zip_fp is the file handle of the file inside the zip archive. With the code snippet above, you can start to see how individual zip archive members are individually addressed.
A few important aspects of this code design:
- The reader returns rows with yield, a memory-efficient technique: we never accumulate the entirety of the metadata in memory, and the metadata of a single DICOM file is just a few kilobytes.
- With additional modifications to the partitions() method, you can even have multiple Spark tasks operate on the same zip file; a sketch of such an override follows this list. For DICOM, zip archives are typically used to keep the individual slices or frames of a 3D scan together in one file.
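As an illustration, here is what such an override could look like, assuming one partition per archive and reusing the read_dicom_members generator from the sketch above; the class name and the comma-separated paths option are assumptions, and the shipped zipdcm reader is organized differently:

```python
from pyspark.sql.datasource import DataSourceReader, InputPartition

class ZipDcmReader(DataSourceReader):
    """Illustrative reader: one Spark task per zip archive."""

    def __init__(self, options: dict):
        # 'paths' is a hypothetical option holding a comma-separated archive list
        self.paths = options.get("paths", "").split(",")

    def partitions(self):
        # One InputPartition per archive; to let several tasks share one large
        # archive, emit multiple partitions that each carry a slice of the
        # archive's member list instead.
        return [InputPartition(path) for path in self.paths]

    def read(self, partition: InputPartition):
        # Reuse the generator from the earlier snippet; rows stream out lazily
        for member, meta in read_dicom_members(partition.value):
            yield (partition.value, member, str(meta))
```

One partition per archive also preserves locality: all slices of a 3D scan stay within a single task.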
Overall, at a high level, the data source is registered under its name (<name_of_data_source>) and then read with spark.read.format(<name_of_data_source>), as shown in the code snippet below:
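A minimal registration-and-read sketch, assuming a stand-in data source class ZipDcmDataSource with the short name "zipdcm" and an illustrative schema and path; the actual class in the repository is more complete:

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource

class ZipDcmDataSource(DataSource):
    """Illustrative stand-in for the real zipdcm data source class."""

    @classmethod
    def name(cls):
        return "zipdcm"  # the short name used in spark.read.format(...)

    def schema(self):
        return "path string, file_name string, meta string"

    def reader(self, schema):
        return ZipDcmReader(self.options)  # reader sketched earlier

spark = SparkSession.builder.getOrCreate()

# Register the custom Python data source with this Spark session
spark.dataSource.register(ZipDcmDataSource)

# One-step read: every DICOM header in the folder, zipped or bare
# (the path below is illustrative)
df = spark.read.format("zipdcm").load("/Volumes/main/pixels/midi_b_data")
df.show()
```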
Where the data folder looks like this (the data source can read both bare and zipped .dcm files):
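For example, an input folder might look like the following (the file names are illustrative):

```
/Volumes/main/pixels/midi_b_data/
├── study_0001.zip        # archive holding hundreds of .dcm slices
├── study_0002.zip
└── standalone_slice.dcm  # bare, unzipped DICOM file
```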
A number of factors contribute to the 7x improvement achieved by implementing a custom data source with the Python Data Source API. They include the following:
"60003000,7FE00010,00283010,00283006")
zipdcm
reader, we save 210,000+ individual file IOs.Taken together, these optimizations shift the bottleneck from disk and network I/O to pure CPU parsing, delivering an observed 7× reduction in end‑to‑end runtime on the reference dataset while keeping memory usage predictable and bounded.
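As an illustration, a header-only parse that skips those tags could look like the following pydicom sketch; the function name and the exact skipping mechanism are assumptions, not the published implementation:

```python
import pydicom

# Bulk binary elements to skip: Overlay Data (6000,3000), Pixel Data
# (7FE0,0010), VOI LUT Sequence (0028,3010), and LUT Data (0028,3006)
SKIP_TAGS = {0x60003000, 0x7FE00010, 0x00283010, 0x00283006}

def read_light_metadata(fp):
    """Parse only the lightweight DICOM header, never the bulk pixel data."""
    ds = pydicom.dcmread(fp, stop_before_pixels=True, force=True)
    return {str(elem.tag): str(elem.value)
            for elem in ds
            if elem.tag not in SKIP_TAGS}
```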
The Python Data Source API opens access to the rich ecosystem of healthcare and life sciences Python packages, in domains ranging from medical imaging to genomic sequencing to specialized laboratory and biomedical file formats.
Each of these domains has mature, battle-tested Python libraries that can now be integrated into scalable Spark pipelines. Python's dominance in healthcare data science finally translates to production-scale data engineering.
In summary, the Python Data Source API, combined with Apache Spark, significantly improves medical image ingestion: DICOM file indexing and hashing runs 7x faster, 100,000+ DICOM files are processed in under four minutes, and storage is reduced 57x. With the market for radiology imaging analytics valued at over $40 billion annually, these performance gains are an opportunity to lower cost while speeding the automation of imaging workflows. We thank the creators of the benchmark dataset used in this study:
Rutherford, M. W., Nolan, T., Pei, L., Wagner, U., Pan, Q., Farmer, P., Smith, K., Kopchick, B., Laura Opsahl-Ong, Sutton, G., Clunie, D. A., Farahani, K., & Prior, F. (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) (Version 1) [Dataset]. The Cancer Imaging Archive. https://doi.org/10.7937/CF2P-AW56
Try out the data sources ("fake", "zipcsv", and "zipdcm") with the supplied sample data, all found here: https://github.com/databricks-industry-solutions/python-data-sources
Reach out to your Databricks account team to share your use case and strategize on how to scale up the ingestion of your favorite data sources for your analytic use cases.