Building a FHIR-native health data platform on Databricks Lakebase

by Marcin Jimenez, Aleksandr Kislitsyn and Nikolai Ryzhikov

  • Health Samurai standardizes clinical data from HL7v2, C-CDA, and X12 into FHIR at ingestion, with terminology normalization and patient deduplication baked in
  • Aidbox runs natively on Databricks Lakebase, making FHIR data instantly available for Spark, ML, and AI without ETL or data movement
  • The architecture delivers compliance with CMS-0057 and ONC mandates as a byproduct—not a separate workstream

Healthcare data lives in dozens of systems (EHRs, claims, labs, pharmacy, SDoH), each with its own formats, codes, and duplicates. Turning this fragmented landscape into a unified, FHIR-standardized, and trusted data foundation is a key step toward better outcomes, smarter operations, and regulatory readiness. In this blog, you’ll learn how Health Samurai and Databricks give you the technologies to build that foundation on open standards, at any scale.

Today, intelligent healthcare applications don't live at the edge of the business. They run the business: closing care gaps proactively, powering real-time member engagement, and ensuring regulatory compliance by design. But these applications demand a data foundation that most healthcare organizations have struggled to build: one that is standardized, governed, and accessible to every tool in the stack without moving data between systems.

What if your operational intelligence and your analytics capabilities were unified and truly interoperable, driving the same insights?

The challenge: Fragmented data, fragmented governance

Healthcare's data landscape is uniquely complex. Patient information is spread across HL7v2 messages, C-CDA documents, X12 transactions, and proprietary formats, each system encoding the same clinical concepts differently. A single diagnosis may appear under multiple codes across multiple vocabularies. A single patient may exist as several records across several systems.

The traditional approach to unifying this data involves standing up a FHIR server for interoperability, a separate data warehouse for analytics, and a web of ETL pipelines connecting the two. Each system maintains its own access controls, audit trails, and compliance posture. 

This duplication is costly. The same clinical data is replicated across the FHIR server, the warehouse, and multiple staging layers — each adding storage, compute, and operational overhead. Meanwhile, the FHIR server itself often becomes a bottleneck. Most implementations were designed for transactional use cases — document exchange, point lookups, regulatory APIs — not for the access patterns of modern analytics, ML pipelines, or AI agents that need to scan millions of resources efficiently.

As a result, organizations are forced into trade-offs: over-provision FHIR infrastructure to maintain performance, or extract data into yet another system to make it usable.

The outcome is predictable: slow data movement, fragmented governance, and stalled AI initiatives, because models can’t reliably access clean, trusted, and well-governed data where it’s needed. Costs increase while flexibility decreases; you can’t build intelligent care applications on top of siloed, inconsistent, and poorly governed data.

The vision: One dataset, every tool, no data movement

Imagine a single platform where clinical data is standardized to FHIR at the point of entry — where that same data, without any movement or transformation, is immediately available for Spark analytics, ML models, AI agents, and BI dashboards. Where compliance isn't a separate workstream but a natural property of the architecture. Where every tool, from the EHR to the data scientist's notebook, sees the same governed, trusted data.

This is what Health Samurai and Databricks have built together.

How it works: Health Samurai

Aggregate and standardize

The first mile of data quality determines the last mile of insight. Health Samurai provides the technologies and expertise to collect and standardize data from diverse sources into a unified, FHIR-native data foundation.

Everything in this layer is built with interoperability in mind. Data formats and APIs are based on HL7 and X12 — including FHIR R4/R5, HL7 v2, C-CDA, and X12. Clinical meaning is represented using widely adopted code systems such as LOINC, SNOMED CT, RxNorm, and ICD-10. Conformance to specific use cases is defined through FHIR Implementation Guides like US Core, CARIN Blue Button, Da Vinci PDex, and mCODE — with additional code systems and IGs incorporated as regulations and partner requirements evolve.
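To make the role of code systems concrete, here is a minimal sketch of a FHIR Condition whose diagnosis is carried in two vocabularies at once. The system URIs are the standard FHIR identifiers for SNOMED CT and ICD-10-CM; the patient reference, the specific codes chosen, and the helper `code_for_system` are illustrative assumptions, not part of any Health Samurai API.

```python
from typing import Optional

# Standard FHIR system URIs for two of the code systems named above.
SNOMED_CT = "http://snomed.info/sct"
ICD_10_CM = "http://hl7.org/fhir/sid/icd-10-cm"

# Illustrative Condition resource; the patient reference is hypothetical.
condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/example"},
    "code": {
        "coding": [
            {"system": SNOMED_CT, "code": "44054006",
             "display": "Diabetes mellitus type 2"},
            {"system": ICD_10_CM, "code": "E11.9",
             "display": "Type 2 diabetes mellitus without complications"},
        ],
        "text": "Type 2 diabetes",
    },
}


def code_for_system(resource: dict, system: str) -> Optional[str]:
    """Return the first code from `system` in resource.code.coding, if any."""
    for coding in resource.get("code", {}).get("coding", []):
        if coding.get("system") == system:
            return coding.get("code")
    return None


print(code_for_system(condition, SNOMED_CT))
print(code_for_system(condition, ICD_10_CM))
```

Because both codings sit on the same resource, a consumer that speaks only one vocabulary can still read the diagnosis without a second copy of the data.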

This is a deliberate architectural choice, not a checkbox. Open standards ensure your data model isn’t locked into a single vendor. The same FHIR resources that power interoperability today can support analytics, AI, and future applications without rework. Switching tools shouldn’t require re-modeling your data.

Key capabilities include:

  • Open-source HL7v2, C-CDA, and X12 converters transform legacy data into FHIR — the modern standard for healthcare interoperability.
  • FHIR-native Terminology Server normalizes codes across vocabularies, ensuring one diagnosis is counted once regardless of source system.
  • MDM/MPI (Master Data Management / Master Patient Index) deduplicates patient records so one patient equals one golden record.
  • FHIR Implementation Guides and Validation enforce data quality and conformance at the point of entry — not after the fact.
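As a rough illustration of what legacy-to-FHIR conversion involves, the sketch below maps a single HL7v2 PID segment to a FHIR Patient. It is a toy, not the Health Samurai converter: the function name `pid_to_patient` and the sample segment are assumptions, and a real converter also handles escape sequences, field repetitions, the remaining segments, and identifier systems.

```python
# Maps HL7v2 administrative sex codes (PID-8) to FHIR Patient.gender values.
GENDER_MAP = {"M": "male", "F": "female", "O": "other", "U": "unknown"}


def pid_to_patient(pid_segment: str) -> dict:
    """Convert one pipe-delimited PID segment into a minimal FHIR Patient."""
    fields = pid_segment.split("|")

    def field(n: int) -> str:
        # PID-n sits at index n because index 0 holds the segment name.
        return fields[n] if len(fields) > n else ""

    patient = {"resourceType": "Patient"}

    identifier = field(3).split("^")[0]          # PID-3: patient identifier
    if identifier:
        patient["identifier"] = [{"value": identifier}]

    name_parts = field(5).split("^")             # PID-5: family^given
    family = name_parts[0]
    given = name_parts[1] if len(name_parts) > 1 else ""
    if family or given:
        name = {}
        if family:
            name["family"] = family
        if given:
            name["given"] = [given]
        patient["name"] = [name]

    dob = field(7)                               # PID-7: YYYYMMDD
    if len(dob) >= 8 and dob[:8].isdigit():
        patient["birthDate"] = f"{dob[:4]}-{dob[4:6]}-{dob[6:8]}"

    patient["gender"] = GENDER_MAP.get(field(8), "unknown")  # PID-8
    return patient


sample = "PID|1||12345^^^HOSP^MR||Doe^Jane||19800214|F"
print(pid_to_patient(sample))
```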

The result is clean, standardized FHIR data with a single golden record per patient. Quality and transparency are built in from the start, not added after the fact.
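The golden-record idea can be sketched in a few lines. This example assumes a deliberately naive deterministic match on normalized name plus birth date; production MPI systems typically use probabilistic or rule-based matching with manual review, and the functions `match_key` and `golden_records`, along with the sample identifiers, are hypothetical.

```python
from collections import defaultdict


def match_key(patient: dict) -> tuple:
    """Build a naive match key: lowercased family and given name plus birth date."""
    name = patient["name"][0]
    family = name.get("family", "").strip().lower()
    given = (name.get("given") or [""])[0].strip().lower()
    return (family, given, patient.get("birthDate", ""))


def golden_records(patients: list) -> list:
    """Group patients by match key and merge each group's identifiers."""
    groups = defaultdict(list)
    for patient in patients:
        groups[match_key(patient)].append(patient)

    merged = []
    for (_, _, birth_date), dupes in groups.items():
        identifiers = []
        for patient in dupes:
            for ident in patient.get("identifier", []):
                if ident not in identifiers:
                    identifiers.append(ident)
        merged.append({
            "resourceType": "Patient",
            "name": [dupes[0]["name"][0]],   # keep the first-seen name
            "birthDate": birth_date,
            "identifier": identifiers,       # union of source identifiers
        })
    return merged


patients = [
    {"resourceType": "Patient", "birthDate": "1980-02-14",
     "name": [{"family": "Doe", "given": ["Jane"]}],
     "identifier": [{"system": "urn:example:ehr", "value": "A1"}]},
    {"resourceType": "Patient", "birthDate": "1980-02-14",
     "name": [{"family": "DOE ", "given": ["jane"]}],
     "identifier": [{"system": "urn:example:claims", "value": "C9"}]},
    {"resourceType": "Patient", "birthDate": "1975-07-01",
     "name": [{"family": "Roe", "given": ["Sam"]}],
     "identifier": [{"system": "urn:example:ehr", "value": "A2"}]},
]
golden = golden_records(patients)
print(len(golden))
```

The two "Doe" records collapse into one record that carries both source identifiers, so downstream counts see one patient while lineage back to each source system is preserved.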

Health Samurai helps configure these pipelines and tools for each organization's specific data landscape.

Access everywhere — Zero ETL

This is where the architecture becomes transformative. Aidbox — Health Samurai's FHIR Server and Database — runs natively on Databricks Lakebase.

Lakebase is a fully managed, serverless Postgres database integrated into the Databricks Data Intelligence Platform. Because Aidbox runs directly on Lakebase, FHIR data is immediately available across the full Databricks toolkit, with no ETL required.

Data is replicated through Moonlink, a real-time synchronization engine between operational and analytical formats. This allows FHIR data to flow seamlessly into the analytical layer, eliminating the dependency on pipelines and transformation jobs, along with the delays they introduce.

This creates two complementary access patterns from a single dataset, powering both your analytics and your operational workloads:

  1. Databricks-native access: Spark, SQL, ML, AI/BI — for analytics, data science, and AI
  2. Standards-based access: FHIR API, SMART on FHIR, and SQL on FHIR ViewDefinitions (a new HL7 standard that flattens nested FHIR resources into tabular views for analytics)
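To give a feel for what "flattening nested FHIR resources into tabular views" means, here is a heavily simplified sketch. The `view` dictionary is only shaped like a SQL on FHIR ViewDefinition, and the path evaluator is an assumption of this example: it handles plain dotted paths and takes the first element of any list, whereas the actual specification uses FHIRPath expressions with richer semantics. Neither function is an Aidbox or Databricks API.

```python
# A structure loosely shaped like a ViewDefinition (simplified for illustration).
view = {
    "resource": "Patient",
    "select": [{"column": [
        {"name": "id", "path": "id"},
        {"name": "family_name", "path": "name.family"},
        {"name": "birth_date", "path": "birthDate"},
    ]}],
}


def first_value(node, path: str):
    """Walk a dotted path, taking the first element whenever a list appears."""
    for key in path.split("."):
        if isinstance(node, list):
            node = node[0] if node else None
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    if isinstance(node, list):
        node = node[0] if node else None
    return node


def flatten(view: dict, resources: list) -> list:
    """Produce one flat row per resource of the view's resource type."""
    columns = [col for sel in view["select"] for col in sel["column"]]
    return [
        {col["name"]: first_value(res, col["path"]) for col in columns}
        for res in resources
        if res.get("resourceType") == view["resource"]
    ]


resources = [
    {"resourceType": "Patient", "id": "p1", "birthDate": "1980-02-14",
     "name": [{"family": "Doe", "given": ["Jane"]}]},
    {"resourceType": "Observation", "id": "o1"},
]
rows = flatten(view, resources)
print(rows)
```

Each row is an ordinary record with scalar columns, which is the shape that SQL engines and BI tools expect.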

What you can build

With unified FHIR data and the combined power of Health Samurai and Databricks, organizations can flexibly address their specific challenges:

EHR optimization and value-based care

Clinical and administrative decision support powered by Databricks AI connects back to EHR and billing workflows through SMART on FHIR and CDS Hooks. This enables:

  • HEDIS/STARS scoring and quality measurement
  • Risk adjustment and HCC capture optimization
  • Contract analytics and shared savings tracking
  • Agentic AI that closes care gaps proactively — not retrospectively

The FHIR-native foundation means insights flow directly back to clinicians at the point of care, embedded in their existing workflows.
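As one possible shape for such an insight, the sketch below builds a CDS Hooks response containing a single card. The `summary`, `indicator`, `source`, and `detail` fields follow the CDS Hooks card structure; the function `care_gap_card`, the 12-month threshold, and the wording are assumptions made for illustration, not output of any actual model.

```python
import json


def care_gap_card(measure: str, months_overdue: int) -> dict:
    """Build a CDS Hooks card describing an open care gap (illustrative)."""
    # Hypothetical rule: escalate to "warning" once a gap is a year old.
    indicator = "warning" if months_overdue >= 12 else "info"
    summary = f"Care gap: {measure} overdue by {months_overdue} months"
    return {
        "summary": summary[:140],  # card summaries are limited to 140 characters
        "indicator": indicator,    # one of: info, warning, critical
        "source": {"label": "Care gap model (hypothetical)"},
        "detail": f"No {measure} found in the last {months_overdue} months.",
    }


response = {"cards": [care_gap_card("HbA1c test", 14)]}
print(json.dumps(response, indent=2))
```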

Member engagement at scale

Build meaningful relationships with patients and members through:

  • Patient portals with FHIR API as the backbone — standards-compliant by design
  • Personalized outreach at scale using propensity models on Databricks to determine the right channel, message, and timing for millions of members
  • Patient Access API included as a natural property of the architecture
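For a sense of what "FHIR API as the backbone" looks like from a portal's side, here is a small sketch that builds a standard FHIR search URL for a patient's most recent lab results. The base URL and the helper `lab_results_url` are hypothetical; the search parameters (`patient`, `category`, `_sort`, `_count`) are standard FHIR search parameters for Observation.

```python
from urllib.parse import urlencode

# Hypothetical FHIR server base URL; replace with a real endpoint.
FHIR_BASE = "https://fhir.example.org/fhir"


def lab_results_url(patient_id: str, count: int = 10) -> str:
    """Build a FHIR search URL for a patient's latest laboratory Observations."""
    params = {
        "patient": patient_id,
        "category": "laboratory",
        "_sort": "-date",   # newest first
        "_count": count,    # page size
    }
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"


print(lab_results_url("123"))
```

A portal would send this request with the member's access token and render the returned Bundle; authorization is omitted here to keep the sketch short.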

Compliance — built in, not bolted on

By building on FHIR, organizations address mandates like CMS-0057 (Interoperability and Prior Authorization) and ONC requirements as a natural property of their architecture:

  • Patient Access Rule compliance
  • Payer-to-Payer data exchange
  • ONC Health IT Certification readiness

Compliance is not a separate project; it's a byproduct of doing things right.

Why this matters now

CMS and ONC regulatory deadlines are fast approaching, and AI is moving from pilots to production — but only on trusted, governed data. The traditional approach of maintaining a separate FHIR server, a separate analytics platform, and ETL pipelines connecting the two is too slow, too expensive, and too fragile for the demands of modern healthcare.

Lakebase future-proofs your interoperability investments. Your FHIR server runs on your Data Intelligence Platform. Your clinical operations and your analytics share the same source of truth. Unity Catalog governs everything from operational data to insights and AI. And open standards give you flexibility without vendor lock-in.

Get started

Health Samurai and Databricks — open technologies for your Health Data Platform.
