Announcing simplified XML data ingestion

Published: May 23, 2024

by Elise Georis, Peter Pogorski, Sandip Agarwala, Shujing Yang and Ori Zohar

We're excited to announce native support in Databricks for ingesting XML data.

XML is a popular file format for representing complex data structures across use cases in manufacturing, healthcare, law, travel, finance, and more. As these industries find new opportunities for analytics and AI, they increasingly need to put their troves of XML data to work. Databricks customers ingest this data into the Data Intelligence Platform, where capabilities like Mosaic AI and Databricks SQL can then be used to drive business value.

However, building resilient XML pipelines can take a lot of work. Because XML files are semi-structured and can be arbitrarily large, they are often complex to process. Until now, XML ingestion has required either open source packages or conversion of the XML into another file format, and either approach leaves data engineers maintaining complex pipelines.

To streamline that process, we've developed native support for XML files within Auto Loader and COPY INTO. (Note that Auto Loader for XML works with Delta Live Tables and Databricks Workflows.) This support enables direct ingestion, querying, and parsing without any external packages or file type conversions. Users can also take advantage of powerful capabilities like schema inference and evolution in Auto Loader.
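As a rough sketch of the COPY INTO path, the statement can be assembled in Python and submitted with spark.sql. The table name, source path, and book row tag below are hypothetical placeholders, the option names reflect our reading of the Databricks documentation, and the snippet assumes a runtime that includes native XML support and a target table that already exists.

```python
def copy_into_xml_sql(table, source_path, row_tag):
    """Build a COPY INTO statement that loads XML files into an existing Delta table."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        "FILEFORMAT = XML\n"
        # rowTag names the XML element that should become one row.
        f"FORMAT_OPTIONS ('rowTag' = '{row_tag}', 'mergeSchema' = 'true')\n"
        # Let the target table's schema evolve as new fields appear.
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )


# Hypothetical names, for illustration only.
statement = copy_into_xml_sql(
    "main.default.books", "/Volumes/main/default/raw/books/", "book"
)
print(statement)
```

In a Databricks notebook, where spark is predefined, the statement would then be run with spark.sql(statement).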

Example 1: Ingest an XML file for batch workloads

For a sample input file containing the following XML:

The query above infers the following schema and parsed result:
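For readers who want to try a batch read themselves, here is a minimal sketch in PySpark. It is not the listing from the original example: the file path and the book row tag are hypothetical, and the code assumes a Databricks runtime where the xml format is available natively.

```python
# Hypothetical values for illustration; not taken from the original example.
SAMPLE_PATH = "/Volumes/main/default/raw/books.xml"
ROW_TAG = "book"  # the XML element that delimits one record


def xml_batch_options(row_tag):
    """Reader options for a batch XML read; rowTag maps each such element to one row."""
    return {"rowTag": row_tag}


def read_xml_batch(spark, path, row_tag):
    """Read XML files into a DataFrame, letting Spark infer the schema."""
    return (
        spark.read.format("xml")
        .options(**xml_batch_options(row_tag))
        .load(path)
    )
```

In a notebook, df = read_xml_batch(spark, SAMPLE_PATH, ROW_TAG) followed by df.printSchema() would show the schema that was inferred from the files.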

Customers also benefit from new, XML-specific features. For example, they can now validate each row-level XML record against an XML schema definition (XSD). They can also use the from_xml Apache Spark function to parse XML strings that are embedded in SQL columns or that arrive from streaming data sources such as Apache Kafka or Amazon Kinesis.
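To illustrate the from_xml case, the sketch below parses a string column that holds XML documents by calling the SQL function through selectExpr. The column name and schema string are hypothetical, and the helper does no escaping, so it assumes the schema string contains no single quotes.

```python
def from_xml_expr(xml_col, schema_ddl, alias="parsed"):
    """Build a SQL expression that parses an XML string column into a struct."""
    # schema_ddl is a DDL-style schema string, e.g. "author STRING, price DOUBLE".
    return f"from_xml({xml_col}, '{schema_ddl}') AS {alias}"


def parse_xml_column(df, xml_col, schema_ddl):
    """Return df with an extra struct column holding the parsed XML."""
    return df.selectExpr("*", from_xml_expr(xml_col, schema_ddl))


# Hypothetical column and schema, for illustration only.
print(from_xml_expr("payload", "author STRING, title STRING, price DOUBLE"))
```

The same expression could be applied to a streaming DataFrame, for example after casting a Kafka message value to a string.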

Example 2: Ingest an XML file using Auto Loader for streaming workloads

This example demonstrates schema inference, schema evolution, and XSD validation.
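A sketch of how those three pieces might fit together in PySpark follows. It is not the original listing: every path, the table name, and the row tag are supplied by the caller and are hypothetical, and the option names reflect our understanding of the Auto Loader and XML reader options, so they are worth checking against the documentation for your runtime.

```python
def autoloader_xml_options(row_tag, schema_location, xsd_path=None):
    """Auto Loader options for XML with schema inference and evolution."""
    options = {
        "cloudFiles.format": "xml",
        "rowTag": row_tag,
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Add newly seen columns to the schema; the stream stops when this
        # happens and picks up the new columns after a restart.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }
    if xsd_path is not None:
        # Validate each row-level record against this XSD file.
        options["rowValidationXSDPath"] = xsd_path
    return options


def start_xml_stream(spark, source_dir, target_table, checkpoint_dir,
                     row_tag, xsd_path=None):
    """Incrementally load XML files from source_dir into a Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_xml_options(row_tag, checkpoint_dir, xsd_path))
        .load(source_dir)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)
        .option("mergeSchema", "true")  # let the target table accept new columns
        .trigger(availableNow=True)     # process all available files, then stop
        .toTable(target_table)
    )
```

Keeping the options in a plain dictionary makes them easy to inspect or reuse, for instance when the same source is defined inside a Delta Live Tables pipeline.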

XML data ingestion at Lufthansa

Lufthansa Industry Solutions ingests XML data sources for their Lufthansa Cargo data solution, built on the Data Intelligence Platform. The new XML support has helped the team streamline ingestion and automate much of the data engineering burden. As a result, practitioners can focus on innovation, instead of maintaining complex pipelines.

Next Steps

Native XML support is now in Public Preview on all cloud platforms and is available in both Delta Live Tables and Databricks SQL. Learn more by reading the documentation.
