Activating aviation data with real-time ML
In an effort to comply with updated Bureau of Transportation requirements, the U.S. Department of Transportation (USDOT) set out to build a commercial flight database system to accurately measure and report on aviation system performance in real time. Using the Databricks Lakehouse Platform, the USDOT was able to unify its data and analytics, allowing it to feed real-time information to aviation dashboards and deliver ML-powered insights and decisions. Ultimately, the speed and efficiencies afforded by Databricks allow the USDOT to make accurate predictions about the resources needed to address changes in aviation performance, as well as the resulting impact of fluctuations in air traffic on its operations and the passenger experience.
Connecting divided data, forecasting the future
Commercial airlines typically serve millions of people across the country each year. To help airlines better serve each of those passengers and develop more valuable assessments of their flight operations, the U.S. Department of Transportation needed to unlock aviation data and began developing a real-time, internal commercial aviation flight database (CFD).
By leveraging public data from SWIM, the Federal Aviation Administration’s System-Wide Information Management database, including weather, flight, aeronautical and surveillance information, the USDOT hoped to more accurately predict future air cargo traffic patterns and deliver clearer financial forecasts to commercial airlines. But combining multiple data services, streams and components from various publishers is complicated, particularly in an industry that has traditionally used on-premises platforms to manage data, rather than cloud-based solutions like Databricks.
“Building a big data platform is always challenging,” says Mehdi Hashemipour, a USDOT data scientist. “We needed to know what big data we were working on and what infrastructure was necessary to handle it. Designing such a system using various technologies is very complex and, since we know that big data isn’t 100% reliable all the time, it wasn’t nearly as precise as our statistical reporting demanded. We needed to be able to control the quality of the data, and closely monitor any data being processed in real-time.”
The system also needed to scale quickly, without degrading performance and without any heavy lifting or manual review by the USDOT data team.
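As a rough illustration of the kind of quality control described above, the sketch below checks incoming records against a few rules and sets aside any that fail, so that they can be monitored rather than silently dropped. It is not the USDOT's implementation: the FlightMessage fields, the validation rules and the function names are all hypothetical, and real SWIM messages are considerably richer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass
class FlightMessage:
    """Hypothetical, simplified flight record; not the real SWIM schema."""
    flight_id: str
    origin: str
    destination: str
    scheduled_departure: Optional[datetime]
    actual_departure: Optional[datetime]


def validate(msg: FlightMessage) -> List[str]:
    """Return the rule violations for one record; an empty list means it passes."""
    problems: List[str] = []
    if not msg.flight_id.strip():
        problems.append("missing flight_id")
    for field_name in ("origin", "destination"):
        code = getattr(msg, field_name)
        # Assumed rule for this sketch: airports are 3-letter alphabetic codes.
        if len(code) != 3 or not code.isalpha():
            problems.append(f"bad {field_name} code: {code!r}")
    if msg.scheduled_departure is None:
        problems.append("missing scheduled_departure")
    return problems


def split_by_quality(
    messages: Iterable[FlightMessage],
) -> Tuple[List[FlightMessage], List[Tuple[FlightMessage, List[str]]]]:
    """Separate clean records from records that need review."""
    accepted: List[FlightMessage] = []
    quarantined: List[Tuple[FlightMessage, List[str]]] = []
    for msg in messages:
        problems = validate(msg)
        if problems:
            # Keep the reasons with the record so failures can be monitored.
            quarantined.append((msg, problems))
        else:
            accepted.append(msg)
    return accepted, quarantined
```

Keeping the rejected records together with the reasons they failed, instead of discarding them, is one simple way to support the close monitoring the team asked for.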
Laying the groundwork for predictive analytics to take flight
The SWIM Cloud Distribution Service (SCDS) is a Federal Aviation Administration (FAA) cloud-based service that provides publicly available FAA SWIM content to FAA-approved consumers. To gain critical insight into on-time flights, delayed or canceled flights, and the number of passengers affected by each incident, the USDOT sought to leverage their public SWIM data sets, like traffic flow management and terminal data distribution, and run analytics across all their data to predict multiple scenarios related to flights, airports and passenger behavior.
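To make those questions concrete, here is a minimal sketch of how flights might be grouped into on-time, delayed and canceled, with the number of passengers affected in each group. The FlightOutcome record, the function names and the 15-minute delay cutoff are assumptions made for illustration; they are not taken from the USDOT system or from the SWIM data sets.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Assumed cutoff for this sketch only; the source does not state a threshold.
DELAY_THRESHOLD_MINUTES = 15


@dataclass
class FlightOutcome:
    """Hypothetical per-flight summary record."""
    flight_id: str
    passengers: int
    canceled: bool
    departure_delay_minutes: int


def classify(flight: FlightOutcome) -> str:
    """Label a flight as 'canceled', 'delayed' or 'on_time'."""
    if flight.canceled:
        return "canceled"
    if flight.departure_delay_minutes >= DELAY_THRESHOLD_MINUTES:
        return "delayed"
    return "on_time"


def summarize(flights: Iterable[FlightOutcome]) -> Dict[str, Dict[str, int]]:
    """Count flights and affected passengers for each status."""
    flights_by_status: Counter = Counter()
    passengers_by_status: Counter = Counter()
    for flight in flights:
        status = classify(flight)
        flights_by_status[status] += 1
        passengers_by_status[status] += flight.passengers
    return {
        "flights": dict(flights_by_status),
        "passengers": dict(passengers_by_status),
    }
```

In a streaming setting the same grouping would typically run continuously over windows of incoming records rather than over an in-memory list.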
After careful evaluation, the USDOT chose Microsoft Azure and the Databricks Lakehouse Platform because they made it possible to democratize access to all of its data sources, from on-premises databases like Oracle and Sybase to real-time streams delivered via Apache Kafka.
“Databricks gives users an efficient way to create job schedules, scale clusters and run notebooks for analytics and BI,” said Hashemipour.
With Delta Lake, the USDOT is now able to easily leverage all its data and build reliable and performant data pipelines for downstream analytics and machine learning workloads. With both batch and streaming data flowing freely, the team can feed information to key tools like Tableau and Power BI, which visualize analytic insights, and to machine learning models that help them better understand and predict traffic flow patterns and passenger needs.
The USDOT data team is more productive these days, too. With Databricks, everyone who works with the data, from data engineers to data scientists and analysts, is able to collaborate more effectively. “We now have a very efficient way to work,” said Hashemipour. “It’s simple to create job schedules, run clusters and notebooks, and organize and analyze data from multiple sources. This has really accelerated our ability to deliver new capabilities that improve the overall passenger experience.”
Reaching for greater heights while keeping costs grounded
With Databricks, the USDOT developed a flexible data environment that can be deployed or extended anywhere to analyze additional SWIM or similarly streamed information at a lower cost. Through the cost efficiencies of the cloud and Databricks’ automated infrastructure capabilities, the USDOT has been able to implement a unique approach for data conversion and processing, reducing the cost of collecting and ingesting streaming data by 90% compared to other cloud-based solutions.
Databricks provided everything the USDOT data team needed to automate complex data ingestion and analysis from multiple streams for a variety of analytic and machine learning workloads downstream. Today, they can collate data with minimal effort, in multiple environments if needed, and make more accurate decisions about operations and performance.
“Now that we know what to build and what to automate and can safely and efficiently iterate on the system infrastructure, there is no limitation on how we can use real-time aviation data,” says Hashemipour.
As the system continues to evolve, the USDOT plans to advance its capabilities by opening up its data pipelines to additional analytical and machine learning workloads, further increasing the accuracy of data-driven predictions and decisions, and providing more targeted information to help commercial airlines better serve their passengers.