The System Wide Information Management (SWIM) Program is a National Airspace System (NAS)-wide information system that supports Next Generation Air Transportation System (NextGen) goals. SWIM facilitates the data-sharing requirements for NextGen, providing the digital data-sharing backbone of NextGen. The SWIM Cloud Distribution Service (SCDS) is a Federal Aviation Administration (FAA) cloud-based service that provides publicly available FAA SWIM content to FAA approved consumers via Solace JMS messaging. In this session we are going to show case the work we did at USDOT-BTS on Automating the required Infrastructure, Configuration, Ingestion and Analysis of the following public SWIM Data Sets:
SWIM provides a lot of information which with proper validation and analysis you can discover interesting insights like:
It could also be extended to analyze and predict multiple scenarios related to flights, airport and passenger behavior. Our initial solution leverages Azure Databricks, Apache Kafka and Spark-XML among others, but it is a flexible environment that could be extended to analyze more SWIM or similar streamed data sources, and since we are using Infrastructure as a code and Configuration Managements tools, it could be deployed and/or extended anywhere.
– Hello everyone, my name is Mehdi Hashemipour and I’m a Data Scientist at the US Department of Transportation.
I’m leading an ongoing project at the Bureau of Transportation Statistics, intending to build an internal commercial aviation flight database system. This presentation is about one of BTS efforts to achieve this goal by doing a proof of concept to automate the data ingestion and analysis of the FAA SWIM data using Microsoft Azure and Databricks.
So, I will give you an overview of the project and then Marcelo and Sheila will discuss the infrastructure, architecture and the data processing part respectively.
So, building the commercial flight database system helps BTS to measure and report aviation system performance. And some of the potential benefit of that is enabling timely estimate of enplanements and on-time performance, supporting especial aviation studies and providing a source of data to aviation dashboards and other statistical products, and finally serving as the aviation component of the transportation disruption and disaster system.
So most of these objectives could be achieved by using the FAA SWIM service as a major component of the commercial flight database system. Let’s just start with understanding what is the SWIM service? SWIM, it stands for system wide information management, which is the combination of multiple data services, including airports, flights, aeronautical and weather data.
SWIM is the digital data delivery platform that turns raw national aerospace system or NAS data into meaningful information for aviation stakeholders. Consumers can connect to the system with a single connection through a source FAA telecommunication system, where they can retrieve data coming in from producers in real time. So, in order to develop the system for the proof of concept, we used one of the SWIM services called TFMS data and TFMS or traffic flow management service, also referred to as air traffic management is the craft of managing the flow of air traffic in the national aerospace system, based on the capacity and demand. TFMS service provide aircraft situation display data and also it include aircraft scheduling, routing and positional information. So, a terminal data distribution system or STDS, is another SWIM service that convert legacy terminal data collected from airport towers and terminal radar approach control facilities into easily accessible information. So, STDS SWIM gives user access to data from over 200 airports. And this SWIM data comes from over almost 400 individual systems into a SWIM, with XML format data. And the volume of data that we are receiving, by turning on this SWIM stream is huge, for example, for one week course of TFMS data, we collected over 15 million XML messages, which takes terabytes of a space. And ingesting and processing these messages is challenging and require fast and reliable data processing workflow. Here in this slide is the architecture for different SWIM services, which consists of many services and data component that are getting a stream by various producers, where they are from FAA itself to other external publishers. And the main categories of data are including weather, flight and flow and many other aviation related data components. So, using the SWIM data and more specifically the TFMS data, would help BTS to answer questions like airport airline performance, airport time delays, on-time performance estimates by cause and so on.
Also air cargo traffic, economic impact of flights delays,
diversion and cancellations, such as financial impact of the flight delays and flight path analysis, operational and passenger impact of flight delays and cancellations such as flight pattern interruptions and late arriving flight patterns. And finally the passenger impact of flights delays, diversion, and cancellation.
So, the initial conceptual architecture that we design your environment, is consisting of getting connected to FAA SWIM service, through either a virtual machine approach or an HD side approach by using Apache Kafka for streaming direct similar messages, then processing these messages and storing them in a structure or semi-structure format into a data Lake, making the data ready for further analysis. To implement this system architecture, Marcelo will discuss more about the development phases system component and its deployment in the DOT Microsoft Azure environment. – Thanks Mehdi, so let’s talk about the infrastructure.
So, the initial goals from the infrastructure point of view, were clear since the beginning of the project. We wanted to help data scientists like Mehdi and Sheila getting access to a SWIM and internal on-prem data source, so they could do and they start their work as soon as possible. And not to worry about building the required infrastructure, all the post-provisioning configuration tasks that come with it, like system configuration, database resources, et cetera. Making sure that it’s secure for internal security standards. And of course should be able to scale based on demand or type of workload. In summary, we had to automate as much as we could.
So they could have a working environment quickly and ready to be used Based on these goals on the conceptual architecture, showed by Mehdi before, we came up with this physical architecture, which from the infrastructure point of view, gave us a pretty good idea of all the cloud resources and service that needed to be scripted and automated.
A couple of things to notice here, is that we are using the solace source connector in Kafka in order to ingest the data. Solace provides source and sink connectors that you can build and easily deploy them in your data cluster. And with some minor configurations, you are set to ingest basically any extreme data source you want. And best of all is open source. Second, we are using plain open source Apache Kafka, which makes the solution portable. If needed, you could easily deploy it somewhere else.
We focus on automating these three categories, infrastructure itself, which includes all the initial networking, subnets, security groups, VMs, Databricks workspaces, storage, everything. Second, Kafka, now we are on the post-provisioning configurations, which includes Kafka installation, configuration along with its software requirements like Java, for example. It all symbols extreme required configuration to access a specific data sources, having the right combination files in the right place. The solace connector to ingest the data and it’s required dependencies and libraries. And finally, Databricks, which involves cluster creation and configuration of required libraries, like Spark-XML to parse and analyze the data, import initial notebooks and setting up secrets to properly mounting access to data Lake.
At this point, we knew what we needed to build and what we wanted to automate.
Now we had to come up with the right tools. We started with Terraform for obvious reasons. So far is the best tool to have infrastructure as code, especially if you are in a multi-cloud environment at least thanks to it’s great number of providers and common language across all of them, has support for all the resources we wanted to build, allow us to reuse code using modules, customize it using variables like VM sizes, for example, that might change based on the environment and type of workload, maybe production is gonna require a beefy machine, whereas development is just a small test on a small VM. We can also review the changes before they are applied, which is great. And Terraform cloud, which allow us to store the state of this infrastructure, so we can have a distributed team, not just one person clicking through the portal.
Once the environment was ready, the next step was to do all the post-provisioning configurations. For that, we wanted to provide options. So we created chef cookbooks and Ansible roles, both provide the same functionality we needed, by building these configurations as code, it also works as documentation on how these services were configured in order to access the stream under latest sources. This’ in case we wanted to access different SWIM data sources than the initial ones we are using.
So we will be able to scale. We avoid configuration drip, which is a very common problem when you have multiple environments. In summary, we have consistency across all our resources and environment, which is a huge win. Basically no more snowflakes.
For Databricks’ specific configurations, We use Databricks CLI and this is still experimental, but it provided everything that we needed, like import initial notebooks, create clusters using Json templates, insert libraries like Spark-XML and much more.
Again, still under development, but very powerful, highly recommended if you want to interact with your Databricks environment, using CLI. I know the UI, which is perfect for automation.
So, since we are having everything as code, it just made sense to start in GitHub and use GitHub actions to orchestrate it.
GitHub actions allow us to create workflow automations for any OS, any language in any cloud. It even comes with some predefined actions that you can use as a starting point. Again, highly recommend if you want to orchestrate this automation pipelines without much effort.
So putting it altogether, we have this law where everything starts with the developers pushing code to GitHub.
So we can pull requests PRs, so their changes can be reviewed and approved. Behind the scenes GitHub actions workflows get triggered, which result in changes either the infrastructure, Kafka or Databricks configurations, you name it.
Finally, here are some things that we learned along the way, the initial set up might take some good time, having your tools like Terraform chef or on GitHub all of those properly configure with proper permissions and access takes some time. You need to have some sort of service account, with the right permissions to create, modify or delete cloud resources. There is also some learning curve, especially if you are not used to having infrastructure and configuration as code, and the GitHub law, having branches, submit PRs to review code before they supply it and things like that. There was a lot of tests and fail at the beginning, mostly because of security, since all the traffic has to be internal and some end points were not open or needed to be open. And those things just also take some time. But at the end, everything worked out just fine. Now we can deploy the solution easily with minimal effort, in multiple environments if it’s required or needed.
Finally, all this code is open source and can be found here. Please feel free to use it on extended.
Now for the best part of the presentation, Sheila is going to show us a cool demo of the end result, thank you. – Thank you, Marcelo. What I am going to show you now, is what we did with the TFMS data and how we processed with it.
So what you see here is our idea of what the future state architecture would be. And what that would be is to include, either relational databases. So something like Oracle and Sybase, which have historical data in it and couple that with the streaming data and bring that in to a raw Delta bronze table using Spark and Databricks runtime. And then from there batch processing to parse the XML data and do schema validation against the very complex nested TFMS schemas, and then do joins and aggregations from there to be able to really get power out of your data and be able to perform your analytics or predictive analytics, machine learning processes on top of that. One thing to note in the demonstration that I’m about to show you, is that we look specifically at the TFMS data stream. The future state, however, includes the other data streams coming in from SWIM data. And another thing to note is that the environment that I’m going to show you is a mirror of what was implemented at BTS. What you’re looking at now is the Databricks platform. The architecture from this slide that I just showed, we’ve implemented in a series of three notebooks within Databricks. So this is a simplified diagram of what we’re seeing. Again, we’ve mirrored, the BTS environment by hooking up from Kafka to the SWIM data, using the solace connector that Marcelo talked about. From there, we consume the messages into Databricks and immediately stream out to a bronze Delta level table. That’s the raw XML strings. From there, we in a batch processing mode where we specify a time period, we bring in those XML messages from the bronze table and we parse it using the Spark-XML package and write that out to our silver level Delta tables. From there, we can choose the time period, bring in the data from the silver level tables, during that period and do joins and aggregations of the different message types, to be able to start looking and doing our data exploration analytics against the data. We’re going to focus on this last line right here, where we’re going from silver to gold. But one thing I do want to point out before I jump right into there, is show you what the bronze level XML messages look like. And you’ll see that right here. That the XML messages are really hard to just look at and understand what’s going on in them. So the next step was to parse it using the Spark-XML package. If we scroll down, we can see what that would look like, for the messages themselves. Then we’re able to flatten them into a nice looking data frame, where we can easily see different data within the data frame itself. Now we’re going to jump right into the gold level table so we can see what the data looks like and understand these TFMS messages. One thing to note here, is that there are 20 different message types coming into TFMS and we are consuming them all and writing them to the silver level tables. So, what we’re going to do, is take that silver level data and enrich it with reference data. This is publicly available reference data on dealing with what the airport codes are, what cities are located in, as well as the airline information, what the codes are compared to the airlines. Now, if we scroll down, we can see that we take, we’re going to focus on departure arrival, as well as flight plan creation information and bring all of that data together. We can see, we have about 30,000 messages in departures and 30,000 in arrivals as well as about 11,000 flight plan creation messages. So what we’re gonna do is look where we have information for all of them, so we can start understanding the delays. So how the schedule departure and arrivals compared to the actual arrival and departures.
We can see here, the flight plan tables. So you can easily see that we have all of the information available to us. Now, when we join this information together with flight plan, with the actual arrivals and departures, we can calculate delays on arrivals and departures. And when we start looking at it and visualizing that data, we can start understanding where there are larger delays and arrival and where are we landing early. So you can see that between San Juan and Memphis, there were larger on average arrival delays during the four days of data that are taken. This data was taken between May 25th and May 28th, non-inclusive of 24 hours. Within the BTS environment, this data is running 24 hours a day, seven days a week. So they are consuming all of the data on terabyte or petabyte scale and the same notebooks are running there. So these are actually consuming a very large scale of data. For the sake of this demonstration, I’m showing a small portion of it. Again, we can start even doing more analytics by calculating distance between airports and start looking at what about the elapsed flight time compared to what the flight plan said the elapsed flight time should be. Let’s start getting metrics as to where are we overcompensating for the flight plan and where are we not giving enough time for that flight to happen? And so we can begin to start kind of correlating these to potentially weather patterns or wind patterns. And if we keep going, we can pivot and start looking at cancellation messages because within the TFMS data, there are 20 different message types. So we have a lot of data at our fingertips. So in this case, we have 1,265 recorded cancellations of flights within those four days and those hours that we have data for, for those four days in this case. If we enrich it with that flight, with the airline and flight and flight and airport information, we can get a little bit of our, start diving into the data and understanding that American airlines seem to have canceled the most flights during those four days. We can also do analytics and see what routes were canceled the most. In this case, Albuquerque to Phoenix was canceled 56 times during those four days. So maybe we start looking at weather patterns and maybe there were big storms in that area. So that’s something that we can start diving into and analyzing why these cancellations are occurring. The same time by looking at American airlines, we can start doing some sort of analysis on where are these cancellations are occurring. So these are different departure airports from American airlines flights. We can see that from Miami, most flights were canceled. Well, Miami is American airlines hub. So that’s not necessarily too surprising, but where are these flights going that they’re being canceled? So if we look into what’s landing in Miami, or what’s departing from Miami, we can see that the flights going to Havana Cuba, that four, or that excuse me, eight were canceled during that time. So from this, we can see that by using Spark-XML to take these XML messages and flatten them out, we can really start diving into the root cause of why different delays and cancellations are happening, bringing in different streams of data. We can start to do predictive analytics. We can look at seasonal flights and see how the seasonal flights are changing in terms of cancellations in different areas of the US, I mean start diving in more, thank you. So we learned a lot of lessons along the way with implementing the data processing and looking at the XML messages coming in. The first is, the Spark-XML package is very powerful and that it’s improving and actually doing this project with BTS, we were able to actually improve that Spark-XML package to deal with these very complex and nested schemas. And currently we’re, we’re still looking into ways that we can improve it even more. So this is a constantly evolving package that’s becoming more and more usable. Next is to do the schema validation, we actually ran into every XML message that we were loading up the different scheme of files to, and the nest is schema. We’re bringing it in from storage, opening them, and then validating our data against it. So what we did to mitigate any file IO latency was actually take that schema and copy it out, teach it to the executor’s within the cluster. And what that allowed us to do is have that schema in memory cache, so that way we could easily and very quickly validate against the schema. So we went from validating on the order of minutes, to actually validating in just seconds.
The last thing we learned is that with the XML messages coming in, that if we infer that complex schema off of those messages as they come in, well, not all messages are as complex as they can be. And so if you’re inferring against a small set of those messages, could be that when a very complicated message comes in, you could get a schema mismatch and have issues with that. So what we found is based on requirements, that we could batch process on the order of an hour and be able to account for all of those complexities within the schema itself of these XML messages.
So where are we going from here? Well, the first thing we’d like to do, is take that data that we went through and processed and validate it against the data that’s being provided by the airlines.
So when the airlines give the data to BTS, they’re able to look at it and we can look at it against the SWIM data to see if the same information is available through these SWIM data messages. Also, we’d like to take a deeper dive into actually advanced analytics and look at predictive modeling, understanding the seasonal issues and flight delays and how that’s going to affect passengers. Many of the use cases that Mehdi alluded to earlier in this presentation. And lastly, we’d like to open up this pipeline and allow more SWIM data feeds. So not just TFMS, open it up to ST, DDS and the other streams coming in that we saw on the earlier slides. So lastly, I would like to say that if you’d like to contact us and learn any more about the project or talk with us more, our contact information is here and available for you. And very lastly, I’d like to thank the US Department of Transportation, as well as the Bureau of Transportation Statistics, and especially Mehdi for allowing us to collaborate.
Marcelo Zambrana is a cloud solution architect at Microsoft focused on infrastructure automation and application development. He has been helping federal, state, and private industry customers for multiple years, including banking and healthcare. Marcelo specializes in enterprise hybrid architecture, cloud migration, cloud service models (PaaS, IaaS, and SaaS), automation, infrastructure and compliance as a code, and configuration management. In short: big DevOps fun with a little bit of an OCD problem on the details.
Sheila Stewart is a Solutions Architect at Databricks focusing on federal civilian customers. She has a BS in Physics from the California Institute of Technology and an MS in Nuclear Engineering and Radiological Sciences from the University of Michigan and has over 15 years of data analysis and algorithmic development experience in feature extraction and classification for Department of Defense customers.
Mehdi Hashemipour is a Data Scientist at USDOT. He has a Ph.D. in Systems Engineering focused on Artificial intelligence from George Washington University, where he is now an Adjunct Professor at the School of Engineering and Applied Science. Mehdi has over 10 years’ experience in Building Machine Learning, Deep Learning, Simulation Modeling, and Decision Support Systems.