Effective AIOps with Open Source Software in a Week

May 27, 2021 03:50 PM (PT)

Classic event, incident, problem and change management are ITSM practices that are getting integrated with DevOps/SRE and ML through a competency known as AIOps. Large streams of data generated through logs, metrics and traces are organized and computed using machine learning algorithms to extract insights on the anomalies of system behavior that could be impacting end-users and business transactions. Businesses cannot afford to see their end-users impacted by those anomalies and therefore would want to proactively predict the likelihood of systems regressing and take corrective action long before any material impact.

In this talk, we show the use of simple linear regression and multivariate linear regression techniques to predict the likelihood of system behavior deviating by one or two sigma (standard deviations). We show how to use FOSS tools to predict these deviations using various decision trees that are integrated with high-performing streaming platforms like Apache Flink and Apache Beam, together with Prometheus and Grafana, which make it a lot easier to visualize the various alerts and triage back to the root cause. These high-performing systems are also backed by Kafka for its streaming and distributed computing capabilities, partitioning the data for staged analysis, some of which can be done in parallel depending on the use case. We present a fully integrated architecture that helps you realize a commercial AIOps capability without having to license expensive software products. This open architecture allows you to implement various ML algorithms as needed, and it’s agnostic to programming languages and tools.

The talk will combine various techniques with demos and is aimed at practicing engineers and developers who are familiar with ML.

In this session watch:
Keith Andre Kroculick, Principal Architect, Wells Fargo
Murali Kaundinya, Executive (VP, GM), Wells Fargo



Murali Kaundiny…: Hi there. My name is Murali Kaundinya. I’m here with Keith Kroculick to talk to you about effective AIOps using AI and ML. We built a platform using open-source software. We want to share our experiences with you so you can do the same. We ran a hackathon within team Wells Fargo, and we had our colleagues [inaudible], Keith Kroculick, [inaudible], [inaudible], [inaudible], and myself. We built this platform that we are calling our own AIOps platform, and we had a lot of fun building it using open-source software. So we’d like to share our experiences on how we went about building it.
So the agenda that we’ve got today is we’ll give you a quick overview of what we’re going to talk about. We’ll chat a little bit about our perspective on what AIOps is, and we’ll show you all the open-source components that make up an AIOps platform. We’ll talk about the operational events, how the legacy constructs work in the large enterprise, and how you might want to think about the new way of developing this capability using predictive modeling analytics. We’ll talk about the data collection and the visualization. With that, I’m going to turn it over to Keith who’s going to take you through the remainder of the presentation. Keith?

Keith Kroculick: Hi, good afternoon. My name is Keith Kroculick, principal architect for Wells Fargo. I wanted to discuss a little bit about AIOps and just provide a quick overview. AIOps is the evolution of operations and uses artificial intelligence and machine learning technologies to see and predict the health of a company’s mission-critical systems. AIOps tools harvest data, such as logs, alerts, traces, and other data from applications, servers, and infrastructure, to build data sets for analytics and event correlation, and to find relationships between events. AIOps technologies can show the graph of dependencies across all environments, as well as analyze the data to extract significant events related to slowdowns, outages, or other problems. Notifications and alerts can be generated to inform IT staff and predict problems, provide root cause analysis, and recommend solutions. For our hackathon, we decided to architect and design a solution for AIOps.
We needed to architect a solution that could be used to rapidly solve operational issues with our mission-critical systems. The solution that we designed was tailored to solve company-specific Ops-related problems. Almost all COTS AIOps solutions need to be tailored and modified regardless of the vendor. Customizing a COTS AIOps product would take about the same effort as developing it yourself. With our own solution, the components themselves can be modified or replaced with other solutions, we maintain and own the source code, and we can allow other people to contribute to it.
Some of the technologies that we used in our hackathon AIOps platform were Apache Flink, which is a distributed stream processing framework; Python Keras, which is a deep learning framework; a MySQL database to maintain our data; and the Grafana observability platform for dashboards, reporting, and alerting. Optionally, we also implemented Prometheus for monitoring and as a time series database.
The operational events and data that we were ingesting for our hackathon project really were a large and varied set. So we had Ops data from applications, web servers, network devices, and other hardware. And that data in itself really was divided amongst logs, traces, alerts, configuration management, reference and lookup details, and summarized metrics. One of the challenges with current operations is the legacy operations platform architecture. Most pre-AIOps architecture is mainly built around logging and monitoring. So if you look at this diagram, you’ll notice that most applications are instrumented with agents like Splunk, Elasticsearch, or things like Netcool. Each one of those applications, whether it is running on bare metal or in a VM, has that agent installed, and they all send logs simultaneously. So if you look at the right-hand side, we really have a chaotic, point-to-point environment. And we have a lot of replication and duplication of data, as well as of the agents and software deployed for Ops in general.
So with that being said, we decided to look at designing a new AIOps platform. So this is the AIOps platform Open Architecture. The framework of choice that we decided to use was Apache Flink. Apache Flink, as I mentioned, is a distributed data-processing platform. From left to right, you’ll notice that we have different event sources, so we can take all of these different events, all these log files, all these traces, all these alerts, and we can stream them in or batch them, or use Apache Flink’s large ecosystem of connectors, and pull that data into a Flink job. That Flink job in itself can summarize, it can map relationships, it can add other data sources to your data, it can aggregate, and it can do a lot of other data manipulation on that set of data. The data itself can then be provided to an AI model. So we can take that data set and use machine learning algorithms on it to predict things like applications failing, or predict things like outages and anything that might cause an application to perform badly.
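As a rough illustration of the summarizing step described above, the sketch below counts events per fixed time window in plain Python. It does not use the Flink or PyFlink API; the field names (`ts`, `host`, `level`) and the 60-second window are assumptions made for the example, not details from the talk.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, host, level).

    `events` is an iterable of dicts with the hypothetical keys
    'ts' (epoch seconds), 'host', and 'level'.
    """
    counts = Counter()
    for event in events:
        # Align each timestamp to the start of its fixed-size window.
        window_start = int(event["ts"]) // window_seconds * window_seconds
        counts[(window_start, event["host"], event["level"])] += 1
    return dict(counts)


# Made-up sample events for illustration only.
sample_events = [
    {"ts": 0, "host": "web01", "level": "ERROR"},
    {"ts": 30, "host": "web01", "level": "ERROR"},
    {"ts": 59, "host": "web01", "level": "INFO"},
    {"ts": 61, "host": "web01", "level": "ERROR"},
    {"ts": 75, "host": "db01", "level": "WARN"},
]
window_counts = tumbling_window_counts(sample_events)
```

In a streaming framework the same grouping would typically be expressed as a keyed window aggregation, so the job never has to hold the full event history in memory.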
After we have done the aggregation and we’ve run it through the model, we can actually send those results to a MySQL database, or we can send them to other destinations. Ultimately, we want to put that data in a place where we can conduct operational BI. The operational BI platform that we’re using is Grafana.
In terms of the events that we ingest, one of the things that is required is to be able to correlate all those events. So all the different data sources that we ingest in our Flink job (app logs, server logs, alerts, configuration management data) need to be joined to create a unified data set. So as you can see here, in this example, for app logs we’ve got things like app name, log level, log message, host name, IP address, timestamp. Same thing in the server logs: we’ve got the server log message, log level, host, high processed data center, host name. And same thing for alerts: alert message, actions, host name, IP, timestamp; and for configuration items: host name, address. All of those data elements need to be combined into a unified data set for us to be able to process and pass to our machine learning models.
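The join described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the speakers’ Flink job: it assumes host name is the join key, and the function names, field names, and sample records are all hypothetical.

```python
def index_by_host(records):
    """Group a list of event dicts by their 'host_name' field."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["host_name"], []).append(record)
    return grouped


def unify_events(app_logs, server_logs, alerts, config_items):
    """Return one unified row per app log, enriched by host name."""
    server_by_host = index_by_host(server_logs)
    alerts_by_host = index_by_host(alerts)
    config_by_host = {item["host_name"]: item for item in config_items}
    unified = []
    for log in app_logs:
        host = log["host_name"]
        unified.append({
            **log,
            "server_log_count": len(server_by_host.get(host, [])),
            "alert_count": len(alerts_by_host.get(host, [])),
            "address": config_by_host.get(host, {}).get("address"),
        })
    return unified


# Made-up sample records for illustration only.
app_logs = [{"app_name": "payments", "log_level": "ERROR",
             "log_message": "timeout", "host_name": "web01", "timestamp": 100}]
server_logs = [{"host_name": "web01", "log_level": "WARN",
                "log_message": "high load", "data_center": "east"}]
alerts = [{"host_name": "web01", "alert_message": "cpu high", "timestamp": 95},
          {"host_name": "db01", "alert_message": "disk full", "timestamp": 97}]
config_items = [{"host_name": "web01", "address": "10.0.0.5"}]

unified = unify_events(app_logs, server_logs, alerts, config_items)
```

A real pipeline would usually also bound the join by time, so that an alert is only correlated with logs from roughly the same period.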
So in terms of AIOps and machine learning, we tend to use machine learning in AIOps to do correlation and predictive analytics. So as I mentioned, events from applications, servers, and infrastructure combine to create a view of the health of a platform at any point in time. So that’s effectively the graph of events. Our event sources (application logs, server logs, traces) are all combined into individual datasets that can be used in our machine learning models. And so those datasets in this case can be broken up into smaller pieces. So we can create training, validation, and test data sets, depending on the machine learning model. In this case, linear regression. We use Apache Flink stateful functions with PyFlink and the Python Keras and TensorFlow libraries to apply our ML models to process our combined Ops dataset. We like to start with a simple linear regression approach to modeling the relationships between our applications, our middleware, and servers, really based on the events we ingest. Linear regression, in terms of machine learning algorithms, is really the simplest starting point to use for predicting behaviors.
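To make the simple linear regression idea concrete, here is a minimal sketch that fits a line by ordinary least squares and then expresses a new observation’s deviation in sigmas (standard deviations of the training residuals). It is written in plain Python rather than Keras, and the variable meanings are assumptions for illustration only.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_sigma(xs, ys, slope, intercept):
    """Population standard deviation of the training residuals."""
    return pstdev(y - (slope * x + intercept) for x, y in zip(xs, ys))


def deviation_in_sigmas(x, y, slope, intercept, sigma):
    """How many sigmas the observation (x, y) lies from the fitted line."""
    return (y - (slope * x + intercept)) / sigma


# Made-up training data: x could be request rate, y response time.
train_x = [0, 1, 2, 3]
train_y = [1, 1, 5, 5]
slope, intercept = fit_line(train_x, train_y)
sigma = residual_sigma(train_x, train_y, slope, intercept)

# Score a new observation that is far above the fitted trend.
score = deviation_in_sigmas(4, 20, slope, intercept, sigma)
is_anomalous = abs(score) > 2  # two-sigma threshold, an assumed choice
```

The two-sigma cut-off here is only one possible alerting rule; a stricter or looser threshold trades false alarms against missed regressions.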
So a little bit more about the machine learning and predictive analytics. We use Python Keras, which is just a library that uses Google TensorFlow behind the scenes. And we actually just use simple linear regression logic in our Apache Flink application. So in this case, we’re using a supervised learning model. So the steps when building a machine learning model are pretty straightforward. In this case, we’re using linear regression, so it’s pretty simple. We want to create a combined data set and that data set really is used for three things. We want a training data set to be created initially, a validation data set, and a test data set. The training data set and the validation data set really are meant to be used to build a model and be able to take our Ops data and start modifying the model so that we can predict better outcomes.
So the Ops validation dataset in itself can be used to provide feedback on the model built from the training dataset. With that being said, after we’re done creating and validating the Ops model, the final output can be compared against a test data set, and we can repeat the process as necessary. So the process in itself is really to create a first dataset, the training set, as your first pass, validate it, and test it again. And then finally, when you get to a point where you’re comfortable, run it against a test data set, and that should give you a hardened machine learning model that you can start using for predictive analytics.
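The three-way split described above might look like the following sketch. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for the example, not values given in the talk.

```python
import random


def split_dataset(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    Whatever is left after the train and validation shares becomes
    the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train_set, validation_set, test_set


# Placeholder rows standing in for the combined Ops dataset.
ops_rows = list(range(100))
train_set, validation_set, test_set = split_dataset(ops_rows)
```

For event data ordered by time, a chronological split (oldest rows for training, newest for testing) is often a safer choice than shuffling, because it avoids training on events that happened after the ones being predicted.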
So after our predictive analytics have been completed, we’ve got the output of our machine learning models. We need a place and a way to provide dashboards, reporting, visualization, and alerting capabilities. And so in this case we’re using Grafana. Grafana is a pretty ubiquitous product. Grafana is used for dashboards, visualizations, and reporting, and it can be used for alerting and several other BI needs. The data that is produced by our Apache Flink jobs is summarized, and we want to run that data through our ML model and then load it into a simple MySQL database. In the MySQL database, we have aggregated metrics, basically using SQL views and simple SQL statements: GROUP BY, COUNT, MAX, AVG. And we expose that to Grafana using Grafana’s MySQL connector.
The database in itself just uses a very simple Ops event schema. So just a handful of tables to hold that data. And we really just want the ability to have a place to put data in so that we can do basic aggregations and then we can create queries to show our relationships amongst the applications, our app components, the servers, and other details in our dataset. And Grafana itself allows us really the ability to create quick visualizations to display our data. So we can use things like histograms, pie charts, scatterplots, and several other visualizations for our data.
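A minimal sketch of such an aggregated view is shown below. To keep it self-contained it uses Python’s built-in `sqlite3` module as a stand-in for MySQL, and the table, view, and column names are hypothetical; the actual hackathon schema is not given in the talk.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL Ops database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ops_event (
    id          INTEGER PRIMARY KEY,
    data_center TEXT    NOT NULL,
    host_name   TEXT    NOT NULL,
    impact      INTEGER NOT NULL
);

-- One row per data center, ready to be charted by a dashboard.
CREATE VIEW incidents_per_data_center AS
SELECT data_center,
       COUNT(*)    AS incidents,
       MAX(impact) AS max_impact,
       AVG(impact) AS avg_impact
FROM ops_event
GROUP BY data_center;
""")

# Made-up events for illustration only.
conn.executemany(
    "INSERT INTO ops_event (data_center, host_name, impact) VALUES (?, ?, ?)",
    [("east", "web01", 3), ("east", "web02", 1), ("west", "db01", 2)],
)

rows = conn.execute(
    "SELECT data_center, incidents, max_impact, avg_impact "
    "FROM incidents_per_data_center ORDER BY data_center"
).fetchall()
conn.close()
```

Keeping the aggregation in a view means the dashboard query stays a plain `SELECT`, and the grouping logic can change without touching the dashboard.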
You can see here’s a sample dashboard. So this is really the output and a high level dashboard to show the end user just a little bit about our data. So you can see here, we’ve got a tabular view, and really this is incidents for a data center. You can see on the right-hand side that we have aggregations. So we have the number of incidents per data center, and then below that, you can actually see events by impact, and we’ve got them grouped by data center and by the impact value.
So we use MySQL and Grafana; however, that’s just a simple way to do this. If you were actually to take this to a production environment, you’d probably use something like Prometheus and a time series database. Like I mentioned, we just use MySQL to house the data for our AIOps solution, just due to its ease of use and availability. A time series database is probably a better fit for reporting on operational events, especially when they come in in real time.
The aggregated metrics that we used for Grafana (SQL views, GROUP BY, COUNT, MAX) can also be used in a time series database. Apache Flink itself can also write directly to Prometheus using the Prometheus reporter library. And then Prometheus itself actually has a native MySQL exporter that can use the existing MySQL database and not require us to change any application logic. So effectively we can just go ahead and export our data out from our MySQL database to our Prometheus database.
So in closing, we discussed legacy operations, our new AIOps solution, building a solution with open-source tools, frameworks and libraries, using event correlation and machine learning, building dashboards and reporting, and just a high level overview of AIOps in general. And I just want to say thank you to the Databricks team for providing us the opportunity to speak about AIOps. And if you have any comments or questions, you can reach us at our respective emails. Thank you.

Keith Andre Kroculick

Experienced program architect and technical director with over 20 years of experience with software development, complex application integration and program management for multiple industries includin...

Murali Kaundinya

Murali Kaundinya is a senior strategist with technology and architecture with extensive leadership and management consulting experience. He has served in a leadership role conceiving, executing and de...