At Avast we complete over 17 million phishing detections a day, providing crucial online protection against this type of attack.
In this talk Joao Da Silva and Yury Kasimov will present the MATS stack for productionisation of Machine Learning and their journey into integrating model tracking, storage, cross-system orchestration and model deployments for a complete and modern machine learning pipeline.
One can integrate the MATS stack into an existing ecosystem without disruption; there is no need to suddenly migrate everything to a clean AWS setup.
The MATS stack consists of adopting MLflow, Airflow, TensorFlow and Spark to form a cross-system orchestrated ML pipeline from a standard set of well-integrated tools that data scientists at Avast can adopt.
They will use Angler, an internal machine learning project for detecting phishing URLs, to demonstrate how the MATS stack was leveraged for this ML pipeline, walking the audience through all stages of the Angler pipeline: data transformations and enrichments in Spark, training of models, experiment tracking, and serving of the models. The pipeline enables fast and reproducible experiments and allows a fast progression from research to production.
Speakers: Joao Da Silva and Yury Kasimov
– Hello, today we will be talking about the MATS Stack: MLflow, Airflow, TensorFlow and Spark for cross-system orchestration of machine learning pipelines. My name is Yury Kasimov and I am a data engineer at Avast, and with me is my colleague, Joao Da Silva. He is a lead data engineer at Avast. We will walk you through our project for building ML pipelines. First we will introduce our case study, then we will talk about the problems we are trying to solve and the goals we are trying to achieve. Then Joao will talk about our technical solution and the challenges we have faced along the way. And we will end on a high note with the successes of our project. At Avast we provide security for our customers, and we are using machine learning to improve our existing systems. This case study begins when Tomas, a researcher, was telling Joao and me about his new project, Angler. The goal of this project was to use machine learning to improve the already existing phishing detection of malicious URLs. We have URLs stored in Hadoop, and Tomas creates a dataset out of these URLs. He sends requests to the other systems which detect phishing at Avast, and he uses these detections as labels. Then he combines the whole dataset with his labels and uploads the data to another cluster, which has GPUs, and he trains a convolutional neural network on this cluster. Once the model is trained, Tomas has to move it to the last cluster, where the model is served: it receives requests with URLs and responds with whether each is phishing or not. After Tomas described his project, another researcher joined us and talked about his projects and the issues he faces, and then yet another researcher joined and described his issues. And we started to see the common problems that all of these researchers are facing.
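As a rough illustration of the kind of model described here, the following is a minimal character-level convolutional network for URL classification in TensorFlow/Keras. The architecture, layer sizes, URL length and encoding are assumptions for the sketch, not Avast's actual Angler model.

```python
import tensorflow as tf

MAX_URL_LEN = 200   # pad/truncate URLs to this many characters (assumed)
VOCAB_SIZE = 128    # ASCII code points (assumed)

def build_model() -> tf.keras.Model:
    """A small char-level CNN: embed characters, convolve, pool, classify."""
    inputs = tf.keras.Input(shape=(MAX_URL_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, 32)(inputs)
    x = tf.keras.layers.Conv1D(64, 5, activation="relu")(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    # Single sigmoid output: probability that the URL is phishing.
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def encode_url(url: str) -> list:
    """Map a URL to a fixed-length list of character codes, zero-padded."""
    codes = [min(ord(c), VOCAB_SIZE - 1) for c in url[:MAX_URL_LEN]]
    return codes + [0] * (MAX_URL_LEN - len(codes))

model = build_model()
```

Training would then feed batches of `encode_url` outputs with the phishing/clean labels collected from the detection systems.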
The first issue that all of the researchers are facing is that there is a lot of duplicated effort between different teams and people. Everyone has their own batch scripts to move data, to preprocess data, to move models around the different clusters. This creates a huge mess, and a lot of wasted effort. Another issue we identified was that there is no overview of the different experiments in one place. Everyone saves the results of experiments and the parameters in text files or TensorBoard on different machines, but you cannot just go to one table and see these parameters, these results, and this model. Another issue was that there is no automated process for moving from experimenting with data and models to production. Once you train your model, you want to deploy it to bring some value, and this is quite a challenge. The last issue we identified was scaling and monitoring of the deployed models. You can deploy the model on a server below your desk, but then how do you scale? Or you can deploy it on a cluster, but then the data distribution changes. You need to know about these events as fast as possible and take action. So we got a little bit overwhelmed by all these issues, but then we remembered that the problem is not the problem; the problem is your attitude about the problem. So we started to tackle these issues one by one, and we came up with this plan. First, we wanted to define a common ground for engineers and AI researchers. By common ground I mean defining a common structure of Git repositories, how a project is split into stages and how those stages are deployed into production, and also creating a set of common tools so they can be reused as much as possible. The next thing we wanted to address was orchestration and scheduling. At Avast we have multiple clusters: we have data clusters, we have a computational cluster with GPUs, and we have a Kubernetes cluster.
And some projects have to be scheduled on different clusters: data preparation on the data clusters, training on the GPU cluster, and then deployment in Kubernetes. This required orchestrating tasks across multiple clusters, which is challenging on its own. The next thing we wanted to solve was to have one place for structured tracking of experiments. We wanted one table with all the metrics, all the results and the artifacts used by the pipeline. We also wanted to be able to reproduce results easily, so we should keep track of all parameters, and we wanted to make iteration faster, so that researchers can just change one file, commit it, and the whole pipeline reruns. The last goal was to automate deployment as much as possible. When you deploy a model, you want to make sure that it runs on the same version of the code as training, that it uses the same processing steps, and that it does not require a lot of manual work. So, just to recap: we wanted to have a common ground for different teams, to have a nice and clear way of tracking experiments, to orchestrate tasks across different clusters, and to automate model serving. And now I will pass over to Joao, who will talk about our technical solution.
– Thanks Yury, thanks for the excellent introduction and the background of where we started and what our plan is. Now that we have a plan, we have to come up with solutions; we have to spark a rebellion and come up with results. So how do we do that? First, we needed the common ground Yury mentioned. How do we organize a machine learning project, a machine learning pipeline? We started by dividing the problem into three sections: Lifecycle, Design and Structure. The project lifecycle consists of five steps, the first being conception and grooming, where someone comes to our team and explains what they or their project needs. We study it, evaluate the needs, and then move to the design if necessary. The design step I will talk about on the next slide. Once we have a design and everything is set up, we start with implementation, which means we code everything and the teams collaborate, and at the end we have a model deployed in a test environment. Once it has been agreed that the results are acceptable, we do the integration, where we do a production deployment and set up monitoring. Then we do the final delivery, with the important last step being the retrospective, where we actually evaluate how we did, what tools we used and how we organized the work. Was it positive? How can we improve on the next project? The design stage starts with the data. The data is where we collect the features, where we engineer the data from several clusters, where we organize it, and so on. Once we have that set up, with all the boilerplate ready, the researchers build their own pipeline. Then the model learning stages come in, where the training scripts prepared by the researchers are executed and the experiments are registered. The important part of it is the goal.
How do we create continuous deployment? How do we have structure and organization in our code? Is it easy for everybody to pick up any project? Right? So one of the things we mentioned when discussing the structure of the project is having standard repository templates, where everyone who joins an ML project can immediately identify where to look for something, where to go for the specific parts they are responsible for, how the documentation is structured, how it integrates with the deployment, and so on. Next, we have the goal of having a standard for machine learning development at Avast. How do we get there? Once we have this structure, this design and this lifecycle, we just need to wrap them into a framework and create these standards for development at Avast. Talking about that, we also need the tooling, and in this day and age, where thousands of machine learning tools and frameworks are available on the market, we really cannot adopt all of them, and we cannot be experts in all of them. So we decided to adopt what we call the MATS Stack: MLflow, Airflow, TensorFlow and Spark. Why we chose those tools is what I am going to present next. We chose MLflow because we needed experiment tracking and model management. But why MLflow, with the other things on the market? First, it is open source. It is easy to track the experiments. It has a clean, rich API; the REST API, the Python API and the CLI client are really easy to use. On top of that, the biggest advantage we found is that model packaging, storage and version management are already prepared for us in a standard way. We just need to take that and prepare deployments based on those packages and wrappers. I will show a quick example of what our Angler phishing classifier looks like.
I removed some of the other pipelines to reduce clutter, but as you can see, before we had no place where we could visualize this; it was spread across the company. Now with MLflow, we are able to just look at a single dashboard and see it. Continuing, we have some metrics; in MLflow you can just visualize your metrics. On top of that, you can also quickly see your artifacts: what was logged, what your model looks like, how big it is, where you can get it. And with one click of a button, you can register your model for management. So now you have a registered model. You can look at its versions, the versions are updated automatically, you can trigger the deployments, you can change staging to production, and so on and so forth, all in one package. So, we are talking about machine learning pipelines, right? Airflow gives us the option to trigger the tasks of the pipeline in different environments, as we previously mentioned; for instance, we need to trigger tasks in multiple environments. So why Airflow? The main reason is that it has a message-driven architecture. We can have Airflow workers spread across the infrastructure, across networks, internal network versus external network, in the cloud or on premises; as long as these workers can reach the message queue, you can target them for task execution. On top of that, it is Python. We all love Python, and that means it is extensible, because you really can do anything with Python, right? I also really like the fact that it does not have much boilerplate if you really look at it: you can use the templating, you can create your own connections and leverage default arguments for the boilerplate parts. So Airflow gives us a good option. So, for the Angler phishing classifier pipeline we started with nothing. And what did Airflow give us?
It gives us the fact that we can do our data collection on our Spark and HDFS cluster. Moving along, we also need to wait a little bit, until the data is ready as it should be, before we collect the labels. So we execute this delay sensor on a Kubernetes cluster, because that is where we deploy the Airflow UI and scheduler, so it is close to us. Then we can switch back and run the rest of our stages on a YARN cluster, the Spark cluster. Why is that good? Again, you can just leverage multiple clusters and multiple networks. And finally, we can tell our DAG: wait a minute, we are going to train a model, just execute on our GPU cluster, these GPU machines, and we are done with it. So Airflow is a really, really important tool in this MATS Stack; it allows us to do the distributed, cross-system orchestration that we mentioned. TensorFlow is another element of the stack. Why? It is high performance for training, and it has several advantages that we believe in, such as TFRecords, a very good serialization format, protobuf-based, with good compression ratios. We also have TensorFlow Serving, the C++ server, which is highly performant compared with Gunicorn in most cases. You also get automatic reloading of models. Everything is provided for you as long as you are in the TensorFlow ecosystem, which brings me to the next point: it is a rich ecosystem. It gives us many things that we need, or might not need at all, but the fact that you can use TFRecords, TensorFlow Transform, TensorFlow Serving, TFX and all the ecosystem around it makes it an excellent choice, and we recommend it to our users. Spark, of course, is the king of distributed big data processing, and this is a Databricks conference. Indeed, there would be no other way for us to prepare the data.
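To make the TFRecords point concrete, here is a small sketch of serializing labeled URLs into a GZIP-compressed TFRecord file and reading them back with `tf.data`. The feature names, file path and example URLs are illustrative, not Angler's real schema.

```python
import tensorflow as tf

PATH = "urls.tfrecord"  # illustrative output path

def url_example(url: str, label: int) -> bytes:
    """Pack one (url, label) pair into a serialized tf.train.Example."""
    feature = {
        "url": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[url.encode()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Write two toy records with GZIP compression.
with tf.io.TFRecordWriter(PATH, options="GZIP") as writer:
    writer.write(url_example("http://example.com/login", 0))
    writer.write(url_example("http://paypa1-secure.example", 1))

# Read them back as a tf.data pipeline, ready to feed training.
spec = {
    "url": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
dataset = tf.data.TFRecordDataset(PATH, compression_type="GZIP")
parsed = dataset.map(lambda rec: tf.io.parse_single_example(rec, spec))
```

The same file format is what the Spark stages can produce upstream, so training just points `tf.data` at the output directory.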
It is also important to mention that we have extensive knowledge of and experience with Spark at Avast, so it is a total no-brainer for us to use it for features and to make it part of this stack. Really, it is the king of big data, right? So that covers the core of the presentation: we introduced you to the MATS Stack, why we have it, and the steps that led us there. Now we have solutions, we built the stack, but while building it we found some challenges, and I would like to introduce you to some of them. For instance, when using MLflow, there is no way, when a researcher changes the status of a model in the registry, for us engineers to be notified about that transition and to trigger a deployment of the model. So the continuous deployment link is somewhat broken there. The good thing is that MLflow is open source; you can look at the issues and the discussion, and in fact this is something that is already being worked on with high priority. One other thing that we would like to see in place would be TensorFlow model serving support, meaning that when MLflow logs a TensorFlow model, it does not use the protobuf-based format which is required by TensorFlow Serving. But again, it is open source, it is discussed, and it already has a high priority, from what I understood of the discussion. Finally, I would just like to point out that Airflow provides us with a good distribution of workload across tasks, and good management, but it also comes with a certain complexity of setup for its deployment. Also security: you need to be sure about what you do regarding impersonation and so forth. And we did find some quirks, which could just be normal software quirks, or it could be that we are not yet a hundred percent familiar with it. But not everything is lost. We actually recognize some of our own successes, and I would like to point out a few of those successes from adopting this stack.
The first is that this Angler phishing URL classifier pipeline has actually been delivered, and we learned a lot along the way, which allows us to progress further by establishing processes for faster productization. Now people at Avast can come to us, we provide this process, this framework and this stack, and we can get a model from the research machines into production, into a servable format, more quickly. Another success, we consider, is the interest from other teams in adopting our solution. Once we presented this internally, we immediately got interest, and we had many researchers coming to us, asking questions and inquiring how they can join us and use our stack. Again, I would like to praise this MATS Stack: it is in fact a good success point that we focused on a small set of tools rather than the whole broad landscape, and it gives us a good, solid starting point for faster productization of machine learning models at Avast. So this concludes the talk, but before I finish I would like to thank Tomas Trnka, who was the researcher and creator of the Angler phishing classifier; he gave us this powerful starting point and use case. I would also like to mention our manager, Vojtech Tuma, who gave us guidance and support. Whenever we had ideas on how to make things better, he always gave us his support, and with enthusiasm, which is always good and necessary. Finally, I would like to say thank you to our colleagues and to all of those attending this presentation. If you have any problems, please reach out. If you have any questions, please reach out. We are on Twitter, Joao Da Silva and Yury Kasimov. Thank you for attending this presentation. For any questions, we will be available in the chat rooms.