Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of the data. In order to move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack, including joined datasets for hydration, feature extraction code, detection logic, and ML model development and training.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
Jeshua Bratman: Hi, everyone. Thanks for coming to our talk. We’re talking about continuous integration and delivery of machine learning for advanced email attack detection. I’m Jeshua Bratman. I’m the head of machine learning for a company called Abnormal Security. We focus on stopping advanced email attacks. Giving the presentation with me is Justin Young, who’s on our data platform team.
So, what is the type of problem that Abnormal Security is trying to prevent? What I have here is an example of a type of advanced email attack that we try to stop, what’s called invoice payment fraud. In this attack, you see someone posing as Josephine Wright, sending an email from a power company to someone in the accounting department of a hospital, asking them to pay $883,000, with a routing number and account number attached. This is fraud. It is actually going to send money to the attacker’s bank account, and they’re going to run away with the money. As you see, this is a pure-text cyber attack, pure social engineering. Tackling a problem like this really does require machine learning. And this is just one example of the type of attack that Abnormal Security tries to prevent.
There’s a whole range of email attacks, and I have a kind of taxonomy here. It runs from annoyance emails like spam and scams, to very advanced phishing, spear phishing, and malware attacks that in recent years have been doing horrible things like shutting down hospitals and making them pay ransoms before unlocking their computer systems. Spear phishing tries to steal credentials from organizations, sometimes for criminal organizations, sometimes for espionage groups. And on the right-hand side of the taxonomy are the very rare and advanced invoice fraud attacks like the one I just showed you.
And these attacks are often very rare. So for example, the very advanced attacks can be as rare as one in 10 million, one in 100 million, or even less in terms of the number of emails. Most emails are legitimate emails. Some of these emails are these cyber attacks. And one thing to note is if you don’t know much about the cybersecurity industry, in fact, something over 85% of all cyber attacks initiate from email. So this is a big deal. Every organization in the world is susceptible to this. And as more and more people do work from home, there’s more and more of these attacks coming all the time. So it’s an important problem and it’s a hard machine learning problem to solve. And there’s a lot of reasons why it’s hard, but number one is the rarity of the attacks, as I already mentioned. Stopping something that’s one in 100 million is very difficult to build a high precision detection around.
Another really important thing about this problem that’s different from many other machine learning problems is that it’s adversarial. Attackers aren’t just going to sit around trying the same attack over and over if it fails. They will be constantly adapting their attacks, both to trick people who are being trained to identify these, and to get around detection engines like Abnormal Security’s. We’ve even seen software available on the dark web that lets attackers run their own machine learning algorithms to constantly modify their attacks in ways that get around detection engines. This makes it very challenging, because we can’t just build a model once and be done.
Additionally, it’s a very high dimensional problem. There’s tons and tons of data available in even a single decision, all the text in the email, the headers, the past communication patterns, images, links, attachments, et cetera, et cetera, et cetera, and a very high data volume. We have millions of emails we’re trying to monitor. And then we need an extremely high precision and recall at the end of the day. Every false negative means an attack that gets through. Every false positive means a possible important email that gets stopped.
How do we do this, and how does it relate to the topic of this talk? We have a strategy for going after it. We have to move fast. We have to have a lightning speed of iteration to get ahead of new attacks. But we have to move fast without breaking things. Breaking things means our system no longer catches attacks that we previously caught, which could open an organization up to a very damaging breach. So, unlike the old adage “move fast and break things,” which came out of an early consumer-facing company, for this cybersecurity product it’s very, very important not to break things. And developing a machine learning product with this “move fast, don’t break things” idea is difficult. The way we solve it, or really help ourselves solve it, is with a continuous integration and delivery system for our entire machine learning detection engine. I’ll talk about what that means.
This is not something people usually talk about. What is CICD for ML? What does that mean? Well, let’s take a step back. What is a CICD system, and what is it for? Traditional CICD, just for code, involves an engineer modifying some code, making a change to a repository, pushing that, getting a code review on it (hopefully), and sending it into a CICD system. That CICD system runs tests. It may be more complicated than this, but at a high level, if the tests pass, we land and deploy the code. This should be familiar to most of you.
The question I’ll first ask is: this is something that’s pretty much always adopted, but what if you didn’t use it at a software organization? What would happen if we did not have a CICD system? Well, we’d have no idea whether a submitted code change would break the system. There’d be nothing to catch it. We’d rely on engineers to test their own code before deploying it. Engineers would end up fixing each other’s bugs all the time. We’d end up pushing bad code to our production systems and likely have outages. And really, I’m probably preaching to the choir here: in modern software development, it would be insane not to have a CICD system.
Okay, so that’s just for code. The whole point of this talk is CICD for an entire machine learning stack, but what does that even mean? Well, for a machine learning stack, ML engineers are modifying a lot more than just code, right? They’re modifying code, they’re probably deploying new models, and they’re also introducing new data sets. The data sets are pieces of data that will be picked up and put into features, which then go into models, which then change the decision the system comes out with. Because remember, at the end of the day, in Abnormal Security’s case, this machine learning engine takes an email and produces a detection of whether it is a cyber attack or not.
So when the ML engineer modifies any one of these three things, we need to do more than just run tests. In fact, it’s hard to build unit tests over all these models and data sets. So we do want tests, but we also want rescoring analytics. Rescoring analytics will ask, is the performance of the system still good? Are we stopping all the attacks that we used to stop? Are we stopping these new attacks that maybe have come in and our new improvements are meant to detect? And additionally, can we continue to train models? We need to make sure that the new code and models and data sets are still able to produce the features that will go into model training. These are the three pillars of the tests that need to be run for a machine learning system before we decide that it’s good to ship out to production. But this is a lot harder to do than just a traditional CICD system.
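To make the rescoring idea concrete, here is a minimal, hypothetical sketch (not Abnormal’s actual code) of a rescoring gate: re-run a candidate version of the detection engine over labeled historical samples, compute precision and recall, and refuse to ship if either metric regresses versus the current production baseline. The function names, threshold, and tolerance are all illustrative assumptions.

```python
# Hypothetical rescoring gate for a CICD pipeline: compare a candidate
# scorer against the production baseline on labeled historical samples.

def precision_recall(scorer, labeled_samples, threshold=0.5):
    """Score every (sample, is_attack) pair and compute precision/recall."""
    tp = fp = fn = 0
    for sample, is_attack in labeled_samples:
        flagged = scorer(sample) >= threshold
        if flagged and is_attack:
            tp += 1
        elif flagged and not is_attack:
            fp += 1
        elif not flagged and is_attack:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

def rescoring_gate(candidate, baseline, labeled_samples, max_drop=0.005):
    """Fail the pipeline if the candidate regresses beyond a tolerance."""
    cand_p, cand_r = precision_recall(candidate, labeled_samples)
    base_p, base_r = precision_recall(baseline, labeled_samples)
    return cand_p >= base_p - max_drop and cand_r >= base_r - max_drop
```

In a real pipeline the scorer would be the full detection engine and the corpus would be every labeled historical attack, but the gate logic is the same shape.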
But here’s the same question I asked before: what if we don’t have something like this? And this is the status quo most of the time. Well, we can’t change the system easily to fix a false negative or false positive with any confidence that something else doesn’t break. We might degrade the system unintentionally when we ship improvements, and this will hurt our customers and potentially leave them vulnerable. And we can’t know the overall impact of a new model on the system. The individual engineer developing a model might measure the impact of that model by itself, but knowing the impact on the entire system at once is difficult without automation.
And most ML products run completely blind like this. Probably many of you working on ML problems don’t have something like this, and I’ve definitely worked in situations where it doesn’t exist. It really slows down development speed and product stability, because you often ship out degradations, and it’s just very difficult to evaluate improvements to the system. So, first of all, we want this; that’s really the point of this slide. Now let me give you a better example of what this looks like in practice at Abnormal Security.
I talked about this invoice payment fraud earlier, and as I mentioned, this is an adversarial problem. Attackers are constantly developing new strategies. And one other strategy beyond this invoice payment fraud that we’ve seen is something on the right here. The attack on the right here is very similar, except instead of asking to pay an invoice, the attacker says, “Hey, just want to update you. We recently had to switch banks, long story, but our account number has changed for future invoices. See the attached document for updated banking details.” So rather than asking for an immediate payment, they’re asking to change the database of the account number so that a later payment will go to the fraudulent account. This is what we call a billing account update fraud.
So if our ML engineers are trying to improve the system to detect this, the first thing they’re going to do is look at the missed attack and develop the system in various ways to improve it. What I’m trying to emphasize here is all of the different parts of the system that might be improved in a situation like this. So let’s walk through this particular example. Notice that the sender domain of this attack is edisonpovver.com, spelled with P-O-V-V-E-R. This is a common approach called a lookalike domain: the attacker has registered a domain that looks like edisonpower.com, but it’s not.
Maybe to improve our detection of this new attack, we build a counting feature that counts how often the sender, in this case Edison Power, uses a particular domain to send their invoice conversations. So we’re counting up how often this has happened in the past, and we may find that edisonpovver.com has never been seen before. The ML engineer is going to put this into features, write code for the new feature extractors, build a database that keeps track of all of these counts, and then probably incorporate the feature into a model.
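A minimal sketch of what such a counting feature might look like, assuming a simple in-memory store keyed by (sender, domain); the class and method names are hypothetical, not Abnormal’s real API. The point is that a never-before-seen lookalike domain for a known sender yields a count of zero, a strong signal for the model.

```python
from collections import defaultdict

# Illustrative counting feature: how often has this sender used this
# domain? A lookalike domain (e.g. "edisonpovver.com") yields zero.

class SenderDomainCounts:
    def __init__(self):
        self._counts = defaultdict(int)  # (sender, domain) -> count

    def record(self, sender, domain):
        """Update counts as historical messages are observed."""
        self._counts[(sender, domain)] += 1

    def extract_feature(self, sender, domain):
        """Feature value fed into the model: historical usage count."""
        return self._counts[(sender, domain)]
```

In production this store would be a real database fed by a data pipeline, and the count would typically be windowed (for example, over the last 30 days).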
Additionally, for the content of this message, maybe our NLP models didn’t pick up this conversation about switching bank account numbers, so maybe we need a new NLP model that identifies this kind of banking-details change. And maybe we don’t parse PDFs, let’s say, so we may need to put new code and a new system in place to download and parse PDFs and extract bank account numbers from them. So, anyway, there are lots of changes that may go into improving the system in this particular case, and you can translate this to other machine learning problems when you’re trying to improve against a false negative. As I mentioned before, the ML engineer is modifying three things: code, to improve the detection engine and add features; new models, like the new NLP models being pushed out; and new data sets, like the counting feature data set. All of these feed into the ML detection engine in various ways. We may have new detectors built on top of these models, or new features that feed into the existing models.
The CICD system is going to run the ML detection engine over all past labeled examples, produce rescoring analytics that tell us the precision, recall, and other metrics we want to know, and also produce the feature sets needed for model training, with all the new features generated. These are the two core things that the CICD system does. This is not easy to do, and we want to make sure it is effective. So the first requirement for a good system like this is that it must be accurate: the rescoring analytics it produces have to reflect the performance that the new version of the system will actually have in production. We have to trust it.
Also, the training data produced on the old samples has to include all of the new features in an unbiased way. Those features need to be as they would appear in production, including dealing with the time travel problem to avoid future leakage, which means information from the future cannot feed into past samples. We’ll talk about this a bit more in the second half of the talk. Another important requirement is that it has to lead to ML engineering effectiveness.
So what do I mean by that? I mean the jobs of the ML engineers on the team need to be easy to do. It needs to be easy for them to run the CICD system so they can quickly evaluate and retrain experiments. They also need to be able to very easily make these modifications: new models, new data sets, new features. We don’t want them to have to build complex integration tests and complex data pipelines just to test their changes.
So those are the requirements for a good CICD system. I’m going to hand it off to Justin on our data platform team to talk about how we actually built that here at Abnormal.
Justin Young: Thanks, Jesh. I’m Justin, I’m a software engineer on our data platform team at Abnormal Security. In this next part of the talk, I’m going to be talking about how we actually built the system in a way that enables our developers to be as effective as possible, and also scales in a way that lets us meet all those requirements that Jesh just mentioned.
So the first thing to note about this problem is that it’s really a big data problem. So in traditional CICD, for the most part, you’re just changing code. So it’s possible to mock out certain dependencies that allow you to really simplify your tests and just test the code that you’re changing at a given time.
But as we just heard in CICD for an ML platform, we’re changing three things anytime we add a new feature. We’re changing the code, but we’re also changing the models and the data sets that we’re using in our detection engine. So if we want to do a really proper end-to-end integration test, it’s necessary, not only to test the code, it’s also necessary to test our models and our full data sets. Because all three of those components are really part of the logic in the software system that we’re testing.
But there are at least a couple of things in this situation that constitute a big data problem. The first is the historical samples that we’re evaluating. As Jesh mentioned, those samples are really rare and precious, so we can’t throw any away; we have to test every single sample when we evaluate our ML detection engine. The second piece of big data is the data sets that we want to join in. There are potentially dozens of them, ranging from megabytes to terabytes, and as discussed, they are part of the logic we’re testing, so they really can’t be mocked out or watered down.
And so when we have a big data problem like this at Abnormal, our tool of choice to solve it is Apache Spark, but it turns out that things actually get really complicated quickly. Our plan here is to build an offline version of our online detection engine in a series of Spark jobs, but the data in particular is something that’s pretty complicated to get just right.
So if we go back to our example with the billing account update fraud attack, let’s take just a single feature, counting emails from this new lookalike domain. A data engineer might look at this new kind of attack and think: okay, a way we can teach our ML models to recognize this kind of attack in the future is to count all the emails we’ve ever seen from each of a given sender’s domains. What we should end up seeing is that Josephine Wright from the domain edisonpower.com, spelled correctly, maybe appears 1,000 times in the last 30 days, while the incorrectly spelled domain has probably never been seen in the last 30 days. And that should indicate that something abnormal is happening.
So this sounds like a really good feature, but how does the data scientist actually get it into production? Well, they’re going to change all three of those components that we just described. They’re first going to build up this data set of these domain counts, they’re then going to have to build some feature extraction code to take these counts and turn them into an input that the models can actually use during scoring. And finally, they’re going to have to build, or rather train, either a new sub model, a new version of an existing model, or just generally change our model stack to incorporate this new feature.
But as we discussed, it’s not as simple as traditional CICD, where you can just mock out some dependencies and add a unit test or even an integration test, because the data and the models really can’t be mocked out. So the question now is: is this data scientist going to have to do a bunch of data engineering work to add these new features to the offline Spark job that simulates our ML detection engine? Maybe that’s fine. Let’s actually walk through an example of what that would look like in Spark to see if it’s reasonable.
So in Spark, there are a few different ways you can add data sets to a Spark job. The first two you see here, SparkFiles and broadcast variables, are two different flavors of making data available to the code, the logic, in a Spark job. SparkFiles downloads data to disk on every executor; broadcast variables make it available in memory. The really important thing here is that this is pretty simple to use. It’s really just a few lines of code to add one of these kinds of variables to your Spark job, so this is actually a pretty reasonable thing to ask a data scientist to use when they’re adding a new feature to the rescoring pipeline.
But one problem with these two flavors of adding new data sets is that they really only work well, at least in our experience in our environment, when the data set is smaller than about 100 megabytes. Once it gets larger than that, the job becomes pretty inefficient, and we find that we have to use a proper Spark join to join in the data. So okay, maybe we’ll try the Spark join, but there’s actually one other problem to think about: the time travel problem, which is fairly well known in the machine learning ops community. This was actually a problem with all of the kinds of data set joins we were going to have to do, but it definitely becomes a lot more complicated when we do a full Spark join.
And the time travel problem is the fact that when you’re backtesting a sample, you have to provide it the features exactly as it would have gotten them at scoring time. If we go back to our Edison Power example, imagine we just took the latest value of that feature, the last 30 days as of today. That might work, but imagine that domain was turned off and stopped sending emails. It was really important that we saw 1,000 samples in the 30 days before scoring time, so if we just take the latest value and it has gone to zero, then we lose information we would have had at scoring time, and our evaluation is not going to be correct.
So, okay, how do we actually solve this? In this part I’ll walk through what it looks like for the data engineer to solve this time travel problem and do the full Spark join, basically starting from scratch with just Spark code. At the top, you see basically our raw data for this domain count data set: daily counts per sender domain, for each day. The blue indicates maybe our common sender domain, and the yellow indicates our uncommon, misspelled domain. We said we want 30-day sums at each point in time, so the first thing we’ll do is that 30-day sliding sum, so that we have a value that is the proper feature we want every sample to see.
We then have our events. Similarly, we’re going to first key them by that same value, the blue and yellow indicating the different domains, and also bucket them by time so that we can join on the right time component. After doing that to both sides of our join, we have common keys: the time component and the domain component. We actually do the join, and then finally apply a hydration function to insert the count into our message, for use later on by our feature extraction code and the models that score it.
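The steps just described can be sketched in plain Python (not actual Spark; in production each step would be a distributed transform and the final step a real join). All names here are illustrative assumptions, but the shape is the same: compute the 30-day sum ending at each message’s day, look it up by (domain, day), and hydrate it into the message, so no future data ever leaks into a past sample.

```python
# Plain-Python sketch of the point-in-time ("time travel"-safe) join:
# each message sees only the 30-day count ending on its own day.

def count_30d(daily_counts, domain, day):
    """30-day sliding sum ending at `day` for `domain`.
    daily_counts maps (domain, day) -> count for that single day."""
    return sum(daily_counts.get((domain, d), 0)
               for d in range(day - 29, day + 1))

def hydrate(message, count):
    """Hydration function: insert the joined count into the message."""
    hydrated = dict(message)
    hydrated["sender_domain_count_30d"] = count
    return hydrated

def point_in_time_join(messages, daily_counts):
    """Join each historical message to the feature value it would have
    seen at scoring time, avoiding future leakage."""
    return [hydrate(msg, count_30d(daily_counts, msg["domain"], msg["day"]))
            for msg in messages]
```

The key property is that the lookup is keyed by the message’s own day, never by “now,” which is exactly what the bucketed Spark join achieves at scale.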
So, okay, that already got a little complicated. What you see here is actually a very simplified version of the code we use to do this at Abnormal, and it even gets a bit more complicated than the example I just walked through. The reason is that one of the requirements for our rescoring pipeline was that it be very efficient, so that it allows our engineers to iterate very quickly. And it turns out the example I just walked through doesn’t scale very well when the samples on one side of your join are very large. So we actually have to do something even more complicated, where we first join by an event ID and then join in the full message later on. The details aren’t super important. The main takeaway is that the code you’re seeing here is way, way more complicated than we should be asking a data scientist or an ML engineer to write every single time they want to add one new feature to our data set.
So if we go back to the reason we’re doing all of this in the first place, it was really to enable our data scientists to iterate on new features quickly, make sure they’re catching novel kinds of attacks, and also not degrade the performance of our system on all of our prior known attacks. If this is the story we’re asking our data scientists to go through every time they want to add a new feature, it’s really going to hinder that speed of iteration. It’s going to be very slow to add new features, and we’re not going to catch these new kinds of attacks very effectively. So data engineers have to go to great lengths to hide all these complicated details and provide a simple platform that does all of this under the hood, so that data scientists can do data science as much of the time as possible.
What that means in practice is that the data platform team really has to provide a very simple playbook for adding new features to our rescoring system, and it should be as simple as adding a unit test in a traditional CICD system. Here we have a flow chart of the different ways you can join in new kinds of data. As we saw, that works okay for SparkFiles and broadcast variables, but Spark joins are still really complicated: it’s not only the time travel problem, you also have to write a bunch of complicated Spark code.
The way you should solve that problem as a data engineer is to provide a very simple interface that takes all of those details, keeps them under the hood, and only asks the data scientist or ML engineer to provide a few simple functions that are actually pertinent to the new feature they’re adding. This is very similar to the interface we use at Abnormal. You can see there’s a function that asks for the set of keys we were talking about, and a function that asks how to hydrate a feature once it’s joined in. Those are pretty reasonable things to ask: they’re the things the data scientist working with the feature actually knows about.
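A hypothetical sketch of what such a narrow interface could look like (the class and function names are illustrative, not Abnormal’s real API): the data scientist implements only the key and hydration functions, and the platform-side driver performs the join, here reduced to a dict lookup for clarity where production would run the time-aware Spark join.

```python
from abc import ABC, abstractmethod

# Illustrative narrow interface: the data scientist fills in two small
# functions; the platform owns the complicated join machinery.

class DatasetJoinSpec(ABC):
    @abstractmethod
    def join_keys(self, message):
        """Return the key(s) used to look up this message in the dataset."""

    @abstractmethod
    def hydrate(self, message, value):
        """Insert the joined value into the message for feature extraction."""

class DomainCountJoin(DatasetJoinSpec):
    def join_keys(self, message):
        return (message["sender_domain"], message["day"])

    def hydrate(self, message, value):
        out = dict(message)
        out["domain_count_30d"] = value
        return out

def run_join(spec, messages, dataset):
    """Platform-side driver: in production this would be the bucketed,
    time-aware Spark join; here a dict lookup stands in for it."""
    return [spec.hydrate(m, dataset.get(spec.join_keys(m), 0))
            for m in messages]
```

The design choice is the usual one for platforms: keep the user-facing surface down to the functions only the feature owner can write, and keep everything that is the same for every feature behind the driver.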
And so if we go back to the set of requirements we had for good CICD for an ML platform, we really wanted our ML engineers to be able to move as quickly as possible while still rescoring and making sure that we’re not degrading our system. And so as the data engineer, the job that you have to do is first provide this very simple API that just works. It hides all these details under the hood.
Secondly, you have to make the system run really efficiently, because there are two use cases for this kind of integration test. The first is very similar to an integration test in traditional CICD: you want to run it on a regular cadence, so that at any point in time, as changes come into your data sets, your code, and your models, you can make sure your system is still performing well. The second is an ad hoc use case, where an ML engineer wants to test out a new feature. You should provide a system that very quickly lets them run a kind of offline A/B test to see how the new feature affects the metrics they care about. And with that, I’ll hand it back to Jesh for some closing thoughts.
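The ad hoc “offline A/B test” use case can be sketched as follows, assuming hypothetical function names: score the same labeled corpus with the baseline and the candidate engine, then report the per-metric deltas the engineer cares about.

```python
# Illustrative offline A/B comparison: same labeled corpus, two scorers,
# report the change in confusion-matrix counts.

def score_corpus(scorer, labeled_samples, threshold=0.5):
    """Count true/false positives and negatives over a labeled corpus."""
    tp = fp = fn = tn = 0
    for sample, is_attack in labeled_samples:
        flagged = scorer(sample) >= threshold
        tp += flagged and is_attack
        fp += flagged and not is_attack
        fn += (not flagged) and is_attack
        tn += (not flagged) and not is_attack
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def offline_ab_report(baseline, candidate, labeled_samples):
    """Per-metric deltas: positive tp delta and negative fn delta mean
    the candidate catches attacks the baseline missed."""
    base = score_corpus(baseline, labeled_samples)
    cand = score_corpus(candidate, labeled_samples)
    return {k: cand[k] - base[k] for k in base}
```

This is the same machinery as the scheduled regression run, just invoked on demand against an engineer’s experimental change.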
Jeshua Bratman: Thanks, Justin. Now that we have built this whole thing, I’m going to flip around the question I asked earlier. Earlier I asked: what if we don’t have something like this for machine learning? But what if we do? What does it change about the work you do? Well, it lets you iterate quickly. It lets you know if you break things. It lets you train models on all your old examples. And you’re going to have a better, more flexible product. You’ll be able to address customer requests more quickly. If a customer says, “Hey, you missed this false negative,” you can actually go and build a fix for it however you want in the system, and be confident that you didn’t break detection of some other previous attack or some other example you want to classify.
Also, you’ll be able to support a much larger team of ML engineers, all working in parallel, because they can all be confident that their changes will be integrated and tested in a way that securely ensures the performance of the whole system. You can have anything from an ML engineer working on feature requests to a researcher trying some crazy new experimental model, and all of these things can be supported in the same manner.
The main takeaway I really want to give you from this talk is, whatever you work on, if it’s a cybersecurity problem or something else, if machine learning is a core part of the performance of your product, I highly recommend investing in something like this as early as possible, because it’s going to open up so many doors and put you on such a good trajectory. And if you already do have something like this, just continue to invest in it, continue to make it better and make it easier for ML engineers to use.
Thank you. If you’re interested in any of this, of course, we’re hiring. And with that, thanks, and we’re happy to answer any questions.
Jeshua has over 10 years of experience in machine learning, and is an expert in building technology products powered by artificial intelligence. He’s a founding member of Abnormal Security and head ...
Justin Young is a software engineer on the Abnormal Detection Systems team developing data platform for running threat detection products. Previously to joining Abnormal, Justin worked on the revenue ...