Scaling Security Threat Detection with Apache Spark and Databricks

Apple must detect a wide variety of security threats, and rises to the challenge using Apache Spark across a diverse pool of telemetry. This talk covers some of the home-grown solutions we’ve built to address complications of scale:

  • Notebook-based testing CI – Previously we had a hybrid development model for Structured Streaming jobs wherein most code would be written and tested inside of notebooks, but unit tests required export of the notebook into a user’s IDE along with JSON sample files to be executed by a local SparkSession. We’ve deployed a novel CI solution leveraging the Databricks Jobs API that executes the notebooks on a real cluster using sample files in DBFS. When coupled with our new test-generation library, we’ve seen a two-thirds reduction in the amount of time required for testing and 85% fewer lines of code.
  • Self-Tuning Alerts – Apple has a team of security analysts triaging the alerts generated by our detection rules. They annotate them as either ‘False Positive’ or ‘True Positive’ following the results of their analysis. We’ve incorporated this feedback into our Structured Streaming pipeline, so the system automatically learns from consensus and adjusts future behavior. This helps us amplify the signal from the rest of the noise.
  • Automated Investigations – There are some standard questions an analyst might ask when triaging an alert, like: what does this system usually do, where is it, and who uses it? Using ODBC and the Workspace API, we’ve been able to templatize many investigations and in some cases automate the entire process up to and including incident containment.
  • DetectionKit – We’ve written a custom SDK to formalize the configuration and testing of jobs, including some interesting features such as modular pre/post processor transform functions, and a stream-compatible exclusion mechanism using foreachBatch.

Video Transcript

– Hello and welcome. I hope everyone’s having a good Spark Summit so far. I’m Josh Gillner with Apple’s Detection Engineering Team. In this session, we’ll be taking a look at threat detection using Spark and Databricks, with a specific focus on some of the neat tricks we’ve developed to overcome the challenges of scale. So let’s start with who my team is and what we do. At a high level, Apple’s Detection Engineering Team consumes the telemetry that’s emitted by systems across our corporate infrastructure: everything from host-level log data to network sockets to application events. Our job is to sift through the data for patterns which could indicate malicious activity, like malware or hackers on the network. It can be pretty challenging because we’re looking for people who are trying to hide their behavior and blend in with normal usage of the data, but that’s what makes it fun.

Which Technologies?

To do this, we use many different technologies. Some off the shelf and some custom developed. These are the ones that we’ll be focusing in on a little bit more in this talk.

Before we dive into details, it’s important to level set on some of the terminology we’ll use. We’ll start by looking at what we call a detection or the most basic unit of code for the building blocks of our monitoring program.

A detection is a piece of business logic that takes some data as input, applies any arbitrary number of transformations, pattern matching, or statistics, and then spits out a security alert. We encapsulate these as Scala classes, where generally there’s a one class to one notebook to one Databricks job relationship. Each of these can do pretty vastly different things, but they all generally conform to the same paradigm with the same components.

What Happens Next?

After one of these jobs has found something potentially malicious, the alert is fed into an orchestration system. We have a team of security analysts continuously triaging the alerts to figure out if they’re a real security issue or just something that looks suspicious but ultimately poses no risk. After the analysts finish analyzing, the orchestration system has hooks into internal systems that can be used to respond to and contain a security issue, things like locking accounts or kicking machines off the network. In certain cases, the orchestration system may already have enough information to contain an issue without a human being involved. And we’ll take a look at that a bit more later.

Now that we’ve an understanding of the general flow of those detections, let’s take a look at some of the problems that came up as we continue writing more and more of these detection jobs.

Problem #1— Development Overhead

The first was an unjustifiable degree of overhead in deploying new detections. Most of the code for basic IO and common transformations was exactly the same just reimplemented across many notebooks. And each of these jobs needed a corresponding test with people manually preserving Sample Files and writing Scala tests. This meant that we were unsustainably coming up with cool new ideas faster than we could turn them out.

Problem #2– Mo Detections, Mo Problems

Other problems started to surface only after we had already deployed a bunch of these jobs. Changing the behavior of all or a subset of the detections required a massive refactor across all these different notebooks even if the feature you wanted was comparatively minor. Each job needs some degree of ongoing maintenance and performance tuning, which can get pretty tedious with all these disparate configs and logic.

Problem #3 – No Support for Common Patterns

There are some common patterns people tend to use in these detections, things like enrichments or statistical baselines that compare the novelty of current activity against historical data. But what we saw was that without any primitives at their disposal, everyone had their own special way of solving the same problems. And when we needed to update or fix them, we would have to figure out each implementation. To really take our detections to the next level, we needed to formalize how they’re written and make better use of our limited human capacity. So we started a journey to find a solution to these. We talked with a lot of industry partners and looked at some third-party products, but were really unsatisfied because we couldn’t find anything that suited our needs.

So we spent a few months and arrived at a major breakthrough: an entirely new SDK for security detection called DetectionKit. This framework dramatically reduces the complexity of running new detections and helps us react faster by improving the quality and automating the investigations.


There are way too many features to dive into here, but I’ll highlight a few of them.

Let’s start with inputs since they’re the first step in any detection. The datasets loaded by these jobs need to change situationally. In production they’ll want a Hive table name, and in the context of a functional test, they’ll want the path of a preserved sample file in DBFS. This means that the options have to be externalized and passed in from outside of the class through config. But we took it a step further, building a switch between spark.read and spark.readStream with an externally provided partition filter. It allows us to change from a streaming to a batch detection on demand without changing the class itself.
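A minimal sketch of such a switch, written in Python for brevity, might look like the following. `InputConfig`, `load_input`, and all field names are hypothetical and not part of DetectionKit; `spark` is assumed to be a SparkSession (or anything exposing the same `read` / `readStream` attributes).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class InputConfig:
    """Hypothetical input options supplied from outside the detection class."""
    source: str                              # Hive table name, or a DBFS path to sample files
    is_table: bool = True                    # False when `source` is a file path
    streaming: bool = False                  # switches spark.read <-> spark.readStream
    file_format: str = "json"                # only used for file sources
    partition_filter: Optional[str] = None   # SQL predicate, e.g. "date >= '2020-06-01'"


def load_input(spark: Any, cfg: InputConfig) -> Any:
    """Return a batch or streaming DataFrame depending only on the config."""
    reader = spark.readStream if cfg.streaming else spark.read
    if cfg.is_table:
        df = reader.table(cfg.source)
    else:
        df = reader.format(cfg.file_format).load(cfg.source)
    if cfg.partition_filter:
        df = df.where(cfg.partition_filter)   # externally provided filter
    return df
```

In real Spark, a streaming file source also needs an explicit schema, and reading a table as a stream requires a Spark version that supports `readStream.table`; both details are left out of the sketch.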

In most cases, a detection acts like a bucket containing different pieces of logic which all look for different aspects of the same general high-level attack. In terms of the code structure, we like to say that there are multiple alerts in a detection, where each alert is described as a Spark DataFrame resulting from some transform on the input. The detection class creates and exposes a key-value map of all the alert names to their respective DataFrames, which can be consumed either by other detections or by the post-processing methods that are applied before an event is sent for the analyst to review.

Much like we externalize control of detection input, emitters do the same thing but for output. In production, we send alerts to AWS Kinesis, but other situations may call for writing to disk or an in-memory table. An emitter wraps all of these output parameters, like a target Kinesis queue, file paths, or stream options. Much like inputs, emitters are also provided in config, so you can easily change the output behavior of the detection without modifying the class.

Config Inference

There are a lot of small details required by these jobs that are pretty annoying to provide like the streaming checkpoint paths or the scheduler pools. We found that most of them can be inferred by convention using parameters that are already in the config object with optional overrides only if people want to specify them manually. This minimizes the amount of required fields and simplifies our config structure.

A detection can contain multiple related but independent alerts, and most of the time, you’d want them to be configured in the same way: things like using the same emitter or inferring the parameters by the same convention. By default, they will inherit the configuration of the parent detection, but remain individually configurable should you want granular control.
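The inference and inheritance rules above could be sketched roughly as follows. This is an illustration only: the class names, the checkpoint path convention, and the scheduler pool naming are assumptions, not DetectionKit's actual config structure.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class AlertConfig:
    name: str
    emitter: Optional[str] = None          # None means "inherit from the detection"
    checkpoint_path: Optional[str] = None  # None means "infer by convention"
    scheduler_pool: Optional[str] = None   # None means "infer by convention"


@dataclass(frozen=True)
class DetectionConfig:
    name: str
    emitter: str = "kinesis"
    checkpoint_root: str = "dbfs:/checkpoints"   # hypothetical convention root
    alerts: Dict[str, AlertConfig] = field(default_factory=dict)

    def resolve(self, alert_name: str) -> AlertConfig:
        """Fill every unset alert field from the parent or by convention."""
        alert = self.alerts.get(alert_name, AlertConfig(alert_name))
        return replace(
            alert,
            emitter=alert.emitter or self.emitter,
            checkpoint_path=alert.checkpoint_path
            or f"{self.checkpoint_root}/{self.name}/{alert_name}",
            scheduler_pool=alert.scheduler_pool or f"{self.name}.{alert_name}",
        )
```

An alert that sets nothing inherits everything, while an alert that overrides only its emitter still gets an inferred checkpoint path and pool.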

Modular Pre/PostProcessing

We can think of a detection as a sequence of data frame transformations. The first transform in this sequence or what we like to call the preprocessor is a method that’s supplied in config to be applied before the core detection logic. This is an ideal place to inject certain types of operations like partition filters that’ll determine how much of the input data will be processed, or the pushdown predicate on a stream to tell it where to start.

After all the data has been processed and we’ve yielded a set of suspicious events, there may be some additional enrichments to apply right before they’re emitted, many of which are common amongst multiple jobs. To make this easier, we’ve broken out some common transformations into reusable modules, which are defined in a mutable Scala list and applied sequentially at the end of the detection. Finally, there may be certain types of operations which are not natively supported by the Structured Streaming APIs. For those, we can specify a transform function to be applied within a foreachBatch. And we’ll talk about a specific example of this in more detail later.
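The preprocessor, core logic, and postprocessor sequence can be pictured as plain function composition. The sketch below uses lists of dicts as a stand-in for Spark DataFrames so that it runs anywhere; the three example transforms are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]          # stand-in for a Spark DataFrame
Transform = Callable[[Rows], Rows]


def run_detection(rows: Rows, pre: Transform, core: Transform,
                  post: List[Transform]) -> Rows:
    """Apply the preprocessor, the core logic, then each postprocessor in order."""
    return reduce(lambda acc, f: f(acc), [pre, core, *post], rows)


# Hypothetical modules, for illustration only.
def only_recent(rows: Rows) -> Rows:
    """Preprocessor: a partition-style filter limiting how much input is processed."""
    return [r for r in rows if r["date"] >= "2020-06-01"]


def shell_from_java(rows: Rows) -> Rows:
    """Core detection logic: a shell spawned by a Java process."""
    return [r for r in rows if r["parent"] == "java" and r["process"] == "bash"]


def add_severity(rows: Rows) -> Rows:
    """Reusable postprocessor: enrich each alert right before it is emitted."""
    return [{**r, "severity": "high"} for r in rows]
```

Because the pre and post steps are passed in rather than hard-coded, the same detection class can be run with a different filter or a different enrichment list purely through config.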

Manual Tuning Lifecycle detection

Detections are inherently imperfect. They’re designed to look for anomalies in datasets that are sometimes chock full of weird little edge cases and idiosyncrasies that can change over time and are really difficult to predict, even if you have all the historical data in the world. Actual humans review these events, so the signal-to-noise ratio is really important, and we continuously tweak the detections using these analysts’ suggestions. But doing them one by one takes a lot of time, and while they’re being worked on, we don’t have any cycles to write new detections.

Self-Tuning Alerts

When analysts triage alerts, they label them as either a false positive, if it found something that wasn’t a real security issue, or a true positive, if it’s actually malicious. This feedback is memorialized in a Delta table, where we’ve incorporated it back into the detection pipeline. Rather than us manually tweaking each of these detections, the system learns from the analysts’ consensus and automatically adjusts the future behavior. In this way, the bulk of this day-to-day tuning is self-service, and we saw about a 70% reduction in the total work volume when we deployed this, which was pretty nice.
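The talk does not spell out how consensus is computed. One simple way to picture it, purely as an assumption, is to suppress an (alert, entity) pair once enough analysts have labeled it and they all agree it is a false positive. The thresholds and the key choice below are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Label = Tuple[str, str, str]   # (alert_name, entity, verdict), verdict is "FP" or "TP"


def consensus_suppressions(labels: Iterable[Label], min_votes: int = 3,
                           min_fp_ratio: float = 1.0) -> Set[Tuple[str, str]]:
    """Return the (alert, entity) pairs that analysts consistently call false positive."""
    totals: Counter = Counter()
    false_positives: Counter = Counter()
    for alert, entity, verdict in labels:
        key = (alert, entity)
        totals[key] += 1
        if verdict == "FP":
            false_positives[key] += 1
    return {
        key for key, votes in totals.items()
        if votes >= min_votes and false_positives[key] / votes >= min_fp_ratio
    }
```

Requiring a minimum number of votes keeps a single hasty verdict from silencing an alert, and a single true positive is enough to block suppression at the default ratio of 1.0.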

Complex Exclusions

In some cases that are difficult for that automation to figure out, we may still have to do some degree of manual tuning. Each round of this tuning adds an exclusion to the detection logic until inevitably it looks like the top right-hand side there. You end up with these really ugly addendums that are almost as big as the detection logic itself: not this thing, not that thing, over and over and over again. To solve this, we built a new mechanism which breaks out a list of SQL expressions that are applied against all of our output in the foreachBatch transform at the very end. It allows us to be as complex as we want in the exclusion logic while still keeping the detection classes pretty clean. And because they’re integrated into our functional tests, we can ensure that overly selective or malformed SQL expressions are caught before they impact the production jobs. Although the events matching these expressions are excluded from what analysts see, we don’t just wanna drop them on the floor. All of the excluded events are written into their own table, so we can continuously monitor the number of records that are excluded by each expression individually.
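To make the exclusion mechanism concrete, here is a small sketch of what the per-batch step might do. It uses an in-memory SQLite table in place of a Spark micro-batch so the SQL expressions can actually be evaluated; the function name and the `excluded_by` column are assumptions.

```python
import sqlite3
from typing import Dict, List, Tuple

Event = Dict[str, str]


def apply_exclusions(events: List[Event],
                     exclusions: List[str]) -> Tuple[List[Event], List[Event]]:
    """Split one batch into (kept, excluded); excluded rows record the matching expression.

    The expressions are trusted config, which is why they are interpolated directly.
    A malformed expression raises sqlite3.OperationalError instead of passing silently.
    """
    if not events:
        return [], []
    cols = list(events[0])
    con = sqlite3.connect(":memory:")
    con.row_factory = sqlite3.Row
    con.execute("CREATE TABLE batch ({})".format(", ".join(f'"{c}"' for c in cols)))
    con.executemany(
        "INSERT INTO batch VALUES ({})".format(", ".join("?" for _ in cols)),
        [tuple(e[c] for c in cols) for e in events],
    )
    excluded: List[Event] = []
    for expr in exclusions:
        for row in con.execute(f"SELECT * FROM batch WHERE {expr}").fetchall():
            excluded.append({**dict(row), "excluded_by": expr})
        con.execute(f"DELETE FROM batch WHERE {expr}")   # first matching expression wins
    kept = [dict(row) for row in con.execute("SELECT * FROM batch")]
    con.close()
    return kept, excluded
```

Writing the `excluded` rows to their own table and grouping by `excluded_by` gives the per-expression counts described above.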

Repetitive Investigations…What Happens?

Let’s take a look at what happens during alert triage. Typically, there isn’t enough information inside the alert on its own to render a verdict, and analysts want to run some ad hoc queries in a notebook to gather some substantiating data. They’ll need to take some parameters from the payload, create a notebook and fill it out, and then sit around and wait for these queries to finish. A lot of the information they’re looking for doesn’t really change all that much from investigation to investigation: things like what happened immediately before and after the alert, and what does this machine or this account typically do?

Automated Investigation Templates

To help with this, we’ve used the Databricks Workspace API to templatize and automate the execution of investigation notebooks. Each detection has a corresponding template notebook in a specific directory, which the orchestration system clones and populates with information from the alert. And because many of the queries are the same between similar categories of detection, we can modularize the templates and reuse components with the %run magic command in notebooks.
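As a rough illustration of the clone-and-populate step, the sketch below fills a template with alert fields and builds the request body for the Workspace API import endpoint (`POST /api/2.0/workspace/import`). The `$placeholder` template syntax, the function name, and the target path are assumptions; the request is only constructed, not sent.

```python
import base64
from string import Template
from typing import Dict


def build_import_request(template_source: str, alert: Dict[str, str],
                         target_path: str) -> Dict[str, object]:
    """Populate a notebook template from an alert and wrap it as an import payload."""
    # Hypothetical convention: the template marks alert fields as $field_name.
    # substitute() raises KeyError if the alert lacks a field the template needs.
    source = Template(template_source).substitute(alert)
    return {
        "path": target_path,
        "format": "SOURCE",
        "language": "SCALA",
        "overwrite": False,
        "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
    }
```

The orchestration system would then post this body with its own credentials and trigger a run of the new notebook.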

This lets us automate useful things like…

This functionality arms our analysts with the information they’ll need to make faster decisions. And it also abstracts the complexity of some neat things that people wouldn’t usually do manually. On all the detections surrounding a suspicious process execution, some custom D3 will render an interactive process tree that lets you trace the lineage and its relationships. In this example, there was an interactive reverse shell following exploitation of a Java web server process. It would have been pretty difficult to figure this out if you were just looking at records in a table.

We’ve also taken some queries that would be executed in a template notebook and have the orchestration system run them via ODBC. Rather than a human interpreting the results in a notebook, the machine can evaluate the current activity against the historical baseline it collects and render a verdict. If it’s confident enough in the verdict, it can automatically contain the issue without a human even being involved.

Detection Testing

So you’d end up with these detections that test well, but don’t necessarily work in the real world. And since fully structured binary formats like Delta don’t play that well with revision control, we’d use semi-structured formats like JSON and worry about codifying the schemas into our test suite.

So we wrote a test generation library that wraps ScalaTest, and we can run the tests right inside notebooks. Using only the information in the config objects, it’ll create an assertion for each alert and infer the paths of the hits and samples by convention. So the only thing you need to worry about as the author of a test is writing your samples and hits to the right place. With this, we’ve seen a dramatic 85% reduction in the amount of code it takes to write a basic test for a detection.
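The idea of generating one assertion per alert from config alone can be sketched with Python's `unittest` in place of ScalaTest. The directory layout (`<root>/<detection>/<alert>/samples.json` and `hits.json`) and all names are assumptions made for the example.

```python
import unittest
from typing import Callable, Dict, List

Rows = List[dict]


def make_test_case(detection: str,
                   alerts: Dict[str, Callable[[Rows], Rows]],
                   load: Callable[[str], Rows],
                   root: str = "dbfs:/tests") -> type:
    """Build a unittest.TestCase class with one generated test method per alert."""
    def make_test(alert_name: str, transform: Callable[[Rows], Rows]):
        def test(self):
            # Paths are inferred by convention; the author only supplies the files.
            samples = load(f"{root}/{detection}/{alert_name}/samples.json")
            hits = load(f"{root}/{detection}/{alert_name}/hits.json")
            self.assertEqual(transform(samples), hits)
        return test

    methods = {f"test_{name}": make_test(name, fn) for name, fn in alerts.items()}
    return type(f"{detection}_Test", (unittest.TestCase,), methods)
```

The `load` callable is injected so the same generated tests can read from DBFS on a cluster or from a dictionary of fixtures locally.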

But we took it a step further and integrated the notebook-based tests into our CI. It’ll clone notebooks from the PR into a specific directory and execute them on a real cluster that returns a JSON object summarizing the test results via dbutils. Our CI system gathers these results and will pass or fail the build based on what happened remotely in the Databricks instance. It also frees us up to write detections that do things which wouldn’t normally work in an IDE, like %run imports of other notebook components.
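On the CI side, the pass/fail decision might reduce to something like the sketch below. It assumes the notebook ends with `dbutils.notebook.exit(<json string>)` and that CI has already fetched the run output from the Jobs API (`runs/get-output`); the shape of the summary JSON (`tests`, `name`, `status`) is hypothetical.

```python
import json
from typing import Any, Dict


def build_status(run_output: Dict[str, Any]) -> int:
    """Return a process exit code: 0 if every remote test passed, 1 otherwise."""
    state = run_output.get("metadata", {}).get("state", {})
    if state.get("result_state") != "SUCCESS":
        return 1                        # the notebook run itself failed
    raw = run_output.get("notebook_output", {}).get("result")
    if not raw:
        return 1                        # no summary came back; fail closed
    tests = json.loads(raw).get("tests", [])
    failed = [t for t in tests if t.get("status") != "passed"]
    return 1 if failed or not tests else 0
```

Failing closed when the summary is missing or empty avoids a green build for a notebook that never actually ran its tests.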

Jobs CI – Why?

Once these tests pass and the production notebook lands in Databricks, we still had to manually create and configure jobs that would execute the notebook tasks. Each one might need a specific cron expression or wanna run on a certain cluster, and all these details were typed in and maintained by hand. At a certain point, we had way too many of these detection jobs to continue managing them with the Databricks UI, and no real way of making bulk changes like moving a bunch of jobs between clusters.

But Databricks recently announced a really nifty beta feature called Stacks. It’s a dream come true for anyone who’s looking to build a job CI system, because it removes most of the complexity from maintaining inventory and state across the different Databricks APIs. You can package a job and all the resources it needs, like notebooks or DBFS files, into a tiny little config stanza, and it gets deployed all at once as a package.

Deploy/Reconfigure Jobs with Single PR

So we built a fully featured Jobs CI on top of the Stack CLI. All of our notebooks, files, and job parameters can now exist in Git, with CI doing the heavy lifting for job deployment. We lint the Stacks config objects and then pass them into the CLI, which creates or overwrites any of the resources as necessary. But for jobs specifically, there were some important finishing touches which aren’t currently covered by Stacks. We augmented the CLI with a couple of helper scripts that will kickstart newly created jobs or, if a job or its underlying resources have changed, restart them to accept the new config. Jobs CI was the last piece of the puzzle for us in automating every piece of the deployment, testing, and management of the detections from end to end.

Cool Things with Jobs CI!

So far, we’ve been able to do some pretty neat things with the Jobs CI that would have been a real pain to do with the UI: things like moving a bunch of jobs between clusters all at once, or having it restart all the jobs when we change a shared component. And it was an ideal place for us to enroll the newly created jobs into our metrics platform, so we get monitoring and alarms on these streams by default. The only caveat is that writing Stacks JSON is pretty tedious. So we wrote a CLI utility to generate it with a questionnaire.
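A generator for such a config stanza could be as small as the sketch below. The field layout follows the publicly documented Stack CLI beta format as best recalled (a `resources` list with `workspace` and `jobs` services), so treat the exact keys as an assumption to verify against the Databricks documentation; the function and its parameters are hypothetical.

```python
from typing import Any, Dict, Optional


def stack_config(name: str, notebook_source: str, notebook_path: str,
                 cluster_id: str, cron: Optional[str] = None,
                 timezone: str = "UTC") -> Dict[str, Any]:
    """Build one stack: a notebook plus the job that runs it."""
    job: Dict[str, Any] = {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if cron:
        job["schedule"] = {"quartz_cron_expression": cron, "timezone_id": timezone}
    return {
        "name": f"{name}-stack",
        "resources": [
            {
                "id": f"{name}-notebook",
                "service": "workspace",
                "properties": {
                    "source_path": notebook_source,   # file in the Git checkout
                    "path": notebook_path,            # destination in the workspace
                    "object_type": "NOTEBOOK",
                },
            },
            {"id": f"{name}-job", "service": "jobs", "properties": job},
        ],
    }
```

Serializing the result with `json.dumps(..., indent=2)` yields a file that can be committed alongside the notebook, and moving many jobs to another cluster becomes a one-parameter change.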

For the last thing we’ll discuss, I wanna take a step back from the detections and focus on some of the problems we’ve had in the triage space.

Problem #1 — Cyclical Investigations

If you look at the alert trends over time, you’ll find that most of the events day to day aren’t completely novel. It’s very likely that something, maybe not the exact thing but something similar, already happened in the past and was already triaged and investigated or closed out. Particularly for analyst teams that operate 24/7 across multiple shifts, it’s really difficult for any one person to remember all of this historical context. And it results in this waste of resources, cyclically reinvestigating the same thing over and over again.

Problem #2 — Disparate Context

The solution to this problem is research. If people read through all of the various tickets, wikis, and investigation notebooks, we could curtail a lot of the wasted cycles. But these datasets exist in many different places, each with their own search syntax and interface, and it’s unreasonable to expect analysts to have either read every document ever written or to do these searches during alert triage, when time is at a real premium.

Problem #3 — Finding Patterns

Because the data exists in disparate places, we can’t effectively mine it for insights. Over the course of many years, we can handle multiple incidents that are seemingly unrelated but actually share some common feature that’s too obscure for any human to see. Mapping the relationships between alerts or incidents requires not just intimate familiarity with what’s happened before, but also that the connection between them be so glaringly obvious that someone happens to notice.

So we built a solution to these problems that we call Doc Search. To start, we centralized and normalized all of the incident-related data. The text payloads of email correspondence, tickets, investigations, and wikis are all dumped into a Delta table, on top of which we built a document recommendation mechanism. Having a single place to search through all this knowledge was transformative on its own. But more importantly, it provided us with the means to programmatically leverage all the things we’ve ever learned to better inform what we do in the future.

“Has This Happened Before?” -> Automated

So what does this look like from an analyst’s perspective? We have code that runs in all of the investigation template notebooks that will take the alert payload and suggest potentially related documents in a pretty displayHTML table. It includes some pretty useful features, like the verdicts and analyst comments on past alerts, a matching term list showing where the document and the alert intersect, and clickable links into the system that it originally appeared in. For email correspondence, and this is pretty neat, when you click the link, it’ll open up in macOS to the specific thread where you can do some further reading.

Automated Suggestions

So let’s take a bird’s eye view of how this works. We’ll go into more detail on each one of these steps here in a little bit. The template notebook will tokenize and extract specific entity types from the alert payload. It runs those entities through an enrichment routine to ensure that every possible representation is covered before looking for occurrences in all these documents. Since we’re searching through unstructured blobs of text, Spark SQL’s like and contains operations can get pretty expensive. We use a more optimal concurrent string search algorithm called Aho-Corasick that prevents performance degradation as the term count increases. Depending on the input terms, there could be a ton of tangentially related documents we don’t really care about. So the hits are run through a scoring algorithm to compute their relevance, and only the most useful results are gonna be given to the analyst.
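For readers unfamiliar with Aho-Corasick, the following is a compact textbook implementation in plain Python. It is not the speaker's code; it only shows why one pass over a document can match every search term at once, so the cost per document does not grow with a separate scan per term.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class AhoCorasick:
    """Multi-pattern substring search: one scan of the text finds all terms."""

    def __init__(self, terms: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]   # trie edges per node
        self.fail: List[int] = [0]               # longest proper suffix link
        self.out: List[Set[str]] = [set()]       # terms ending at each node
        for term in terms:
            self._insert(term)
        self._link()

    def _insert(self, term: str) -> None:
        if not term:
            return
        node = 0
        for ch in term:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
            node = nxt
        self.out[node].add(term)

    def _link(self) -> None:
        queue = deque(self.goto[0].values())     # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] |= self.out[self.fail[child]]

    def search(self, text: str) -> Set[str]:
        found: Set[str] = set()
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found |= self.out[node]
        return found
```

In a Spark job, an automaton like this would typically be built once from the enriched term list and applied to each document's text inside a UDF.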

To understand why we need to bother with entity tokenization, let’s take a look at the structure of a typical alert. There’s a set of key-value pairs, and for the purposes of finding related documents, not all of them are created equal. Some of these strings, like dates or HTTP methods or ports, are gonna be found across thousands of different documents, and almost all of them aren’t gonna be related to each other. There are also some in here that are too selective, like a timestamp that would only appear in this specific alert and nowhere else. But there is a subset of tokens that describe different aspects of an entity, like the machine or accounts that are involved in the alert. And these are the ones that are valuable for correlation.

Entity Tokenization and Enrichment

So we use a suite of regular expressions to extract the common entity types we care about. But that’s not enough on its own. If you think about a physical machine like someone’s MacBook Pro, there are many different identifiers that could describe it. It has a serial number, some MAC addresses, a hostname, and potentially many different IP addresses over time if the system uses dynamic addressing like DHCP. Some documents might contain one or a couple of these, but never all of them in one place. So you’ll have documents all referring to the same machine, but each one uses a different attribute to describe it, which makes tying them together pretty difficult. To address this, we run the extracted entities through multiple enrichments, making sure that we’re searching for the superset of those identifiers, so you’ll find all the related documents regardless of which piece they contain.
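A toy version of extraction followed by enrichment might look like this. The two regular expressions and the `inventory` structure (one set of known identifiers per asset) are assumptions for the example; a real deployment would cover more entity types and resolve identifiers from asset and DHCP data.

```python
import re
from typing import Dict, List, Set

# Hypothetical patterns for two entity types.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
}


def extract_entities(payload: Dict[str, str]) -> Set[str]:
    """Pull entity-like tokens out of every value in the alert payload."""
    found: Set[str] = set()
    for value in payload.values():
        for pattern in PATTERNS.values():
            found.update(match.lower() for match in pattern.findall(str(value)))
    return found


def enrich(entities: Set[str], inventory: List[Set[str]]) -> Set[str]:
    """Expand to every identifier of any asset that shares one with the alert."""
    expanded = set(entities)
    for asset_ids in inventory:
        if asset_ids & entities:
            expanded |= asset_ids
    return expanded
```

Searching with the expanded set is what lets a ticket that mentions only a serial number surface for an alert that contains only an IP address.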

Suggestion Algorithm

After entity extraction, enrichment, and searching on the documents, the results are fed into a suggestion algorithm that will compute a term-wise relevance score. For each matching entity, we’ll look at the number of documents it appears in, over how long a period of time, and its distribution across the different types of documents. The terms are ordered by an average rank percentile of those features, such that the less common a term is across documents, the more valuable it’ll be as an indicator of relevance between them. Documents containing this subset of valuable terms are gonna be presented in order of how many they contain, with documents that have multiple hits going to the top of the list. And with that, that’s all the content I have.
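The scoring described in this section can be approximated with a short sketch. The three features follow the talk (document count, time span, spread across document types); the percentile definition, the `keep` cutoff, and the tuple layout of a hit are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, str, str, int]   # (term, doc_id, doc_type, day_number)


def _percentiles(values: Dict[str, float]) -> Dict[str, float]:
    """Rank percentile in (0, 1]: the share of terms with a value <= this one."""
    n = len(values)
    return {k: sum(1 for x in values.values() if x <= v) / n for k, v in values.items()}


def term_scores(hits: Iterable[Hit]) -> Dict[str, float]:
    """Average rank percentile per term; lower means rarer, hence more valuable."""
    docs, types, days = defaultdict(set), defaultdict(set), defaultdict(set)
    for term, doc_id, doc_type, day in hits:
        docs[term].add(doc_id)
        types[term].add(doc_type)
        days[term].add(day)
    features = [
        {t: len(docs[t]) for t in docs},                 # how many documents
        {t: max(days[t]) - min(days[t]) for t in docs},  # over how long a period
        {t: len(types[t]) for t in docs},                # across how many doc types
    ]
    ranked = [_percentiles(f) for f in features]
    return {t: sum(r[t] for r in ranked) / len(ranked) for t in docs}


def suggest(hits: Iterable[Hit], keep: int = 2) -> List[Tuple[str, Set[str]]]:
    """Rank documents by how many of the `keep` most valuable terms they contain."""
    hits = list(hits)
    scores = term_scores(hits)
    if not scores:
        return []
    cutoff = sorted(scores.values())[min(keep, len(scores)) - 1]
    valuable = {t for t, s in scores.items() if s <= cutoff}
    per_doc: Dict[str, Set[str]] = defaultdict(set)
    for term, doc_id, _, _ in hits:
        if term in valuable:
            per_doc[doc_id].add(term)
    return sorted(per_doc.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

With this ordering, a term like an HTTP method that shows up everywhere ranks last and drops out, while a document that shares both a serial number and an IP address with the alert rises to the top.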

About Josh Gillner


Josh Gillner is a member of Apple's Detection Engineering team, responsible for writing business logic that detects and responds to security threats at massive scale. During his 7 years at Apple, he has leveraged many technologies to keep pace with ever-increasing attacker sophistication, including most recently Apache Spark. He spends most days buried in data.