Cloud, Cost, Complexity, and threat Coverage are top of mind for every security leader. The Lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all your threat data, analytics and AI in the cloud. In this talk, we will show how Lakehouse is essential for effective Cybersecurity and popular security use-cases. We will also share how Databricks empowers the security data scientist and analyst of the future and how this technology allows cyber data sets to be used to solve business problems.
George Webster: Hi, welcome to the talk about Unleashing your Security Practitioners with Data First Architectures. We’re going to be focusing on empowering our SIEMs and talking a little bit about the Lakehouse architecture and how we can use it for cybersecurity. I’m George Webster. I’m the Global Head of Science and Analytics at HSBC. My background, I’m focused on large scale data analytics, the offensive mindset, and mostly arguing for budget because unfortunately, I’m a manager and I don’t get as dirty with my hands anymore. Initially I was from the Department of Defense, the Central Intelligence Agency, Academia, and then I sold out and went to financial services. I love to cook. And if you guys can’t tell, it does show a little bit. And my picture was a professionally mandated headshot and it looks pretty good. Along with me, I brought Jason Trost. He’s focusing on developing capabilities, mostly on network security. He also works with the data science role of how can we start sticking data science into the normal practices of our security practitioners.
He’s also from the Department of Defense, he dibbled and dabbled with startups, and then just like me, he sold out, and came to the financial services. He has a very strong, meme game. It’s quite enjoyable to look at him in the chats. And look at that picture because seriously, no sane bouncer ever would have accepted that. We also brought one of our colleagues, which is Monzy. He’s a VP of Cybersecurity Go-to-Market for Databricks. And his background is, he’s really focused on serving cybersecurity teams. So how does he help us get to where we need to be? He’s from the Department of Defense as well, even if he won’t state it. He’s also worked at National Security Labs and he worked at Splunk. And way back in the day when I first met him, this was about 20 years ago, he was obsessed with green chilies and he still is today. He took that picture himself, but he also looks pretty good.
Before we begin, a little bit about the legalese of this. The numbers we’re going to talk about here, all that fun stuff, it’s coming from peer reviewed material or major publications. In no way is this anything to do with HSBC or Databricks. The same goes for the demonstrations and the architecture: we’re going to be focusing on patterns and processes, very much as a reference. This isn’t actually our code. This isn’t from HSBC. It’s not from Databricks. But we’re trying to make it as realistic or as real as possible. There’s some other stuff on there too. You can probably go ahead and read it, but we need to make sure that you’re aware that this is for a presentation. This isn’t representative of what we’re actually doing at HSBC or Databricks.
Who is HSBC? And a lot of the guys right now in the US are probably like, “I have no idea who HSBC is.” But it’s actually a multinational investment and financial services company. We were founded in about 1865. We operate almost everywhere. And we’re also by far one of the largest financial companies in the world. So we’ve got about three trillion in assets. We operate in somewhere like 64 countries. We’ve got about 226, upwards about 300,000 employees all around the globe. We have about 40 million customers. So we’re incredibly large, even if you guys in the US haven’t heard of us. As for our office, which for this talk is myself and Jason Trost, the mission statement’s there. You’re welcome to read it. But the real main takeaway here is, we’re focused on figuring out how we can empower our people, our processes, and our technology, so we can get that analyst of the future, so we can start getting ahead of everyone and the attackers and staying ahead.
So when we begin the talk, let’s talk a little bit about the problem here. And if you look at it, you’ve got a defender. It usually takes about 200 days to detect malicious activity. If you look at an investigation, once you know something’s there, we need to investigate it and figure out what’s happening so we know what we need to do. That is pretty much 54 days. And that number, it doesn’t really matter if you looked at it 20 years ago, 10 years ago, five years ago, or even if you look at it tomorrow, those numbers are fairly constant. The unfortunate part is, if you look at the attackers, it takes about 24 hours to go from victim A to victim B. And that’s a pretty bad situation. The attackers are operating in hours, while we’re operating in days. And we need to get ahead of them. We need to get better. We need to be proactive.
So when we go into this talk, in the back of your heads think, “How are we going to do this? How can we start to do this? And how do these major financial institutions actually keep everyone safe?” Our customers, the bank, all that fun stuff. If you look at the paradigm and what’s going on with the world of the SIEM, this is pretty much representative of cybersecurity: our tooling, our operations center, and how everything stitches together. We’ve got hundreds of security tools. We’re going to buy the best, we’re going to deploy the best, we’re going to close down that control. However, all these tools are pretty big beasts. They’re going to be sending alerts because that data is locked within that tool itself or that spectrum of tools or that set of tools. And that’s why we use clouds to represent email. It’s not just one tool. It could be upwards of 20 different tools. But that general alert is going to be sent to the SIEM. And the data itself starts to put pressure on that SOC.
So that SOC, that person who actually understands those security alerts, searching through it, doing their use cases, is going to be using that primary tool. But now if there’s an email issue, they’re going to sit there and say, “Great, I know there’s an issue. Now I need to go to that email tool and dig in further, and dig in further, and dig in further.” And what this causes is, the SOCs start to compensate for issues with architecture. They’re doing the analysis for us. We’re putting all that cognitive load on that person. That’s a problem. And if you think of cybersecurity though, this is a huge big data problem, right? And it’s massively costly and its capabilities are being limited. If you look at the endpoint, this is the agent you install on your laptop, and in a corporation about the size of ours, you’re looking at about 100 terabytes a day of just that log traffic.
If you start to look at your network sensors, you’re looking at somewhere between 40 and 50 terabytes a day on top of that. When you start to look at your cloud and your deployments in the cloud, there’s even more, just from your VPC flow data. That’s about another 20 terabytes. And if you start to look at just the kind of security alerting and the information the SOC also needs in the cloud, you’re starting to add five to 10 terabytes a day on top of all that. What this all means is, for security analysts, you’re looking at about 100 to 200 terabytes of data every single day. If you think of that architecture I just showed you, it starts to make sense why it was stitched together like that. You have a capability problem. But it’s a little bit worse than that. Go back to those numbers we showed. 54 days to do an investigation, 200 days to detect malicious activity: you have a historic data problem here. So if we want to retain this data for 13 months, we’re talking about somewhere between 38 and 79 petabytes.
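As a rough check on those retention figures, the arithmetic can be sketched in a few lines. The 395-day length for 13 months and the decimal conversion (1 PB = 1000 TB) are assumptions made here for illustration, not figures from the talk; under them the range comes out at roughly 39.5 to 79 PB, in line with the numbers quoted, with the low end shifting slightly depending on how months and units are counted.

```python
# Back-of-the-envelope storage estimate for the daily volumes quoted in the talk.
# Assumptions (not from the talk): 13 months is taken as 395 days, and
# 1 PB = 1000 TB (decimal units).

DAILY_TB_LOW = 100    # low end of the quoted daily volume, in terabytes
DAILY_TB_HIGH = 200   # high end of the quoted daily volume, in terabytes
RETENTION_DAYS = 395  # assumed length of 13 months


def retention_petabytes(daily_tb: float, days: int = RETENTION_DAYS) -> float:
    """Return the total stored volume in petabytes for a given daily intake."""
    return daily_tb * days / 1000


low = retention_petabytes(DAILY_TB_LOW)
high = retention_petabytes(DAILY_TB_HIGH)
print(f"13-month retention: about {low:.1f} to {high:.1f} PB")
```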
So if we look at this problem, we have a cost issue. I’m not going to throw all this amount of data into a SIEM, it’s too costly. I can’t throw it all into one tool. We also have the complexity issue. How do we start to get access to our data? How do we start to use it in near real time? And how am I able to get it into the optimal form or shape to be able to do what we need to do? And on top of it, we have the cloud, which adds additional layers of complexity for us. So in this paradigm, and what we’re developing here in our architecture, we’re trying to figure out how can we do this in a way that’s cost-effective. How can we unlock our data? How can we enable analytics? And how do we start to empower our people?
So again, it’s all about the people and we’re doing this by using Apache Spark for our ETLs, break that lock in that we have from the vendors. We’re starting to put it in Lakehouse, so we can start getting into the forms that we need it to be in, to start to create those capabilities that are going to help drive the mission and be specific for that mission, whatever that happens to be. What was it yesterday? Cool. We have something, but tomorrow we need something else. Now we’re able to start to do that and start operating at pace and speed. And then we push this information back into the SIEM, which again, unlocks that for our people and gets it in the form that they need, which allows us to pretty much start crushing it. With that, I’m going to hand it over to Jason Trost. He’s going to actually show you some of these cases and dive in a little bit deeper.
Jason Trost: Thanks, George. I’m Jason Trost, and I’m going to cover two different cybersecurity related use cases. So the first use case is threat detection in DNS data. We’re going to break this problem into two pieces. One, we want to look for lookalike domains. So pictured here are several different lookalike domains of databricks.com. If you take a look at the one I have highlighted in particular, this looks like databricks.com, but you can see an extra dot in there. So why might someone leverage these sorts of domains? Well, they look a lot like Databricks domains and these could easily be passed off and used for phishing. So you get a Databricks employee or another employee to click one of these links, maybe enter in some credentials, and the bad guy has now been able to steal those credentials. So we want to be able to detect this sort of activity if it hits our network.
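To make the idea of a lookalike domain concrete, here is a minimal sketch that generates a few kinds of variants for a brand domain, including the extra-dot case. The function name `lookalike_candidates` and the four permutation types are illustrative assumptions; dedicated open source tools cover many more permutation types than this.

```python
def lookalike_candidates(domain: str) -> set[str]:
    """Generate simple lookalike variants of a domain such as "databricks.com".

    Covers four illustrative permutation types: character omission,
    character doubling, adjacent-character swap, and an inserted dot.
    """
    name, _, tld = domain.rpartition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character, e.g. "databrcks".
        variants.add(name[:i] + name[i + 1:])
        # Doubling: repeat one character, e.g. "dattabricks".
        variants.add(name[:i] + name[i] + name[i:])
        if i < len(name) - 1:
            # Transposition: swap two neighbours, e.g. "dtaabricks".
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
        if i > 0:
            # Extra dot: split the name in two, e.g. "data.bricks".
            variants.add(name[:i] + "." + name[i:])
    variants.discard(name)  # swapping identical letters gives back the original
    variants.discard("")    # omission on a one-character name
    return {f"{v}.{tld}" for v in variants}


candidates = lookalike_candidates("databricks.com")
print(len(candidates), "candidates, for example:", sorted(candidates)[:5])
```

A DNS query can then be checked with a plain set lookup, for example `"data.bricks.com" in candidates`.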
The next area we want to focus on is DGA domains. So these are domain generation algorithm domains. And as you can see, these look really kind of long and random, and there’s a good reason for that. It’s a common trend for malware to algorithmically create domains based on pseudo-random number generators, where it might create hundreds or thousands of these domains per day and attempt to communicate with every one of them. Knowing that at least one of them will be controlled by some actor, some adversary, allows this malware to establish a command and control channel, as well as bypass any sort of blacklist that’s based on a large static list of domains. So these are another class of things that we want to be able to detect in our DNS data. So how do we do this?
Well, in an enterprise our size, we have roughly 10 terabytes worth of DNS logs coming in per day. In order to act on these threats as fast as possible, we need to perform real-time threat detection. So as soon as we find one of these things, we need an alert going to our SIEM. In order to actually detect these, we need to use things like machine learning, rules, and threat intelligence enrichments to find them in our DNS data. And lastly, like I mentioned, as soon as we find one of these, we’re going to send an alert to our SIEM. So here’s the recipe for what I’m talking about. We’re starting with a passive DNS data set. So these are all of the DNS requests that are happening within our network. And these are being dropped somewhere like Amazon S3.
And they’re being dropped into files that are getting rolled, say, roughly every two to three minutes in near real time. From there, these logs are being pulled into a Spark streaming architecture where we can then perform the enrichments that we need to perform as well as the detections. So for enrichments, like I mentioned before, we’re going to perform threat intelligence enrichments. So this is bouncing these domains off of lists of things that are known bad. We’re also going to do IP geolocation look-ups. So this allows us to understand where the IPs associated with these domains are located in the world, which is very, very useful when doing analysis and especially triaging these alerts. And lastly, we’re going to pre-generate a massive list of lookalike domains for every brand that we care about.
So for example, Databricks or HSBC, or really any other third-party vendor that we might operate with, using a tool like Dnstwist or some other open source tools. Once we have this enriched data set, it’s then primed and ready for performing the detections that we need to do. So the first one is domain generation algorithm detection. For this one, we’re going to need to use machine learning. And then the next one is lookalike domain detection. For this one, we’re going to use the enriched data set that I mentioned earlier. Lastly, in order to make this as fast as possible, we’re going to deploy this into a streaming architecture, so we can perform these analytics in near real time and send alerts to our SIEM. So what does this look like from a technology perspective? Like I mentioned, we have about 10 terabytes worth of DNS data coming in to Amazon S3 every day. This data is being pulled into an ingestion layer where we perform ingestion and ETL operations and normalize this data to ultimately land in Delta Lake, where we then store this data in a really nice format.
We can perform queries and run analytics on it, and we can perform our enrichments. And we can optimize this for the actual use cases we’re talking about. I mentioned that we need to use machine learning for the DGA detection. So for this, we use MLflow for setting up our training jobs and performing the classifications. Once we have our machine learning engine running across this DNS data, we’re treating this almost as a filter. So we score these domains that are coming in, and anything that looks like a DGA domain and has a high enough score is going to be sent as an alert to our SIEM. So this, in essence, is taking all of our massive 10 terabytes worth of logs per day and funneling that down to just the interesting events, down to megs or potentially gigs of logs going to our SIEM per day. We also want feedback coming back from the SIEM. So as analysts triage these alerts and find false positives or have other issues, they can mark those and we can pull this back into the training process to make sure that our machine learning model continuously gets better.
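The score-as-a-filter idea, together with the analyst feedback loop, can be sketched as follows. This is only an illustration under stated assumptions: the class name `DgaFilter`, the 0.9 threshold, the alert fields, and the sample domains and scores are all made up here, and the scores stand in for whatever a trained model would produce.

```python
from dataclasses import dataclass, field


@dataclass
class DgaFilter:
    """Forward only high-scoring domains and collect analyst feedback."""

    threshold: float = 0.9  # assumed cut-off; would be tuned per model
    feedback: list = field(default_factory=list)

    def alerts(self, scored):
        """Yield alert dicts for (domain, score) pairs at or above the threshold."""
        for domain, score in scored:
            if score >= self.threshold:
                yield {"domain": domain, "score": score, "type": "dga"}

    def record_feedback(self, domain: str, is_false_positive: bool) -> None:
        """Store an analyst verdict as a labelled example for later retraining."""
        label = 0 if is_false_positive else 1  # 1 = confirmed DGA
        self.feedback.append((domain, label))


# Made-up domains and scores standing in for model output.
scored_stream = [
    ("databricks.com", 0.02),
    ("xkqjhzvwpbtrn.info", 0.97),
    ("cdn-assets-01.example.net", 0.91),
]

dga_filter = DgaFilter()
sent_to_siem = list(dga_filter.alerts(scored_stream))

# An analyst triages one alert and marks it as a false positive.
dga_filter.record_feedback("cdn-assets-01.example.net", is_false_positive=True)

print(sent_to_siem)
print(dga_filter.feedback)
```

Only the two high-scoring domains become alerts, and the analyst's verdict is kept as a labelled example that a later training run could pick up.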
We also want to use things like SQL analytics to perform querying and reporting. So for the lookalike domain detection, for example, that’s purely a many-to-many join between the DNS data and that massive lookalike table. So for this, we can use SQL analytics, perform queries and reporting, and make sure that this gets pulled into the SIEM as needed. Your analysts can take full advantage of this. So the benefits of this approach are twofold: speed and scale. We’re able to process 10 terabytes worth of data per day, which is something we couldn’t do before. It’s something that’s really not cost-effective to do directly in the SIEM. We’re able to augment our SIEM economically. So instead of processing all 10 terabytes in the SIEM, we process them outside of our SIEM in a system that is highly economical and fast for doing this. We send only those alerts that we need to back to the SIEM. We’re able to leverage things like machine learning and advanced analytics, also something that our SIEM is not capable of doing, especially at this scale.
And then lastly, we’re maintaining that real-time detection capability for DNS threats, which is really a requirement for everything we do: it needs to be as real time as possible so we can act on it as quickly as possible. So now I’m going to go into the second use case, which is large-scale threat hunting. The goal of threat hunting is to sift through cybersecurity log data in order to find signs of malicious activity, both current and historical, that have evaded existing security defenses. So this is a very proactive activity that a lot of large enterprises do these days, with the goal of trying to find adversaries that may have somehow, some way made it into the network and are operating there. So we want to be able to throw as many analytic techniques and advanced processes at this as we can, to see if we can find them outside of our existing controls. So how do we do this? Well, one, in order to do this, we need to be able to explore large amounts of historical log data. We also need to be able to correlate activity across log sources.
So for example, take some of the logs that George mentioned earlier: we start with endpoints. We want to detect activity moving from the endpoint, through the network, into the cloud. That means we need to be able to do many-to-many joins across those three different datasets and do this automatically at that scale. Next we want to leverage analytics, not only detection and machine learning. So these adversaries are doing everything they can to evade detections. They try and make their attacks blend in with normal traffic. So we really need to be able to up our game and use advanced techniques where necessary to help us detect them. Now lastly, all of this needs to be repeatable, self-documenting, and team oriented. So as threat hunts occur, we want the threat hunters to be able to collaborate with each other, share results quickly, and ideally make reusable tools that can be pulled off the shelf later and used for future threat hunts.
Not only that, all of this needs to be done at scale. So with the large numbers George mentioned earlier, you really need to apply these four things at that scale, which makes this a pretty massive problem. By the way, we also need to be able to do this at pace. Threat hunting is generally not some sort of leisurely research activity that’s done over long periods of time. Generally, this is something that is kicked off by major world events. A threat intelligence bulletin is released, our executive stakeholders learn about something that might hit the news, and they really need to know as fast as possible: is this threat impacting us? And that’s where our threat hunters have to jump in and help. So let’s go one layer deeper into what I mean, with a hypothetical example. A new mass supply chain attack is discovered in the wild, and the details of this activity get made public through a government threat intelligence bulletin.
This bulletin contains many details, including the tactics, techniques and procedures of the adversary, as well as things like domain names, IP addresses and malware file hashes that were actually used in the attack. But the report also claims the activity that they found started about a year ago. So this report gets made public, our executive stakeholders learn about it, and they want to know, is the adversary in our network right now? Or was this adversary ever in our network? So that’s where the threat hunters need to step in, and the scope of this threat hunt is going to be 12 months. Because this activity started about a year ago, to do our due diligence we really need to go back at least a year. So what does this look like? How do we execute a threat hunt like this? In most companies, the SIEM is where security data lives. And this is because the SIEM is kind of the center of the universe for all detection and response actions.
But 12 months of EDR and network logs, like George mentioned earlier, are likely several petabytes worth of data and are just not going to be in the SIEM for economic reasons. And even if we could put them in the SIEM, most SIEMs are just not designed for large and complex historical searching over this scale of data. They don’t support things like many-to-many joins. So if we wanted to follow the trail from the endpoint through the network and into the cloud, it’s just not going to be possible. They don’t adequately support machine learning and AI use cases, especially at the scale that we need, and they’re not open platforms. So we need a better way to do this. So what does this look like from a technology perspective? We have our massive logs coming in, cloud, endpoint, network, et cetera, amounting to roughly a hundred terabytes worth of data per day. So let’s store this in cheap cloud storage and use Delta for our ingestion.
So cheap cloud storage is going to more than handle our retention requirements and be much, much cheaper per gig than our SIEM’s ability to process this data. And Delta Lake provides a really nice ingestion layer and formatting of this data to allow us to perform complicated queries and analytics at the scale that we need to. Let’s use Spark. We really want something that can take advantage of the elasticity of the cloud, using something like Elastic MapReduce or the Databricks platform to leverage Spark, to reach into Delta Lake, grab the data that it needs and perform the complex analytics that our threat hunters need to do. In order to expose the Spark cluster and Delta Lake to our hunters, we’re using Databricks notebooks, which allow them to easily search and query this historic data at the scale that they need to, taking full advantage of the elasticity of the cloud, the Spark cluster, and the economic storage within Delta Lake.
And lastly, our threat hunters are now able to develop Databricks notebooks that help them codify their threat hunts, allowing them to perform queries, get back results, iteratively answer questions, and then share the results and collaborate across the team. So a quick summary: the benefits of this approach are both scale and speed. We’re now able to handle processing all the required data, about 100 terabytes worth of data per day. We’re able to increase our online queryable retention from days to many months, which gets us into petabyte scale, which is something we absolutely need. We anticipate that the scopes of our threat hunts can now be much larger, because we have both more data online and more capacity to actually process this data. And for speed, we can now perform these advanced analytics at the pace of the adversaries. Our threat hunters have these Databricks notebooks and these massive Spark clusters they can take full advantage of to ask questions as fast as they need to and get back results quickly.
The threat hunts are also now reusable and self-documenting through notebooks. So we’re hoping that because of this, our threat hunters will easily be able to pull these notebooks off the shelf for future threat hunts that are similar, or potentially, every couple of months, rerun the past hunts with updated parameters and just see what has changed. Because of these two things, we anticipate that we will be able to execute two to three more hunts per analyst, because they are no longer going to be bound by hardware. And they’re going to have a lot more reusable tools at their disposal. Now that I’ve talked about the benefits of this approach, Monzy’s going to walk through a detailed demo of a threat hunt using Databricks.
Monzy Merza: Thank you, Jason. You all heard from George and Jason about the massive scale that they need at HSBC for security operations, and Jason talked about the DNS and threat hunting use cases. So before I jump into the demo, let me give you an overview of what the demo is going to look like. First, as I go through the demo, I want you to notice that I’m talking about multiple personas. I believe that Databricks for security is relevant to all security teams. So we’re going to talk about personas like data scientists. We’re going to talk about personas like security practitioners, and I’m even going to show you where someone who is not necessarily a security practitioner or a data scientist can also use Databricks. We’re going to look at the DNS use cases, just the way Jason talked about. I’m going to deep dive into the DGA analytic piece to show the data science persona and how that works in Databricks, and then we’re going to go and look at this problem.
When you have these piles and piles of IOCs, how do you do the match against many to many joins that Jason talked about? That’s a difficult problem. So with that, let’s jump into the demo. So here you see on my screen that there is just a simple form field that has a domain name in there. And what if you just wanted to know if this domain name is actually a dynamically generated domain name? So here we can just type in the domain name and we can run and execute that. It’s a simple form execution, and you can see where it says the score for this domain name is IOC. Very easy to execute, but now you’re thinking, well, how the heck did you figure this out? Now behind all of this, we built a DNS notebook, a DNS detection notebook that follows the recipe that Jason laid out.
It has the ingestion. It has the detection models, the threat intel enrichments, and the productionization of those detection models with machine learning, using MLflow. And all of that cycle is what happened in the background, which enabled us to do a very simple search while pulling all of those things together. So now let’s dive in and see what it looks like from a data scientist’s perspective when they’re trying, in Databricks, to develop this machine learning model for dynamically generated domain names. On your screen, what we have is a DGA model that’s already been built. There are lots of different examples of DGA models online. This particular one is built in Python. And so you can see here that the first thing that we’ve done is use scikit-learn to pull in the methods and functions that we need in order to test for weirdness. We are going to look across multiple types of things in this model itself.
We’re going to look at entropy. We’re going to look at n-grams from Alexa, because this model treats the Alexa one million list as a non-DGA list, that is, a list of domains that were not generated by a domain generation algorithm. And then we’re going to use MLflow to actually execute and productionize that in streaming. But before we do all of that, we’re going to train this model. So I’m here in this line 1819, this is where we’re going to train this model and create the fit for this model as well. And when all of that is done, we’re actually going to use MLflow to store this model and have an identification for this model, so you can go back over time and see how well this model is working. So now you’ve seen the data science persona. Now let’s take a look at what the threat hunter persona looks for when they have these piles and piles of IOCs.
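As a rough illustration of the entropy and n-gram features mentioned for the DGA model, here is a minimal sketch in plain Python. It is not the notebook's code: the function names, the trigram size, and the five-name benign list (a tiny stand-in for a top-sites list) are assumptions made here, and a real model would feed features like these into a trained classifier.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def known_ngram_ratio(label: str, benign_ngrams: set[str], n: int = 3) -> float:
    """Fraction of the label's n-grams that also occur in a benign corpus."""
    grams = ngrams(label, n)
    if not grams:
        return 0.0
    return sum(g in benign_ngrams for g in grams) / len(grams)


# Tiny stand-in for a benign corpus such as a top-sites list.
benign_labels = ["google", "databricks", "wikipedia", "amazon", "youtube"]
benign_ngrams = {g for label in benign_labels for g in ngrams(label)}

# "xkqjhzvwpbtrn" is a made-up random-looking label.
for label in ["databricks", "xkqjhzvwpbtrn"]:
    print(label,
          round(shannon_entropy(label), 2),
          known_ngram_ratio(label, benign_ngrams))
```

The random-looking label has higher character entropy and shares no trigrams with the benign names, which is the kind of signal a classifier can learn from.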
Jason showed you this example where he said, there’s a report that comes in and they say, “Well, have we been impacted by this particular threat actor?” So here, I’m going to show you this many-to-many join example that a lot of threat hunters work with all the time. On this screen, I want to call your attention first to this command block 10, where I have the select count from the silver threat feeds. What I wanted to show you here is that when you get IOCs, they’re not just one or two. They may be in the tens or hundreds. In this example, there are 27,000 IOCs that we have to match. And if we scroll down below, you can very quickly see that we are going to match all of the domain names in our DNS domain name table against all of these 27,000 IOCs. And we’re going to execute that search.
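The shape of that IOC match can be sketched with an ordinary SQL join. The example below uses Python's built-in `sqlite3` module as a small stand-in for the SQL engine in the demo; the table names `dns_logs` and `threat_feed`, their columns, and all rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the lakehouse tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dns_logs (ts TEXT, src_ip TEXT, domain TEXT);
    CREATE TABLE threat_feed (ioc TEXT, source TEXT);
""")

conn.executemany("INSERT INTO dns_logs VALUES (?, ?, ?)", [
    ("2021-05-01T10:00:00", "10.0.0.5", "bad.example.net"),
    ("2021-05-01T10:05:00", "10.0.0.9", "bad.example.net"),
    ("2021-05-01T10:06:00", "10.0.0.9", "databricks.com"),
])
conn.executemany("INSERT INTO threat_feed VALUES (?, ?)", [
    ("bad.example.net", "feed_a"),
    ("bad.example.net", "feed_b"),
    ("other.example.org", "feed_a"),
])

# Many-to-many: one domain can appear in many DNS rows and in many feeds,
# so every matching (log row, feed row) pair is returned.
rows = conn.execute("""
    SELECT d.ts, d.src_ip, d.domain, t.source
    FROM dns_logs AS d
    JOIN threat_feed AS t ON d.domain = t.ioc
    ORDER BY d.ts, t.source
""").fetchall()

for row in rows:
    print(row)
```

Two DNS rows for the same domain matched against two feed entries yield four result rows, which is why this kind of join grows quickly with tens of thousands of IOCs and needs an engine built for it.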
So now you can see that it runs in a flash, because Databricks enables this very, very massive compute capability. And so now we can get our results. And one of the results that’s very interesting for us is, all of these different features that we’re getting from different websites that provide this, for example, the malware downloads, or URLhaus, and so on. I really want to send a shout out to the URLhaus guys for making this threat feed available to everybody in the community so that we could actually do this demo for this many to many join example for you. So with that, that ends the two sections of using SQL analytics that you saw, you saw us use a form based search, and you saw what it looks like when we want to do this many to many join while we’re doing threat hunting.
So now I’m going to switch back to the PowerPoint and walk through one more set of things that Jason talked about. One of the biggest points was, even though SIEMs might not scale or might be difficult to work with, they’re still part of the environment and we want to make sure that operation continues. So if we go to the next slide, what you will see is that we have the Databricks add-on for Splunk that enables us to send queries and notebooks and jobs from Splunk into Databricks, and then get the results and kick off searches and results back in Splunk. Next slide please. And so here in this next slide, if you’re familiar with Splunk, you can see that this is the Splunk UI that everyone knows, and you’re simply using the Databricks query command to send a query to Databricks from within Splunk, and you’re getting the result back into Splunk.
In the next slide, what you’re going to see is that we’re actually executing a notebook in Databricks from Splunk and getting the results back into Splunk again. Again, you’re not leaving your Splunk UI at all. If you go to the next slide, what you’ll see is that we have this complex Splunk search, which has the Databricks query as just one part of the overall search pipeline. So I put this together so you can see that everything can work together and that you can do saved searches and other automated tasks within your Splunk pipelines in order to take advantage of Databricks. And the last slide that I want to show you for this demo section is around what happens when the results from something that happens in Databricks go into Splunk, and you can see those results here.
In this example, that form search that we ran very early on actually created an event that then got indexed into Splunk. And this is the IOC event that you see in JSON format in the Splunk screen here. So now let’s summarize what we did. So if we go to the next slide, and maybe advance one more, please. Let’s just talk about the conclusion. What have you seen during the course of this session? So four big key takeaways on this slide. Next slide, please. First, you saw that there is a large gap between the threat hunters and the attackers, or the security teams and the attackers: hours, the 24 hours that George mentioned, versus the 200 days to detect them and then the 54 days to investigate. So that’s a big gulf.
Second, legacy SIEMs are not very good at this kind of activity for petabytes and petabytes of data that Jason and George talked about. And what they did, was they implemented the Lakehouse architecture at HSBC to address these problems that they have. And lastly, all of these methods can really unlock your teams across all of your environment as you saw. So what’s next? So what I’d like you to do is to check out the deep dive demos for Databricks, and also, if you want to schedule a hands-on training or a workshop for your organization, so the Databricks teams can come in and help you. The DNS notebook that you saw is actually publicly available, if you just Google for DNS detect criminals, it usually shows up as the first or second within Google.
And if you really need to talk to us about anything or just curious about something, we’re very easy to find, take out your phone, take a picture of this, or you’ll have the slides for this later. You can use those. I’m [email protected] and both the Databricks cyber teams and HSBC cyber teams are available at [email protected] and HSBC. Thank you so much for joining this session and I’m looking forward to hearing from you soon. Thank you George, and thank you, Jason for partnering with Databricks on this. We really appreciate you as our customer and as our partner.
George Webster is the Global Head of Cybersecurity Science and Analytics. George is responsible for empowering the Cyber Security mission in protecting the bank by driving proactive tactical and st...
Monzy Merza is the Vice President of Cybersecurity Go-to-Market for Databricks. He is responsible for driving Databricks cybersecurity business strategy. Along his 15 years of experience, he has held ...
Jason Trost is Head of Analytic Engines in HSBC's Cybersecurity Sciences and Analytics division. He is deeply interested in network security, DFIR, big data and security data science. He has worked in...