
The Databricks Lakehouse Platform for Cybersecurity Applications

The best kept secret in the cybersecurity industry
Lipyeow Lim
David Wells
Anna Cuisia

Visit the GitHub repo for the IOC matching solution accelerator. Or, we can help you with a Proof-of-Concept (POC), so contact us at [email protected]


Cybersecurity remains a significant data challenge due to the growing number of financial institutions, healthcare providers, and government entities moving their data to the cloud, and the rise of IoT sensors and interconnected devices. Amidst ongoing geopolitical threats, enterprises are turning to the Databricks Lakehouse platform for their cyber operations because of its ability to handle large amounts of data, support complex data processing tasks (including advanced analytics capabilities such as artificial intelligence and machine learning), and scale cost-effectively. The Databricks Lakehouse platform is a hidden gem in the cybersecurity industry that unifies data, analytics and AI in a single platform.

Enterprises and cybersecurity vendors build their cyber products and services on the Lakehouse platform; global system integrators (GSIs) and partners build their cybersecurity solutions on top of the Lakehouse.

This blog will showcase partners and customers building cybersecurity applications and solutions on the Lakehouse that all Lakehouse customers can leverage.

We partner with global system integrators and partners, such as Optiv, Deloitte, KPMG, Slalom, Accenture, Booz Allen, and Ernst & Young, to help Lakehouse customers build cybersecurity solutions. We also partner with leading cybersecurity ISVs to deliver innovative applications to Lakehouse customers. Databricks partners with Hunters for a SOC platform; HiddenLayer and AIShield to secure AI/ML models; Mach5 for full-text search; Graphistry for GPU-accelerated graph visualization and AI; Monad and Cribl for data ingestion; Theom, Immuta, Securiti, and Privacera for cloud data security; and Splunk for security orchestration, automation & response (SOAR) capabilities.

Cybersecurity Ecosystem powered by Databricks

Why do customers and partners build on Databricks Lakehouse?

#1 The Retention-Cost-Scale (RCS) triangle

Given the long-tail distribution of the dwell time of modern security threats, cybersecurity defenders now have to analyze longer windows of security data, which requires data retention of at least one year (the SolarWinds hack is a good example). Queries over this long-retention security data must be fast without breaking the bank. Legacy security tools can handle at most two of the three requirements in the RCS triangle (retention, cost, and scale), but not all three. By separating storage from compute, the Databricks Lakehouse platform can handle the increased workload of long-retention queries without added administrative burden or investment in additional infrastructure, making the Lakehouse platform very cost-effective for cybersecurity applications.

For example, HSBC expanded their retention and enabled their threat hunters to perform 3x more hunts while lowering the total cost of ownership by leveraging the Lakehouse.

#2 A comprehensive and secure big data platform

The Databricks Lakehouse platform is a comprehensive and secure big data platform that unifies:

  1. Batch and real-time stream processing,
  2. Structured, semi-structured, and unstructured data,
  3. Analytics and AI/ML

Our cybersecurity ISV customers often have strict service level agreements (SLAs) on threat detection latencies and need to rely on real-time stream processing. Yet, in threat hunting and/or compliance applications, batch processing might be leveraged to control costs. The Databricks Lakehouse platform supports both batch and stream processing without complicated lambda architectures. For instance, cybersecurity ISVs like Akamai leveraged the streaming capabilities of the Databricks Lakehouse platform to sift through 10GB of events per second to keep their customers safe.
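To make the "one code path for batch and streaming" idea concrete, here is a minimal sketch in plain Python. It is not the Databricks or Spark API; the event fields, the threshold, and the function names are hypothetical and chosen only for illustration. The point is that the detection logic is written once and applied to both data at rest and data in motion.

```python
from typing import Iterable, Iterator

# Hypothetical detection rule: flag events with many failed logins.
FAILED_LOGIN_THRESHOLD = 5


def detect(events: Iterable[dict]) -> Iterator[dict]:
    """Yield an alert for every event that reaches the threshold.

    The function is lazy, so it accepts a finite list (batch) or a
    generator that produces events over time (stream) unchanged.
    """
    for event in events:
        if event.get("failed_logins", 0) >= FAILED_LOGIN_THRESHOLD:
            yield {"user": event["user"], "reason": "excessive failed logins"}


def event_stream() -> Iterator[dict]:
    """Stand-in for a live feed; a real source would not terminate."""
    yield {"user": "alice", "failed_logins": 1}
    yield {"user": "mallory", "failed_logins": 9}


# Batch: historical events that are already stored.
batch = [
    {"user": "bob", "failed_logins": 7},
    {"user": "carol", "failed_logins": 0},
]
batch_alerts = list(detect(batch))

# Streaming: the same function consumes events as they arrive.
stream_alerts = list(detect(event_stream()))
```

In a lambda architecture, the batch and streaming paths would each carry their own copy of `detect`, and the two copies would have to be kept in sync by hand.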

Another reason customers choose Databricks for cybersecurity is the ability to handle all types of data: structured, semi-structured, and unstructured. Cybersecurity is rife with semi-structured JSON logs, unstructured text, and binary data. Email security often deals with unstructured email data. Endpoint security often deals with binary executables and malware. Network security usually involves binary packet capture ("PCAP") data. The Databricks Lakehouse platform is a unified big data platform that can store, process and analyze these three data types at scale. Enterprise customers like HSBC use the Databricks Lakehouse platform to store, process and analyze all three types of cybersecurity data for threat hunting and other cybersecurity use cases.

Before the advent of the Lakehouse, organizations were forced to split their data into data warehouses for analytics and data lakes for AI/ML. With Databricks Lakehouse, customers do not need two technology stacks. For big data problems like cybersecurity, avoiding the need to maintain multiple copies of multi-petabyte data unlocks innovations with AI/ML. Databricks' machine learning capabilities can be used to build predictive models that help identify potential cybersecurity threats before they occur. These models can be trained on large amounts of data to identify patterns and anomalies that may indicate a security breach or other security issue. Cybersecurity ISVs like Abnormal Security and Barracuda Networks leverage Databricks to develop advanced AI/ML for email threat detection.

The comprehensive capabilities described above require platform security. Databricks has adopted a multi-layered approach to security, combining network security, data security, and access controls that help enable the protection and privacy of your data. Network security includes VPC and VNet isolation, Private Link connectivity for users and data, and IP-based access controls that help prevent the exfiltration of sensitive data. Data security is supported using AES-256 encryption for data at rest and TLS 1.2+ for data in transit.

Unity Catalog delivers unified governance across any cloud environment for data and AI assets in your Lakehouse, including files, tables, machine learning models, and dashboards. Moreover, Databricks employs fine-grained access controls such as role-based access control (RBAC) and secrets management to guarantee that users have appropriate permissions when accessing data. Lastly, Databricks provides regulatory and compliance enablement through our PCI DSS, HIPAA, HITRUST, FedRAMP, and IL5 offerings, simplifying adherence to stringent security standards.

#3 Multi-cloud, multi-region

Most enterprises today utilize multiple clouds and regions, requiring cybersecurity monitoring of the resources in multiple clouds and regions. Cloud egress costs and data sovereignty regulations often prohibit the consolidation of all the required cybersecurity data into one central location. Customers increasingly leverage the multi-cloud capabilities of the Databricks Lakehouse platform to perform threat detection locally in the Databricks workspaces in each cloud and each region while still providing the ability to query across all the Databricks workspaces as one logical data store. The Databricks Lakehouse platform is cloud-agnostic and enables the federation of data sources in multiple clouds and regions, thereby minimizing egress costs and complying with data sovereignty regulations.
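The pattern of detecting locally and querying globally can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Databricks implements federation: the region names, the in-memory log stores, and the helper functions are all hypothetical. Each region computes a small summary next to its own data, and only those summaries cross the region boundary.

```python
from collections import Counter

# Hypothetical per-region log stores. In practice each would sit in its own
# workspace; plain lists stand in for them here.
REGIONAL_LOGS = {
    "cloud-a/us-east": [
        {"src_ip": "203.0.113.7", "bytes": 512},
        {"src_ip": "198.51.100.4", "bytes": 128},
    ],
    "cloud-b/eu-west": [
        {"src_ip": "203.0.113.7", "bytes": 2048},
        {"src_ip": "192.0.2.10", "bytes": 64},
    ],
}

SUSPICIOUS_IPS = {"203.0.113.7", "192.0.2.10"}


def local_summary(events, suspicious_ips):
    """Runs next to the data: count hits per suspicious IP in one region."""
    return Counter(e["src_ip"] for e in events if e["src_ip"] in suspicious_ips)


def federated_hits(stores, suspicious_ips):
    """Merge the small per-region summaries; raw logs never leave a region."""
    total = Counter()
    for events in stores.values():
        total.update(local_summary(events, suspicious_ips))
    return total


hits = federated_hits(REGIONAL_LOGS, SUSPICIOUS_IPS)
```

Because only the per-IP counts move between regions, the volume of data that crosses a cloud boundary stays small, which is the property that keeps egress costs down and raw logs within their jurisdiction.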

The simplicity of the Databricks Lakehouse platform

Although some believe that a "best of breed" approach is the best way to improve defensive cybersecurity, the Databricks Lakehouse platform takes a different approach. The simplicity of the Lakehouse has accelerated the velocity of development and innovation for our customers by unifying streaming-and-batch, structured-to-unstructured, analytics-AI/ML, multiple clouds and regions, and retention-cost-scale. Our cybersecurity ISV partners can innovate new features faster, shorten the time to market for those features, and enter the market with a competitive advantage. Our enterprise customers can be more agile with Databricks Lakehouse in keeping up with and getting ahead of the fast-changing threat landscape.

The Databricks partner ecosystem for cybersecurity

Databricks partners broadly with other ISVs and SIs to support our customers' journey in adopting the Databricks Lakehouse pattern for cybersecurity. Some of our customers can build their cybersecurity applications on Databricks Lakehouse themselves. Others need assistance and rely on SI partners like Optiv, Deloitte, KPMG, Slalom, Accenture, Booz Allen, and Ernst & Young for implementation.

Since many customers rely on the Databricks Lakehouse for AI & ML, we partner with HiddenLayer and Bosch AIShield, which provide the ability to protect against AI/ML threats, including adversarial ML.

Graph visualization and analytics are increasingly relevant to cybersecurity applications because of the need to understand complex relationships in the threat landscape. We partner with Graphistry for GPU-accelerated graph visualization and AI.

Full-text search is critical for cybersecurity investigations. We partner with Mach5 for full-text indexing of the data in Delta Lake and the ability to use OpenSearch API-compatible tools like Kibana to search the data in the Lakehouse.

To help our customers ingest various cybersecurity data into the Lakehouse, we partner with Monad, Hunters, and Cribl.

To help our customers govern and secure their entire data real estate, we partner with Theom, Immuta, Securiti, and Privacera.

For sharing and collaborating on each organization's threat intelligence securely and privately, we partner with HeliosData for their clean room technology.

We partner with Splunk and Tines for security orchestration, automation & response (SOAR) capabilities.

For security operations, we partner with Hunters to provide a complete SOC platform solution. Hunters is our first cybersecurity ISV partner where customers can bring their Databricks workspace. The Hunters product will work on top of a customer-owned Lakehouse. The openness of this model gives the customer full ownership and control of their cybersecurity data. It unlocks the possibility of the customer leveraging the cybersecurity data in the Lakehouse to innovate and further protect their business.

Databricks is bringing threat intelligence feeds into the Databricks Marketplace. SilentPush and TegoCyber are the first threat intelligence data providers in the data marketplace. Databricks customers can easily leverage their threat intelligence feeds within their Databricks Lakehouse using Delta Sharing. Those feeds can enrich cybersecurity logs, telemetry, and alerts. Customers can also use them to perform indicator-of-compromise (IoC) matching on ingest of cybersecurity logs for early detection of malicious activities.
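As a rough illustration of IoC matching on ingest, the sketch below checks each incoming log record against an indexed threat intelligence feed and attaches any matches. It is plain Python rather than the solution accelerator's code, and the feed contents, field names (`dst_ip`, `dns_query`), and function names are assumptions made for the example. The addresses and domains come from ranges reserved for documentation.

```python
# Hypothetical threat intelligence feed; a real one would arrive as a shared table.
IOC_FEED = [
    {"type": "ip", "value": "203.0.113.7", "source": "example-feed"},
    {"type": "domain", "value": "bad.example.com", "source": "example-feed"},
]


def build_index(feed):
    """Index indicators by (type, normalized value) for constant-time lookup."""
    return {(ioc["type"], ioc["value"].lower()): ioc for ioc in feed}


def enrich_on_ingest(log, index):
    """Return a copy of the log record with any matching indicators attached."""
    candidates = [
        ("ip", log.get("dst_ip", "")),
        ("domain", log.get("dns_query", "")),
    ]
    matches = [
        index[(kind, value.lower())]
        for kind, value in candidates
        if (kind, value.lower()) in index
    ]
    return {**log, "ioc_matches": matches}


index = build_index(IOC_FEED)

suspicious = enrich_on_ingest(
    {"host": "laptop-17", "dst_ip": "203.0.113.7", "dns_query": "Bad.Example.com"},
    index,
)
clean = enrich_on_ingest(
    {"host": "laptop-18", "dst_ip": "198.51.100.4", "dns_query": "example.org"},
    index,
)
```

Matching at ingest time means a record is flagged as soon as it lands, instead of waiting for a later scheduled scan over the stored logs.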

The secret sauce to futureproof your cyber tech stack is ready for testing

The secret is out: it is possible to have comprehensive visibility into your organization's cybersecurity landscape at a reasonable cost. The Databricks Lakehouse platform can provide retention-cost-scale at a lower TCO than legacy cybersecurity tools, thereby increasing your visibility and improving your security posture. Moreover, with the rise of generative AI and LLMs, the Lakehouse paradigm can futureproof your cybersecurity technology stack. LLM technology like the Databricks Dolly model can revolutionize cybersecurity operations with cybersecurity co-pilot capabilities that can alleviate the cybersecurity labor shortage in the industry.

You can test the Databricks Lakehouse platform for cybersecurity. After you set up your Databricks workspace, clone the GitHub repo for the IOC matching solution accelerator and run the notebooks. We would love to work on a Proof-of-Concept with you, so contact us at [email protected] if you have any questions.
