Cutting the Edge in Fighting Cybercrime: Reverse-Engineering a Search Language to Cross-Compile it to PySpark
- Data Lakes, Data Warehouses and Data Lakehouses
- Moscone South | Upper Mezzanine | 159
- 35 min
Traditional Security Information and Event Management (SIEM) approaches do not scale well for data sources that produce 30 TiB per day, which led HSBC to create a Cybersecurity Lakehouse with Delta and Spark. The platform was built to overcome several conventional technical constraints: the limited amount of data available for long-term analytics in traditional platforms, and query languages that are difficult to scale and time-consuming to run. A further complication is that few cybersecurity analysts have a deep understanding of Apache Spark.
In this talk we’ll learn how to implement (or, more accurately, reverse-engineer) a search language in Scala and translate it into a form that Apache Spark’s Catalyst engine understands. We’ll guide you through the technical journey, including examples of Databricks Notebooks and code blocks, of building Spark equivalents of the query language’s constructs and of implementing search query language features that are not available out of the box, such as IP CIDR matching or fuzzy matching across all columns. We’ll also show you how to use the same framework for PySpark code generation and use-case reconciliation.
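As a rough illustration of the translation idea described above, here is a minimal Python sketch; it is not the Scala/Catalyst implementation presented in the talk. It assumes a hypothetical toy grammar of whitespace-separated `field=value` terms that are implicitly ANDed, and it emits PySpark source code as a string. The names `cidr_match`, `term_to_expr` and `to_pyspark` are made up for this example, and the generated code assumes that `pyspark.sql.functions` is imported as `F` and that `cidr_match` has been registered as a SQL UDF (for instance with `spark.udf.register`).

```python
import ipaddress
import re


def cidr_match(ip: str, cidr: str) -> bool:
    """Return True if `ip` lies inside the `cidr` network, False on bad input."""
    try:
        return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)
    except ValueError:
        # Unparsable addresses or networks simply do not match.
        return False


# Hypothetical toy grammar: one `field=value` pair per whitespace-separated term.
TERM = re.compile(r"^(\w+)=(\S+)$")


def term_to_expr(term: str) -> str:
    """Translate one search term into the source of a PySpark Column expression."""
    m = TERM.match(term)
    if m is None:
        raise ValueError(f"unsupported term: {term!r}")
    field, value = m.groups()
    if "/" in value:
        # CIDR literal: delegate to a UDF assumed to be registered as `cidr_match`.
        sql = f"cidr_match({field}, '{value}')"
        return f'F.expr("{sql}")'
    if "*" in value:
        # Wildcard: map the search-style `*` onto SQL LIKE's `%`.
        pattern = value.replace("*", "%")
        return f"F.col({field!r}).like({pattern!r})"
    # Plain equality; parenthesised because `&` binds tighter than `==` in Python.
    return f"(F.col({field!r}) == {value!r})"


def to_pyspark(query: str, table: str) -> str:
    """Generate PySpark source that filters `table` by every term in `query`."""
    condition = " & ".join(term_to_expr(t) for t in query.split())
    return f"spark.table({table!r}).where({condition})"


generated = to_pyspark("user=adm* src_ip=10.0.0.0/8", "events")
print(generated)
```

Generating source text rather than executing the query directly is what makes a sketch like this usable for code generation: an analyst can read the emitted PySpark, paste it into a notebook, and edit it. A real translator would need a proper parser and escaping of field names and values, both of which this example skips.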
We’ll also learn how HSBC’s business benefited from this innovation: less time and fewer resources spent on migrating Cyber data processing, improved Cyber threat Incident Response (IR), and faster onboarding of HSBC Cyber Analysts onto Spark with the Cybersecurity Lakehouse platform.