An Advanced S3 Connector for Spark to Hunt for Cyber Attacks
- Data Engineering
- Moscone South | Level 2 | 211
- 35 min
S3 is different from HDFS: the architecture of an object store makes the standard Spark file connector inefficient when working with S3.
Fetching metadata for paths and files takes longer on an object store than on a filesystem. Recent Spark versions brought a few improvements in reading from object stores, focused on either reducing the number of calls to the S3 API or optimizing those calls. This improves performance, but it is not enough for long-running streaming applications.
One way to tackle this problem is a message queue that listens for changes in a bucket. But what if an additional message queue is not an option for you and you need to use Spark Streaming? You can fall back on the standard file connector, but performance quickly degrades as files accumulate in your source path.
We want to present our file connector for object stores, which we will open-source later this year. Our solution differs from existing ones in how it approaches the problem: we exploit the metadata encoded in the path structure, especially time-related components such as year, month, and day, to shrink the set of folders that must be listed.
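To make the idea concrete, here is a minimal pure-Python sketch of time-based prefix pruning. The function name and the `year=/month=/day=` layout are illustrative assumptions, not the connector's actual implementation: given a time window, it generates only the date-partitioned prefixes that need listing, instead of walking the whole bucket.

```python
# Hypothetical sketch of prefix pruning; the path layout is an assumption.
from datetime import date, timedelta


def date_prefixes(base, start, end):
    """Yield base/year=YYYY/month=MM/day=DD/ prefixes for each day in [start, end]."""
    day = start
    while day <= end:
        yield f"{base}/year={day:%Y}/month={day:%m}/day={day:%d}/"
        day += timedelta(days=1)


prefixes = list(date_prefixes("s3://logs", date(2019, 4, 23), date(2019, 4, 25)))
# Only three prefixes are listed, no matter how much older data the bucket holds.
```

The point is that each listing call is scoped to a narrow prefix, so the number of S3 `LIST` requests grows with the size of the time window, not with the total size of the bucket.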
We will explain why we implemented our solution with DataSource v1, and show the possible gains of implementing the connector with DataSource v2, e.g. a custom partition schema whose fields would be available as columns in Spark SQL.
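As a rough illustration of what a custom partition schema could expose, the pure-Python sketch below (helper name and regex are assumptions, not a DataSource v2 API) parses the date components out of an object path; in a v2 connector, values like these could surface as queryable columns rather than staying buried in the file path.

```python
# Illustrative only: extract year/month/day "partition columns" from a path.
import re

PARTITION_RE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def partition_columns(path):
    """Return the date components embedded in an object path, or None."""
    m = PARTITION_RE.search(path)
    if not m:
        return None
    year, month, day = m.groups()
    return {"year": int(year), "month": int(month), "day": int(day)}


cols = partition_columns("AWSLogs/123/CloudTrail/us-east-1/2019/04/24/log.json.gz")
# → {"year": 2019, "month": 4, "day": 24}
```

With such columns available, a predicate like `WHERE day = 24` can be pushed down to prune whole prefixes before any data is read.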
You will learn how to use custom connectors, and what the pros and cons of the file connectors available for Spark are. We will then present a use case where the connector performs best: utilizing the time metadata of CloudTrail logs to collect them efficiently for hunting cyber attacks. As mentioned above, the solution will be open-sourced as a library before the conference.
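CloudTrail is a natural fit for this approach because its bucket layout embeds the date: `AWSLogs/<account>/CloudTrail/<region>/YYYY/MM/DD/`. The sketch below (function and parameter names are illustrative, not the connector's API) builds the daily prefixes a hunt over a given time window would need to list.

```python
# Hypothetical helper following the documented CloudTrail key layout.
from datetime import date, timedelta


def cloudtrail_prefixes(bucket, account, region, start, end):
    """Yield the daily CloudTrail prefixes for a time window."""
    day = start
    while day <= end:
        yield (f"s3://{bucket}/AWSLogs/{account}/CloudTrail/"
               f"{region}/{day:%Y}/{day:%m}/{day:%d}/")
        day += timedelta(days=1)


hunt = list(cloudtrail_prefixes("sec-logs", "123456789012", "us-east-1",
                                date(2019, 4, 24), date(2019, 4, 25)))
# A two-day hunt lists just two prefixes per region.
```

For threat hunting, this matters: investigations usually target a bounded time window, so the connector can skip months of accumulated logs entirely instead of enumerating them on every streaming micro-batch.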