Data + AI Summit 2022

Scaling Salesforce In-Memory Streaming Analytics Platform for Trillion Events Per Day

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Industry

  • Financial Services

Difficulty

  • Intermediate

Room

  • Moscone South | Level 2 | 205

Duration

  • 35 min

Overview

In general, in-memory pipelines scale well in Spark when the same processing logic is applied to all records. At Salesforce, the major challenge is that we must apply custom logic specific to each Log Record Type (LRT), including applying a different schema while processing each event. To perform this LRT-specific logic, we need a mechanism to collect LRT-specific data in memory so that we can apply the custom logic to each collection. We typically receive about 50K files in S3 every 5 minutes, containing roughly 4 billion log events.

One approach is to create a DataFrame from the 50K files, group the events by LRT, and apply a filter per LRT to create a child DataFrame. A major challenge is that the LRT data distribution is highly skewed, so we need an efficient in-memory partitioning strategy to distribute the data. Moreover, simply filtering the parent DataFrame leaves many child DataFrames with empty partitions because of that skew, which creates too many empty tasks while processing the child DataFrames. We therefore need a partitioning scheme that distributes the data and filters by log type without creating unnecessary empty partitions in the child DataFrames, along with a scheduling algorithm that processes all child DataFrames while using the cluster efficiently.

We have implemented a custom Spark Streaming source that reads SQS notifications and then reads the new files from S3, designed to scale with ingestion volume. This talk will cover how we perform a Spark range partition based on the size distribution of the incoming data and apply schema-specific transformation logic, and will explain the optimizations at each stage of processing that allow us to meet our latency goal.
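To make the skew-aware partitioning idea concrete, here is a minimal sketch (not the speakers' actual implementation) of how partition counts might be assigned per LRT from observed data sizes. The LRT names, sizes, and the `TARGET_PARTITION_MB` threshold are invented for illustration; the point is that heavily skewed types receive many partitions while tiny types receive exactly one, so no child DataFrame is built with empty partitions that would turn into empty tasks.

```python
import math

# Hypothetical per-LRT data sizes (MB) for one 5-minute batch -- note the skew.
lrt_sizes = {"apex": 9000, "login": 450, "uri": 30, "api": 1}

TARGET_PARTITION_MB = 512  # assumed target size for a single Spark partition

def partitions_per_lrt(sizes, target_mb):
    """Assign each LRT a partition count proportional to its data size.

    Skewed LRTs get many partitions; tiny LRTs get exactly one, so a
    child DataFrame repartitioned to this count has no empty partitions.
    """
    return {lrt: max(1, math.ceil(mb / target_mb)) for lrt, mb in sizes.items()}

plan = partitions_per_lrt(lrt_sizes, TARGET_PARTITION_MB)
print(plan)  # {'apex': 18, 'login': 1, 'uri': 1, 'api': 1}
```

In a real Spark job, each count in `plan` could feed a per-LRT `repartitionByRange` (or `repartition`) call on the filtered child DataFrame, rather than inheriting the parent DataFrame's partitioning.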

Session Speakers


Dyno Fu

Lead Software Engineer

Salesforce


Kishore Reddipalli

Sr. Director of Engineering (Unified Intelligence Platform)

Salesforce
