
Improving Apache Spark Structured Streaming Application Processing Time by Configurations, Code Optimizations, and Custom Data Source

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Level 2 | 215

Duration

  • 35 min

Overview

In this session, we'll go over several use cases and describe how we improved our Spark Structured Streaming application's micro-batch time from ~55 to ~30 seconds in several steps.
Our app processes ~700 MB/s of compressed data, has very strict KPIs, and relies on several technologies and frameworks, including Spark 3.1, Kafka, Azure Blob Storage, AKS, and Java 11.

We'll share our work and experience in these areas and go over a few tips for building better Spark Structured Streaming applications.

* Configuration improvements (a configuration sketch follows this list):
- Increase the number of tasks reading data from blob storage (we found that smaller tasks improve total micro-batch time: fewer errors, fewer retries)
- Reduce Spark's data locality wait to 0 (when data is read from an external source such as Azure Blob Storage)
- Use Kryo serialization, which performs better than the default Java serialization in most cases
- Reduce Kafka retention and TTL to the required minimum (to reduce filtering time in the Spark custom data source)
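
A minimal sketch of these settings as Java session configuration; the 32 MB split size and app name are illustrative assumptions, and Kafka retention is a topic-level setting configured on the broker rather than in Spark:

    import org.apache.spark.sql.SparkSession;

    public class StreamingAppConfig {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("streaming-app") // illustrative name
                    // Smaller input splits -> more, smaller read tasks per micro-batch.
                    .config("spark.sql.files.maxPartitionBytes", String.valueOf(32L * 1024 * 1024))
                    // Don't wait for data-local executors; input lives in remote blob storage.
                    .config("spark.locality.wait", "0s")
                    // Kryo generally outperforms the default Java serialization.
                    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .getOrCreate();
            // Kafka retention/TTL (e.g. retention.ms) are set on the topic itself.
        }
    }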

* Code Optimizations (a short sketch follows this list):
- In the Spark custom data source, split the data equally between files (to avoid skew)
- Avoid unnecessary actions
- Cache datasets that are reused by multiple actions
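
The splitting logic is specific to our custom data source, but the caching and "avoid unnecessary actions" tips look roughly like this inside a per-batch handler; the filter condition, column name, and output paths below are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.storage.StorageLevel;

    public class BatchProcessor {
        // Hypothetical micro-batch logic: the batch feeds two outputs,
        // so it is persisted once instead of being recomputed per action.
        static void processBatch(Dataset<Row> batch, long batchId) {
            Dataset<Row> enriched = batch.filter("bytes > 0"); // hypothetical condition
            enriched.persist(StorageLevel.MEMORY_AND_DISK());

            // Each write is an action; without the persist above, the full
            // lineage would run twice. Debug-only actions such as count()
            // or show() are likewise avoided in production.
            enriched.write().mode("append").parquet("/out/raw");             // hypothetical path
            enriched.groupBy("host").count()
                    .write().mode("append").parquet("/out/per-host-counts"); // hypothetical path

            enriched.unpersist();
        }
    }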

* Infrastructure (a JVM-settings sketch follows this list):
- Work with the latest frameworks:
- Spark version upgraded to 3.1
- Working with Java 11 and G1GC
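
On the JVM side, G1GC can be requested explicitly through the executor Java options; a minimal sketch, noting that Java 11 already defaults to G1GC and that the pause-time target is an assumed example value:

    import org.apache.spark.sql.SparkSession;

    public class JvmSettings {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    // Make the collector choice explicit on executors
                    // (Java 11's default is already G1GC).
                    .config("spark.executor.extraJavaOptions",
                            "-XX:+UseG1GC -XX:MaxGCPauseMillis=200") // pause target is illustrative
                    .getOrCreate();
            // The driver JVM is already running at this point, so driver flags
            // must be passed at launch, e.g. via spark-submit --driver-java-options.
        }
    }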

Session Speakers

Nir Dror

Principal Performance Engineer

Akamai

Kineret Raviv

Principal Software Developer

Akamai
