Data + AI Summit 2023
JUNE 26-29, 2023
SAN FRANCISCO + VIRTUAL

Improving Apache Spark Structured Streaming Application Processing Time by Configurations, Code Optimizations, and Custom Data Source

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Level 2 | 215

Duration

  • 35 min

Overview

In this session, we'll go over several use cases and describe how we reduced our Spark Structured Streaming application's micro-batch time from ~55 to ~30 seconds in several steps.
Our application processes ~700 MB/s of compressed data, has very strict KPIs, and uses several technologies and frameworks, including Spark 3.1, Kafka, Azure Blob Storage, AKS, and Java 11.

We'll share our work and experience in these areas and go over a few tips for building better Spark Structured Streaming applications.

* Configuration improvements:
- Increase the number of tasks reading data from blob storage (we found that smaller tasks improve total micro-batch time: fewer errors and fewer retries)
- Reduce Spark's data locality wait to 0 (when data is read from an external source such as Azure Blob Storage)
- Use Kryo serialization, which performs better than the default Java serialization in most cases
- Reduce Kafka retention and TTL to the required minimum (to reduce filtering time in the Spark custom data source)
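The Spark-side changes above can be sketched as spark-submit settings. The property names are real Spark configuration keys, but the values are illustrative assumptions to tune for your own workload, not the ones used in the talk:

```
# Illustrative spark-submit configuration; values are assumptions, not the
# production settings from the talk.

# Smaller read tasks: cap how many bytes each file-source task reads (default 128 MB)
--conf spark.sql.files.maxPartitionBytes=33554432

# Don't wait for data locality when the source is remote blob storage (default 3s)
--conf spark.locality.wait=0s

# Kryo serialization instead of the default Java serialization
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

The Kafka retention change is applied on the Kafka side (topic-level `retention.ms`) rather than through Spark configuration.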

* Code Optimizations:
- In the Spark custom data source, split the data equally between files (to avoid skew)
- Avoid unnecessary actions
- Caching
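The first optimization, splitting data equally between files, can be sketched independently of Spark. The round-robin partitioner below is a hypothetical illustration of the idea, not the actual custom data source code from the talk:

```python
# Hypothetical sketch of even data splitting to avoid skew: assign records to
# num_files buckets round-robin so every file gets (almost) the same count,
# and no single read task ends up with a disproportionate share.

def split_evenly(records, num_files):
    """Distribute records across num_files buckets, round-robin."""
    buckets = [[] for _ in range(num_files)]
    for i, record in enumerate(records):
        buckets[i % num_files].append(record)
    return buckets

files = split_evenly(list(range(10)), 3)
# Bucket sizes differ by at most one record: [4, 3, 3]
```

With even buckets, the slowest read task finishes roughly when the others do, which is what keeps the micro-batch time down.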

* Infrastructure:
- Work with the latest frameworks:
- Spark upgraded to version 3.1
- Java 11 with G1GC
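The Java 11 / G1GC change amounts to a few JVM options passed to the Spark executors. The flags below are standard HotSpot options, but the pause-time target is an assumption to tune, not the value used in the talk (note that G1GC is already the default collector on Java 11, so the flag mainly makes the choice explicit):

```
# Illustrative executor JVM options for Java 11 with G1GC; the pause target
# is an assumption, not the production setting from the talk.
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```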

Session Speakers


Nir Dror

Principal Performance Engineer

Akamai


Kineret Raviv

Principal Software Developer

Akamai
