Data + AI Summit 2022

Improving Apache Spark Structured Streaming Application Processing Time by Configurations, Code Optimizations, and Custom Data Source

On Demand

  • Session
  • Hybrid
  • Data Engineering
  • Intermediate
  • Moscone South | Level 2 | 215
  • 35 min


In this session, we'll go over several use cases and describe how we reduced our Spark Structured Streaming application's micro-batch time from ~55 seconds to ~30 seconds in several steps.
Our application processes ~700 MB/s of compressed data under very strict KPIs, and it is built on several technologies and frameworks: Spark 3.1, Kafka, Azure Blob Storage, AKS, and Java 11.

We'll share our work and experience in those fields and go over a few tips for building better Spark Structured Streaming applications.

* Configuration improvements:
- Increase the number of tasks reading data from blob storage (we found that smaller tasks improve total micro-batch time: fewer errors and fewer retries)
- Reduce Spark's data locality wait to 0 (when data is read from an external source such as Azure Blob Storage)
- Kryo serialization: better performance than the default Java serialization in most cases
- Reduce Kafka retention and TTL to the required minimum (to reduce filtering time in the Spark custom data source)
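A minimal sketch of how the Spark-side settings above might be applied; the property values are illustrative assumptions for demonstration, not the ones used in the talk.

```java
// Illustrative Spark 3.1 configuration sketch. The concrete values
// (32 MB partitions, app name) are assumptions, not production settings.
import org.apache.spark.sql.SparkSession;

public class StreamingAppConfig {
    public static SparkSession build() {
        return SparkSession.builder()
                .appName("structured-streaming-app")
                // More, smaller read tasks: cap the bytes packed into a single
                // file-scan partition (the default is 128 MB).
                .config("spark.sql.files.maxPartitionBytes", "33554432") // 32 MB
                // Don't wait for data-local executors; the source is remote
                // (Azure Blob Storage), so data locality never applies.
                .config("spark.locality.wait", "0")
                // Kryo is usually faster and more compact than Java serialization.
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();
    }
}
```

Note that Kafka retention is configured on the topic itself (e.g. `retention.ms`), not through Spark.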

* Code optimizations:
- In the Spark custom data source, split the data equally between files (to avoid skew)
- Avoid unnecessary actions
- Cache datasets that are reused by multiple actions
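The last two points go together: every action re-runs the full lineage unless the intermediate result is cached. A sketch of the pattern, with hypothetical dataset names and output paths:

```java
// Sketch: avoiding redundant recomputation by caching a Dataset that is
// consumed by more than one action. Names and paths are hypothetical.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

public class CachingExample {
    static void process(Dataset<Row> parsed) {
        // Persist once before branching into multiple actions; otherwise
        // each action below would re-read and re-parse the source data.
        parsed.persist(StorageLevel.MEMORY_AND_DISK());

        Dataset<Row> errors = parsed.filter("status = 'ERROR'");
        Dataset<Row> ok = parsed.filter("status = 'OK'");

        errors.write().parquet("/out/errors"); // action 1: reads the cache
        ok.write().parquet("/out/ok");         // action 2: reads the cache

        parsed.unpersist(); // release executor memory when done
    }
}
```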

* Infrastructure:
- Work with the latest frameworks:
- Spark upgraded to version 3.1
- Java 11 with the G1 garbage collector (G1GC)
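One way to make the GC choice explicit is through Spark's extra JVM options; a sketch (Java 11 already defaults to G1GC, so this mainly documents the intent):

```java
// Illustrative sketch: pinning G1GC on driver and executors via Spark conf.
import org.apache.spark.sql.SparkSession;

public class GcConfig {
    public static SparkSession build() {
        return SparkSession.builder()
                // G1GC is the Java 11 default; stating it guards against
                // environments that override the default collector.
                .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
                .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
                .getOrCreate();
    }
}
```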

Session Speakers

Nir Dror

Principal Performance Engineer


Kineret Raviv

Principal Software Developer

