Improving Apache Spark Structured Streaming Application Processing Time by Configurations, Code Optimizations, and Custom Data Source
On Demand
Type
- Session
Format
- Hybrid
Track
- Data Engineering
Difficulty
- Intermediate
Room
- Moscone South | Level 2 | 215
Duration
- 35 min
Overview
In this session, we'll go over several use cases and describe how we improved our Spark Structured Streaming application's micro-batch time from ~55 seconds to ~30 seconds in several steps.
Our app processes ~700 MB/s of compressed data, has very strict KPIs, and uses several technologies and frameworks, including Spark 3.1, Kafka, Azure Blob Storage, AKS, and Java 11.
We'll share our work and experience in these areas and go over a few tips for building better Spark Structured Streaming applications.
* Configuration improvements (a configuration sketch follows this list):
- Increase the number of tasks that read data from blob storage (we found that smaller tasks improve total micro-batch time: fewer errors and fewer retries)
- Reduce Spark's data locality wait to 0 (when data is read from an external source such as Azure Blob Storage)
- Use Kryo serialization, which performs better than the default Java serialization in most cases
- Reduce Kafka retention and TTL to the required minimum (to reduce filtering time in the Spark custom data source)
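As a rough illustration of these settings, here is a minimal SparkSession setup. The property names are standard Spark configuration keys, but the specific values (such as the 32 MB split size) are assumptions for illustration, not the values used in our app.

```java
import org.apache.spark.sql.SparkSession;

public class StreamingConfigSketch {
    public static void main(String[] args) {
        // Illustrative values only; tune per workload.
        SparkSession spark = SparkSession.builder()
                .appName("structured-streaming-tuning-sketch")
                // Smaller input splits -> more, shorter read tasks against blob storage
                // (Spark's default is 128 MB; 32 MB here is an assumed example value).
                .config("spark.sql.files.maxPartitionBytes", "33554432")
                // Don't wait for data-local executors when the source is remote object storage.
                .config("spark.locality.wait", "0s")
                // Kryo is usually faster and more compact than default Java serialization.
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // Note: Kafka retention/TTL is configured on the Kafka side (e.g. per-topic
        // retention settings), not through Spark; keeping it at the required minimum
        // shortens the range the custom data source has to filter.
    }
}
```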
* Code Optimizations (see the sketch after this list):
- In the Spark custom data source, split the data evenly between files (to avoid skew)
- Avoid unnecessary actions
- Cache data that is reused by multiple actions
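To make the last two bullets concrete, here is a minimal sketch of a foreachBatch-style handler. The column name, output paths, and sinks are hypothetical and only illustrate the caching and avoid-extra-actions pattern, not our actual pipeline.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// Hypothetical micro-batch handler; names and paths are illustrative.
public class MicroBatchHandler {
    public static void process(Dataset<Row> batch, long batchId) {
        // Cache once when the same micro-batch feeds more than one sink,
        // so the source is not re-read and re-filtered for every action.
        batch.persist(StorageLevel.MEMORY_AND_DISK());
        try {
            // Each write is an action; without the persist above, each one
            // would recompute the batch from the custom data source.
            batch.write().format("parquet").mode("append").save("/out/raw");        // illustrative path
            batch.filter("severity = 'ERROR'")                                      // hypothetical column
                 .write().format("parquet").mode("append").save("/out/errors");     // illustrative path

            // Avoid extra actions such as batch.count() just for logging;
            // rely on the streaming query progress metrics instead.
        } finally {
            batch.unpersist();
        }
    }
}
```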
* Infrastructure (see the GC settings sketch after this list):
- Work with the latest frameworks:
- Spark version upgraded to 3.1
- Java 11 with G1GC
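A minimal sketch of the GC-related settings, assuming the standard spark.executor.extraJavaOptions property; the pause-time target is an illustrative value, and on Java 11 G1GC is already the default collector.

```java
import org.apache.spark.sql.SparkSession;

public class GcConfigSketch {
    public static void main(String[] args) {
        // Setting G1GC explicitly (with an assumed pause-time target) documents the intent.
        // Executor JVM options set here take effect because executors start after the
        // SparkSession is created; driver JVM options would need to be passed at
        // spark-submit time instead.
        SparkSession spark = SparkSession.builder()
                .appName("g1gc-sketch")
                .config("spark.executor.extraJavaOptions",
                        "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")
                .getOrCreate();
    }
}
```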