Session

Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data

Overview

Experience: In Person
Type: Lightning Talk
Track: Data Engineering and Streaming
Industry: Enterprise Technology, Retail and CPG - Food
Technologies: Apache Spark, Llama, Mosaic AI
Skill Level: Intermediate
Duration: 20 min

Testing Spark jobs locally is often difficult because suitable datasets are hard to come by, especially under tight timelines. As a result, jobs that work in development clusters can fail in production, and jobs that run locally can hit issues in staging clusters, often because of inadequate documentation or checks. In this session, we’ll discuss how Generative AI can address these challenges by creating custom synthetic datasets for local testing. Adding variations and sampling to that data makes it possible to build a testing framework that generates realistic data for performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance and recognize scalability issues while keeping costs low. The methodology also supports better deployment practices and makes Spark jobs more reliable across environments.
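
As a rough illustration of the "variations and sampling" idea, the sketch below expands a few seed rows into a larger synthetic dataset with jittered values, a skewed key and injected nulls. It is not the speaker's framework: the seed rows, the column names and the `make_synthetic_rows` helper are all hypothetical, and the seed rows are hard-coded here where the session's approach would presumably obtain them from a generative model.

```python
import random

# Hypothetical seed rows. In the workflow the abstract describes, rows like
# these might come from a generative model prompted with the table schema
# (an assumption; no model is called in this sketch).
SEED_ROWS = [
    {"order_id": 1, "store": "north", "amount": 12.50},
    {"order_id": 2, "store": "south", "amount": 7.25},
    {"order_id": 3, "store": "north", "amount": 101.00},
]


def make_synthetic_rows(seed_rows, n, null_rate=0.1, hot_key="north",
                        hot_share=0.8, seed=42):
    """Expand seed rows into n rows with jitter, key skew and nulls."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    rows = []
    for i in range(n):
        row = dict(rng.choice(seed_rows))  # sample a seed row with replacement
        row["order_id"] = i + 1            # keep ids unique
        # Variation 1: jitter the numeric value by up to +/-20%.
        row["amount"] = round(row["amount"] * rng.uniform(0.8, 1.2), 2)
        # Variation 2: force the hot key on roughly hot_share of rows, so at
        # least that share lands on one key (useful for skewed-join tests).
        if rng.random() < hot_share:
            row["store"] = hot_key
        # Variation 3: inject nulls to exercise null handling.
        if rng.random() < null_rate:
            row["amount"] = None
        rows.append(row)
    return rows


rows = make_synthetic_rows(SEED_ROWS, n=1000)
```

Rows produced this way could then be loaded into a local Spark session, for example through `spark.createDataFrame` with an explicit schema, and `n` raised to approximate load-test volumes.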

Session Speakers

Satej Kumar Sahu

Principal Data Engineer
Independent Community