Leveraging GenAI for Synthetic Data Generation to Improve Spark Testing and Performance in Big Data
Overview
Experience | In Person |
---|---|
Type | Lightning Talk |
Track | Data Engineering and Streaming |
Industry | Enterprise Technology, Retail and CPG - Food |
Technologies | Apache Spark, Llama, Mosaic AI |
Skill Level | Intermediate |
Duration | 20 min |
Testing Spark jobs locally is often difficult because suitable datasets are not available, especially under tight timelines. As a result, jobs that work in development clusters fail in production, or jobs that run locally hit issues in staging clusters because of inadequate documentation or checks. In this session, we’ll discuss how Generative AI can address these challenges by creating custom synthetic datasets for local testing. Adding variations and sampling to the generated data makes it possible to build a testing framework that produces realistic data for performance and load testing. We’ll show how this approach helps identify performance bottlenecks early, optimize job performance, and recognize scalability issues while keeping costs low. This methodology fosters better deployment practices and makes Spark jobs more reliable across environments.
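As a rough illustration of the "variations and sampling" idea in the abstract, the sketch below generates synthetic rows from a small column spec and then draws a smaller sample for quick local runs. It is not the speaker's framework: the spec format, the column names, and the helpers `generate_rows` and `sample_rows` are all hypothetical, and the Generative AI step is stubbed out with a hard-coded spec that a model might plausibly propose from a schema.

```python
import random
from collections import Counter

# Hypothetical column spec (an assumption, not from the talk). In a GenAI
# workflow a model might propose this from a schema or a data sample; it is
# hard-coded here so the sketch runs offline.
SPEC = {
    "order_id": {"kind": "sequence"},
    "store_id": {
        "kind": "skewed_choice",
        "values": ["s1", "s2", "s3", "s4"],
        "hot_fraction": 0.7,  # share of rows sent to the first ("hot") key
    },
    "amount": {"kind": "uniform", "low": 1.0, "high": 500.0, "null_rate": 0.05},
}


def generate_rows(spec, n, seed=0):
    """Generate n synthetic rows (dicts) from the spec, reproducibly."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = {}
        for name, col in spec.items():
            # Variation 1: inject nulls at the configured rate.
            if rng.random() < col.get("null_rate", 0.0):
                row[name] = None
            elif col["kind"] == "sequence":
                row[name] = i
            elif col["kind"] == "skewed_choice":
                # Variation 2: key skew, useful for surfacing slow joins
                # and uneven partitions.
                values = col["values"]
                if rng.random() < col["hot_fraction"]:
                    row[name] = values[0]
                else:
                    row[name] = rng.choice(values[1:])
            elif col["kind"] == "uniform":
                row[name] = round(rng.uniform(col["low"], col["high"]), 2)
            else:
                raise ValueError(f"unknown column kind: {col['kind']}")
        rows.append(row)
    return rows


def sample_rows(rows, fraction, seed=0):
    """Draw a reproducible sample without replacement for fast local tests."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


rows = generate_rows(SPEC, 10_000, seed=42)    # larger set for load-style tests
small = sample_rows(rows, 0.01, seed=42)       # small set for unit tests
key_counts = Counter(r["store_id"] for r in rows)
```

The fixed seed keeps test runs repeatable, and the skewed key and null rate are the kind of variations that tend to expose problems a uniform dataset would hide. The resulting list of dicts could then be handed to a local Spark session, for example through `spark.createDataFrame` with an explicit schema.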
Session Speakers
Satej Kumar Sahu
Principal Data Engineer
Independent Community