Which Data Broke My Code? Inspecting Spark Transformations

Apache Spark is rapidly becoming the dominant big data processing framework for data engineering and data science applications. The simplicity of programming big data applications in Spark and the speed gained from in-memory processing are key factors behind this popularity. However, tools that help developers build and debug Spark applications have not kept pace.

For instance, a Spark application can perform multiple transformations on data in a distributed environment of hundreds of executors and thousands of tasks. If this application fails because of patterns in the data that the code does not handle, the developer is reduced to using archaic tools like print statements and log trawling to iteratively narrow down the root cause.

This talk describes an alternative approach that presents the user with a familiar paradigm for solving this problem. If a developer can step through application code and set watchpoints so that she can inspect data when a desired condition is met, she can then easily identify the section of code that cannot handle the data. We describe the framework needed to make this user experience real and the challenges that must be overcome in the process.
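
To make the watchpoint idea concrete, here is a minimal sketch in plain Python. It is not the framework described in the talk; the names `Inspected`, `with_watchpoint`, and `parse_amount` are hypothetical and exist only for illustration. The sketch wraps a per-record transformation so that each output carries the original input, a flag saying whether a user-supplied watch condition matched, and any exception raised, instead of letting one bad record fail the whole job.

```python
from typing import Any, Callable, NamedTuple, Optional


class Inspected(NamedTuple):
    """Hypothetical result wrapper: one per input record."""
    record: Any            # the original input record
    value: Any             # transformation output, or None if it raised
    watch_hit: bool        # True if the watch condition matched the input
    error: Optional[str]   # repr of the exception, if the transformation raised


def with_watchpoint(
    fn: Callable[[Any], Any],
    condition: Callable[[Any], bool],
) -> Callable[[Any], Inspected]:
    """Wrap fn so every call reports watch hits and failures as data."""
    def guarded(record: Any) -> Inspected:
        hit = bool(condition(record))
        try:
            return Inspected(record, fn(record), hit, None)
        except Exception as exc:  # capture the failure alongside the bad record
            return Inspected(record, None, hit, repr(exc))
    return guarded


def parse_amount(line: str) -> float:
    # Example transformation: assumes "key,number" and fails on anything else.
    return float(line.split(",")[1])


records = ["a,1.5", "b,oops", "c", "d,2.0"]

# Watch condition (an assumption for this example): flag rows that do not
# contain exactly one comma.
guarded = with_watchpoint(parse_amount, lambda line: line.count(",") != 1)

inspected = [guarded(r) for r in records]
suspects = [i.record for i in inspected if i.watch_hit or i.error]
print(suspects)
```

The wrapper returns its findings as ordinary records rather than appending to a shared list because, in Spark, functions passed to transformations run on executors, and changes they make to driver-side variables are not visible to the driver. Under that assumption, the same `guarded` function could be passed to a per-record transformation such as `map`, followed by a filter that keeps only the flagged rows.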

Session hashtag: #DevSAIS12

About Vinod Nair

Vinod Nair leads product management at Pepperdata. He brings more than 20 years of experience in engineering and product management to the job, with a special interest in distributed systems and Hadoop. He has worked in software for telecommunications, financial management for small business, and big data. Vinod's approach to product management is deeply influenced by his success in applying Lean Startup principles and rapid iteration to product design and development.