Learn to Efficiently Test ETL Pipelines
- Data Engineering
- Moscone South | Level 2 | 205
- 35 min
I’ve spent the majority of my career as an application engineer in industries like banking and medicine, where early detection of defects is critical. It wasn’t uncommon to see hundreds of unit tests on a project, each protecting a feature or guarding against a regression. Every test ran in less than a second, and the entire suite took at most 10 minutes, so adding more tests was cheap.
When I switched to data engineering, I was surprised at how few unit tests existed for our ETL pipeline. The tests we did have were large, took more than a minute each to run, and covered only the happy path. Because the tests were so slow, testing edge cases wasn’t worth the cost, and previously discovered bugs weren’t captured in tests, so they were easy to reintroduce.
There were good reasons for this. Duplicate coverage in much of application engineering is cheap, usually costing only a few milliseconds per test. In contrast, creating fast tests for Spark and big data comes with a unique set of real challenges, and a test that covers more code than necessary sees its run time increase drastically.
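As a minimal sketch of that contrast (a hypothetical example, not code from the talk): row-level transformation logic that is extracted into a plain Python function can be verified in milliseconds, whereas exercising the same logic through a full Spark job pays the cost of starting a SparkSession on every run. The function name and input format here are assumptions for illustration.

```python
# Hypothetical row-level logic pulled out of a Spark pipeline so it can be
# unit tested without starting a SparkSession.
def normalize_amount(raw: str) -> float:
    """Parse a currency string like '$1,234.50' into a float."""
    return float(raw.replace("$", "").replace(",", ""))

# Millisecond-level tests can now cover the logic and its edge cases directly.
assert normalize_amount("$1,234.50") == 1234.50
assert normalize_amount("0") == 0.0
```

In the real pipeline this function might be applied per row (for example, via a UDF), leaving only a small number of slower, end-to-end tests that need Spark itself.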
When deciding how much code their tests should cover, application engineers often reach for the testing pyramid. It’s great in theory, but in practice it offers no guidance on determining which code actually needs to run to prove a feature works, so it’s hard to ensure you’re only testing what’s necessary.
This talk is a story, with code samples in Python, about how I started using a heuristic to ensure my tests ran no more code than was needed to prove that the feature I was building worked. With it, I was able to reduce duplicate coverage and efficiently build in coverage for edge cases and unexpected bugs. Others may find it useful too.