ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines
- Data Engineering
Creating Spark data pipelines at scale is often repetitive and error-prone: there are many small configurations to adjust on sources and sinks, enforcing naming conventions is difficult, reusing components is not always possible, and unit testing is rarely implemented. At Gousto, one of the leading recipe box companies in the UK, we’ve adopted a builder pattern to create a set of building blocks that overcome those difficulties. We have created a number of simple-to-maintain abstractions over PySpark: Readers, Processors and Writers. They are standardised ways to perform Spark operations and can be reused in both streaming pipelines and batch jobs, while still retaining the familiar code syntax we’re used to with PySpark.
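The Reader/Processor/Writer split can be sketched in plain Python. This is an illustrative outline, not ÀLaSpark's actual API: all class and method names here are assumptions, and in a real implementation each component would wrap PySpark DataFrame operations rather than plain Python values.

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Produces a DataFrame from a source (CSV, Kafka, Delta, ...)."""
    @abstractmethod
    def read(self):
        ...

class Processor(ABC):
    """Transforms a DataFrame; processors can be chained freely."""
    @abstractmethod
    def process(self, df):
        ...

class Writer(ABC):
    """Persists a DataFrame to a sink (Delta, Parquet, ...)."""
    @abstractmethod
    def write(self, df):
        ...

class Pipeline:
    """Chains one reader, any number of processors, and one writer."""
    def __init__(self, reader, processors, writer):
        self.reader = reader
        self.processors = processors
        self.writer = writer

    def run(self):
        df = self.reader.read()
        for processor in self.processors:
            df = processor.process(df)
        return self.writer.write(df)
```

Because every component implements the same small interface, any reader can be paired with any processors and writer, and each block can be unit tested in isolation with a stub DataFrame.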
For example: Suppose you want to ingest CSV data as a batch job, apply some standard processing, such as adding the source filename and the processing timestamp, and then save it to Delta. This pipeline is now as simple as chaining together a set of blocks: CsvBatchReader->AddSourceFilenameProcessor->AddTimestampProcessor->DeltaBatchWriter. All the Spark configuration is already done in each component; to create the job you just need to string them together!
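As a sketch of how that chain might fit together, here is a stdlib-only mock using lists of dicts as a stand-in for Spark DataFrames. Only the four block names come from the pipeline above; everything else (constructor parameters, the in-memory CSV source, the list-backed "Delta" sink) is an assumption for illustration. A real `CsvBatchReader` would wrap `spark.read.csv`, and a real `DeltaBatchWriter` would write with the Delta format.

```python
import csv
import io
from datetime import datetime, timezone

class CsvBatchReader:
    """Reads CSV rows; here from an in-memory string instead of spark.read.csv."""
    def __init__(self, source_path, text):
        self.source_path = source_path
        self.text = text

    def read(self):
        return list(csv.DictReader(io.StringIO(self.text)))

class AddSourceFilenameProcessor:
    """Adds the source filename to each row (Spark would use input_file_name())."""
    def __init__(self, source_path):
        self.source_path = source_path

    def process(self, rows):
        return [{**row, "source_filename": self.source_path} for row in rows]

class AddTimestampProcessor:
    """Stamps each row with the processing time (Spark: current_timestamp())."""
    def process(self, rows):
        now = datetime.now(timezone.utc).isoformat()
        return [{**row, "processed_at": now} for row in rows]

class DeltaBatchWriter:
    """Stand-in sink: collects rows in a list instead of writing Delta files."""
    def __init__(self):
        self.table = []

    def write(self, rows):
        self.table.extend(rows)
        return self.table

# Chain the blocks: CsvBatchReader -> processors -> DeltaBatchWriter.
reader = CsvBatchReader("landing/orders.csv", "order_id,item\n1,pasta\n2,pesto\n")
rows = reader.read()
for processor in (AddSourceFilenameProcessor(reader.source_path),
                  AddTimestampProcessor()):
    rows = processor.process(rows)
result = DeltaBatchWriter().write(rows)
```

The job itself contains no Spark configuration at all; swapping the sink from Delta to Parquet, or the source from CSV to JSON, means swapping one block rather than rewriting the job.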
Since adopting this approach, building new pipelines has become much faster, with development and testing time dropping from weeks to days.
In this session we will take a deep dive into the design patterns we followed and some unique approaches we’ve taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will even share some example Python code and snippets to help you build your own.