Data + AI Summit 2022

ÀLaSpark: Gousto's Recipe for Building Scalable PySpark Pipelines

On Demand

Type

  • Session

Format

  • Virtual

Track

  • Data Engineering

Difficulty

  • Intermediate

Duration

  • 0 min

Overview

Creating Spark data pipelines at scale is often repetitive and error-prone: sources and sinks need many small configuration adjustments, naming conventions are difficult to enforce, components cannot always be reused, and unit testing is rarely implemented. At Gousto, one of the leading recipe box companies in the UK, we’ve adopted a builder pattern to create a set of building blocks that overcome those difficulties. We have created a number of simple-to-maintain abstractions over PySpark: Readers, Processors and Writers. They are standardised ways to perform Spark operations and can be reused in both streaming pipelines and batch jobs, while retaining the familiar PySpark syntax.

For example, suppose you want to ingest CSV data as a batch job, apply some standard processing, such as adding the source filename and the processing timestamp, and then save it to Delta. This pipeline is now as simple as chaining together a set of blocks: CsvBatchReader -> AddSourceFilenameProcessor -> AddTimestampProcessor -> DeltaBatchWriter. All the Spark configuration is already done in each component; to create the job you just need to string them together!
Since we adopted this approach, building new pipelines has become much faster, with development and testing time dropping from weeks to days.
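
The chaining described above can be sketched roughly as follows. This is a minimal illustration, not Gousto's actual code. The four block names come from the abstract, but the base classes, the `PipelineBuilder` and `Pipeline` helpers, every method name, the default column names and the constructor arguments are assumptions made for this example. The `delta` format also assumes Delta Lake is available on the cluster.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class Reader(ABC):
    @abstractmethod
    def read(self, spark: Any) -> Any:
        """Return a DataFrame loaded through the given SparkSession."""


class Processor(ABC):
    @abstractmethod
    def process(self, df: Any) -> Any:
        """Return a transformed DataFrame."""


class Writer(ABC):
    @abstractmethod
    def write(self, df: Any) -> None:
        """Persist the DataFrame to a sink."""


class CsvBatchReader(Reader):
    def __init__(self, path: str, header: bool = True) -> None:
        self.path, self.header = path, header

    def read(self, spark: Any) -> Any:
        # All CSV-specific configuration lives inside the block.
        return spark.read.option("header", self.header).csv(self.path)


class AddSourceFilenameProcessor(Processor):
    def __init__(self, column: str = "source_filename") -> None:  # hypothetical default
        self.column = column

    def process(self, df: Any) -> Any:
        from pyspark.sql import functions as F  # imported lazily; needs PySpark at run time
        return df.withColumn(self.column, F.input_file_name())


class AddTimestampProcessor(Processor):
    def __init__(self, column: str = "processed_at") -> None:  # hypothetical default
        self.column = column

    def process(self, df: Any) -> Any:
        from pyspark.sql import functions as F
        return df.withColumn(self.column, F.current_timestamp())


class DeltaBatchWriter(Writer):
    def __init__(self, path: str, mode: str = "append") -> None:
        self.path, self.mode = path, mode

    def write(self, df: Any) -> None:
        df.write.format("delta").mode(self.mode).save(self.path)


class Pipeline:
    def __init__(self, reader: Reader, processors: List[Processor], writer: Writer) -> None:
        self.reader, self.processors, self.writer = reader, processors, writer

    def run(self, spark: Any) -> None:
        df = self.reader.read(spark)
        for processor in self.processors:  # applied in the order they were added
            df = processor.process(df)
        self.writer.write(df)


class PipelineBuilder:
    """Hypothetical fluent builder that strings the blocks together."""

    def __init__(self) -> None:
        self._reader: Optional[Reader] = None
        self._processors: List[Processor] = []
        self._writer: Optional[Writer] = None

    def read_with(self, reader: Reader) -> "PipelineBuilder":
        self._reader = reader
        return self

    def then(self, processor: Processor) -> "PipelineBuilder":
        self._processors.append(processor)
        return self

    def write_with(self, writer: Writer) -> "PipelineBuilder":
        self._writer = writer
        return self

    def build(self) -> Pipeline:
        if self._reader is None or self._writer is None:
            raise ValueError("a pipeline needs both a reader and a writer")
        return Pipeline(self._reader, list(self._processors), self._writer)


def build_csv_to_delta(source_path: str, target_path: str) -> Pipeline:
    """Assemble the example pipeline from the abstract (paths are placeholders)."""
    return (
        PipelineBuilder()
        .read_with(CsvBatchReader(source_path))
        .then(AddSourceFilenameProcessor())
        .then(AddTimestampProcessor())
        .write_with(DeltaBatchWriter(target_path))
        .build()
    )
```

With an active `SparkSession` named `spark`, `build_csv_to_delta(source, target).run(spark)` would execute the job. Because each block depends only on a small interface, a processor can be unit tested against a tiny DataFrame, or the whole chain can be exercised with stub readers and writers.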

In this session we will dive deep into the design patterns we followed, explain some unique approaches we’ve taken to structuring pipelines, and show a live demo of implementing a new Spark streaming pipeline in Databricks from scratch. We will also share some example Python code and snippets to help you build your own.

Session Speakers

Daniel Baron

Data Engineer

Gousto

Elena Martina

Data Engineer

Gousto
