Gokul Prabagaren

Master Software Engineer, Capital One

Seasoned Spark professional developing and managing Spark workloads at scale for Capital One.

Past sessions

Summit 2020 Filtering vs Enriching Data in Apache Spark

June 24, 2020 05:00 PM PT

Apache Spark provides many options for joining data sets. This talk compares two approaches: enriching the data (left outer join) and filtering the data (inner join). It shows how both approaches arrive at the same result and highlights how the enriching approach helped us at Capital One. We have been heavy users of Spark at Capital One since its early days. This talk details how we evolved from filtering to enriching the data for credit card transactions, and what benefits we gained from the enriching approach. As a financial institution, we are bound by regulation and need to be able to back-trace every credit card transaction processed through our engine. We will explain how the enriching approach met this requirement, and how other financial institutions can use it in their Spark workloads to back-trace all the data they process. We will also cover our experience running the filtering approach in production, the issues it had, and why we moved to the enriching approach. Attendees will take away enough detail on the enriching and filtering options to decide which fits their use cases. The talk is most relevant to users who want to trace their data set processing in Apache Spark at a finer granularity.
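The contrast between the two approaches can be illustrated with a minimal sketch in plain Python, without Spark. It is not the implementation presented in the talk: the transaction fields, the account lookup, and the `eligible` and `reason` columns are hypothetical names chosen for illustration. In PySpark, the two functions would correspond roughly to `DataFrame.join` with `how="inner"` and `how="left"`.

```python
# Hypothetical sample data; field names and values are illustrative only.
transactions = [
    {"txn_id": 1, "account_id": "A", "amount": 120.0},
    {"txn_id": 2, "account_id": "B", "amount": 45.5},
    {"txn_id": 3, "account_id": "C", "amount": 300.0},
]
eligible_accounts = {"A": {"tier": "gold"}, "C": {"tier": "silver"}}


def filter_join(txns, accounts):
    """Inner-join style: transactions without a matching account are dropped."""
    return [
        {**t, "tier": accounts[t["account_id"]]["tier"]}
        for t in txns
        if t["account_id"] in accounts
    ]


def enrich_join(txns, accounts):
    """Left-outer-join style: every transaction is kept and annotated."""
    enriched = []
    for t in txns:
        match = accounts.get(t["account_id"])
        enriched.append({
            **t,
            "tier": match["tier"] if match else None,
            "eligible": match is not None,
            # Hypothetical audit column recording why a row did not qualify.
            "reason": None if match else "no matching account",
        })
    return enriched


filtered = filter_join(transactions, eligible_accounts)
enriched = enrich_join(transactions, eligible_accounts)

# Selecting the eligible rows from the enriched set reproduces the filtered result.
final_from_enriched = [
    {k: r[k] for k in ("txn_id", "account_id", "amount", "tier")}
    for r in enriched
    if r["eligible"]
]

# Rows that filtering would have silently dropped remain traceable.
dropped = [(r["txn_id"], r["reason"]) for r in enriched if not r["eligible"]]
print(dropped)
```

In this sketch the filtering path loses transaction 2 entirely, while the enriching path keeps it alongside the reason it did not qualify, which is the kind of record a back-tracing requirement would need.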