Data + AI Summit 2022
Watch on demand

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Industry

  • Financial Services

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 155

Duration

  • 35 min

Overview

Data is the key component of any analytics, AI, or ML platform. Organizations are unlikely to succeed without a platform that can source, transform, and quality-check data and present it in a reportable format that drives actionable insights.
This session focuses on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build the data store (Redshift) at a level that AI/ML programs can easily consume. It does so by combining AWS services with open source software (Spark) and the enterprise edition of Hydrograph (a UI-based ETL tool with Spark as its backend).
The presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.
We have been running three types of pipelines for 6+ years, with 400+ nightly batch jobs, for < $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.

Session Speakers

Shariff Mohammed

Distinguished Data Engineer

Capital One
