Anup is a Senior Software Engineer at YipitData, a fast-growing fin-tech startup that answers investor questions through alternative data analysis and research. At YipitData, Anup has helped migrate existing data infrastructure to the Databricks platform, managed tooling for ETL workflows through Apache Airflow, and led various projects focused on application visibility and software reliability. Previously, Anup worked in investment banking at Citigroup and studied at Indiana University. In his free time, Anup enjoys swimming and is interested in data privacy issues and regulation.
June 24, 2020 05:00 PM PT
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients. Today, YipitData analysts own production data pipelines end-to-end that interact with over 1,700 databases and 51,000 tables, without dedicated data engineers. This talk explains how to identify key areas of data infrastructure that can be abstracted with Databricks and PySpark so that data analysts can own production workflows. At YipitData, we pinpointed sensitive steps in our data pipelines and built powerful abstractions that let our analyst team easily and safely transform, store, and clean data. Attendees will find code snippets of utilities built with Databricks and Spark APIs that give data analysts a clear interface for reliable table/schema operations, reusable data transformations, scheduled jobs on Spark clusters, and secure processes for importing third-party data and exporting data to clients.
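To make the idea concrete, here is a minimal sketch of the kind of table-operation utility the abstract describes: a thin wrapper that validates identifiers and generates the SQL a Spark session would run, so analysts never hand-write `CREATE TABLE` statements. The names (`create_or_replace_table`, `_valid_name`) and the exact behavior are illustrative assumptions, not YipitData's actual API; in production the statements would be executed with `spark.sql(...)`, while here they are returned so the logic is testable without a cluster.

```python
import re

# Identifiers must be simple names, so generated SQL cannot be broken
# or injected into by a malformed database/table string.
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def _valid_name(name: str) -> bool:
    """Return True if `name` is a safe SQL identifier."""
    return bool(_NAME_RE.match(name))

def create_or_replace_table(database: str, table: str, select_sql: str) -> list[str]:
    """Build the SQL statements that atomically replace `database.table`
    with the result of `select_sql`.

    A production version would run each statement via spark.sql(...);
    returning the statements keeps this sketch cluster-free.
    """
    if not (_valid_name(database) and _valid_name(table)):
        raise ValueError(f"invalid table identifier: {database}.{table}")
    qualified = f"{database}.{table}"
    return [
        f"CREATE DATABASE IF NOT EXISTS {database}",
        f"CREATE OR REPLACE TABLE {qualified} AS {select_sql}",
    ]
```

An analyst would then write `create_or_replace_table("research", "daily_metrics", "SELECT ...")` instead of raw DDL, which is the "clear interface" the talk refers to: the risky parts (naming, atomic replacement) live in one reviewed utility.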
The talk will also showcase our system for integrating Apache Airflow with Databricks, so analysts can rapidly construct and deploy robust ETL workflows within the Databricks workspace. System administrators and engineers will also learn how to use Databricks and Airflow metadata to discover large-scale optimizations of analyst-managed pipelines and create business value. Attendees will walk away with concrete strategies, tools, and architecture to help their data analyst team own production data pipelines and, as a result, scale their engineering team and business.
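The metadata-mining idea above can be sketched in a few lines. Assume task-run durations have already been pulled from Airflow's metadata database or the Databricks Jobs API into a plain dict; the function below (an illustrative assumption, not YipitData's actual tooling) flags tasks whose latest run is much slower than their historical median, i.e. candidate pipelines for optimization.

```python
from statistics import median

def find_slow_tasks(runs: dict[str, list[float]], factor: float = 2.0) -> list[str]:
    """Flag tasks whose most recent duration (in seconds) exceeds
    `factor` times the median of their earlier runs.

    `runs` maps a task name to its run durations in chronological order,
    e.g. as fetched from Airflow or Databricks job-run metadata.
    """
    flagged = []
    for task, durations in runs.items():
        if len(durations) < 3:
            continue  # too little history to judge a regression
        *history, latest = durations
        if latest > factor * median(history):
            flagged.append(task)
    return flagged
```

A scheduled report built on this kind of check is one way administrators can surface optimization targets across hundreds of analyst-owned pipelines without reading each one.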