Derek Gorthy is a senior software engineer on Zillow’s Big Data team. He is currently focused on leveraging Apache Spark to design the next generation of pipelines for the Zillow Offers business. Previously, Derek was a senior analyst at Avanade, implementing ML applications using Spark for various companies across the technology, telecom, and retail sectors. For this work, he received the Databricks Project Partner Champion award at the 2019 Spark+AI Summit. He has a BS in Computer Science and Quantitative Finance from the University of Colorado, Boulder.
As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal: some are for business-critical datasets, others for routine data preparation, and still others for experimental transformations that allow data scientists to iterate quickly on their solutions.
To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor its data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended users, leverages internal services through its modular design, and empowers users to create their own ETL without worrying about how it is implemented.
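To give a rough sense of what such an abstraction might look like, the sketch below shows a declarative pipeline definition that a self-service tool could translate into a Spark job. This is an illustrative assumption, not Zillow's actual platform or API: the config fields, the S3 paths, and the run_pipeline helper are all hypothetical.

```python
# Hypothetical sketch of a config-driven, self-service ETL pipeline.
# The config schema, paths, and run_pipeline function are illustrative only.
import yaml
from pyspark.sql import SparkSession

PIPELINE_CONFIG = """
name: listings_daily_rollup
source:
  format: parquet
  path: s3://example-bucket/raw/listings/
transform:
  sql: |
    SELECT region, COUNT(*) AS listing_count
    FROM source
    GROUP BY region
sink:
  format: parquet
  path: s3://example-bucket/curated/listings_daily_rollup/
"""

def run_pipeline(config_text: str) -> None:
    """Read a declarative config and execute it as a Spark job.

    The user supplies only the config; the platform owns the Spark session,
    I/O handling, and (in a real system) orchestration and deployment.
    """
    config = yaml.safe_load(config_text)
    spark = SparkSession.builder.appName(config["name"]).getOrCreate()

    # Load the declared source and expose it to the user's SQL as `source`.
    source_df = spark.read.format(config["source"]["format"]).load(config["source"]["path"])
    source_df.createOrReplaceTempView("source")

    # Apply the user-declared transformation.
    result_df = spark.sql(config["transform"]["sql"])

    # Write the result to the declared sink.
    result_df.write.mode("overwrite").format(config["sink"]["format"]).save(config["sink"]["path"])

if __name__ == "__main__":
    run_pipeline(PIPELINE_CONFIG)
```

In a design like this, the user never touches Spark directly; the platform can change how the job is deployed or executed without any change to the pipeline definition.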
Members of Zillow’s data engineering team discuss:
June 25, 2020 05:00 PM PT
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially those in a rapidly evolving organization. New data source ingestions are frequently added on an as-needed basis, making it difficult to leverage shared functionality across pipelines. Identifying when technical debt has become prohibitive for an organization can be difficult, and remedying it even more so. As the Zillow data engineering team grappled with their own technical debt, they identified the need for stronger data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for their downstream data scientists and machine learning engineers.
In this talk, the Zillow team explains how they designed their new end-to-end pipeline architecture to make the creation of additional pipelines robust, maintainable, and scalable, all while writing fewer lines of code with Apache Spark.
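To make the idea of consolidated, shared pipeline functionality more concrete, here is a minimal sketch of reusable Spark steps that any pipeline could compose, including a simple data quality check. This is not the Zillow team's implementation; the function names, rules, and composition are assumptions chosen for illustration.

```python
# Hypothetical sketch of shared pipeline functionality: reusable transformations
# and a data quality check that individual pipelines compose instead of rewriting.
# Function names and rules are illustrative, not Zillow's actual checks.
from typing import List

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def enforce_not_null(df: DataFrame, columns: List[str]) -> DataFrame:
    """Fail fast if any required column contains nulls."""
    for column in columns:
        null_count = df.filter(F.col(column).isNull()).count()
        if null_count > 0:
            raise ValueError(f"Data quality check failed: {null_count} nulls in '{column}'")
    return df

def deduplicate(df: DataFrame, keys: List[str]) -> DataFrame:
    """Shared deduplication step reused across pipelines."""
    return df.dropDuplicates(keys)

def standard_load(df: DataFrame, required: List[str], keys: List[str]) -> DataFrame:
    """Compose the shared steps so each new pipeline writes less bespoke code."""
    return deduplicate(enforce_not_null(df, required), keys)
```

Centralizing steps like these is one plausible way to write fewer lines of code per pipeline while enforcing data quality consistently, which is the kind of consolidation the abstract describes.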
Members of Zillow's data engineering team discuss: