Designing the Next Generation of Data Pipelines at Zillow with Apache Spark

The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization. Additional ingestions from data sources are frequently added on an as-needed basis, making it difficult to leverage shared functionality between pipelines. Identifying when technical debt is prohibitive for an organization can be difficult, but remedying it can be even more so. As the Zillow data engineering team grappled with their own technical debt, they identified the need for higher data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for their downstream data scientists and machine learning engineers.

In this talk, the Zillow team explains how they designed their new end-to-end pipeline architecture to make the creation of additional pipelines robust, maintainable and scalable, all while writing fewer lines of code with Apache Spark.

Members of Zillow’s data engineering team discuss:

  1. How they identified pain points in the development, maintenance, and scaling of their data pipelines
  2. The advantages and disadvantages of the ETL patterns considered
  3. How they ultimately leveraged their experience to architect more scalable, robust data pipelines using Apache Spark

Video Transcript

    – Hello everyone, a very warm welcome to our talk on designing the next generation of data pipelines at Zillow with Apache Spark. – Welcome, my name is Nedra Albrecht. I joined Zillow in October of 2017. I’m really excited that I get to work for Zillow, where I get to combine two of my favorite things, real estate and data. Speaking of data, I have over 20 years of database development experience working with all types of database systems. I also have extensive experience with data modeling and architecting new data processing systems. I’m also fortunate to be a founding member of the Zillow Offers data engineering team. – And my name is Derek Gorthy. I joined Zillow back in August of 2019, and I actually met my Zillow recruiter here at Spark Summit last year. So while we don’t have this in-person networking opportunity, I’m looking forward to meeting a lot of you virtually. My background is in developing highly scalable data pipelines, both batch and streaming, and machine learning applications, and I have over four years of experience in Apache Spark.

    – Our goal today is to describe the data pipeline architecture used to support Zillow Offers as it evolved. I’ll start by explaining how Zillow Offers works. Then I’ll talk about our original Zillow Offers requirements and our team’s responsibilities within the organization. Then I’ll describe our original architecture and how it relates to the original requirements. Together we’ll present how the scope of our project has changed and the impact that’s had on our team. Then Derek will present our next generation requirements and explain our new architecture in detail.

    He will delve into the first three sections, highlighting the different design decisions that we used to improve both data quality and development velocity. I’ll finish with the last two architectural sections and enumerate our key takeaways.

    We’d love your feedback on this session, so if you have a moment please drop us a review.

    You’re probably already familiar with Zillow, but you might not know about Zillow Offers. Zillow Offers is a whole new way to sell homes. Here is how it works: interested sellers complete an online application, answering a few questions about the home. After a few days, the Zillow Offers advisors will contact them and provide an initial offer. The initial offer is calculated using a number of pieces of data, including the Zestimate, facts, and comparable sales of nearby homes. If the seller is interested in continuing, we schedule an in-person evaluation and conduct an inspection. The price is adjusted using this information, and if the seller is happy with this adjusted offer, the seller chooses their closing date and Zillow will purchase the home. On possession, Zillow will then renovate the home, and we use data to determine the optimal amount of renovation to complete.

    Lastly, we list the house for sale with a Zillow Premier Agent.

    Here’s a video that shows one of the many benefits of selling a home using Zillow Offers.

    What is Zillow Offers?

    – Okay one minute, come on family. – I got the fish. – [Narrator] Leave for last minute showings. – [Dad] Go, go, go, go. (upbeat music playing) – [Narrator] Or just sell your house to Zillow. Skip the disruption of showings and get a competitive all cash offer, Zillow Offers.

    – I don’t know if anyone’s sold a house, but honestly, not having to show the house is a big motivation for me to want to use Zillow Offers.

    From an internal perspective, though, Zillow Offers operated as a startup within our company. Zillow really embraced the development of Zillow Offers by empowering teams to experiment with new tools, infrastructure, and models. It’s pretty obvious that we’re a data-driven company, and the effective generation and consumption of data is essential to running our business. Our team’s charter is to collect all data that’s necessary to run the business, serve as a source of truth for that data, and surface that data in a way that’s consumable by other teams.

    From the very beginning of Zillow Offers, the primary focus has been on speed of delivery, and our requirements reflected that. We scaled very quickly over the last few years, and we built out a great deal of tooling and infrastructure to support this business. Initially, the product teams implemented a mix of tooling: certain features were developed internally and others were filled with third-party tools. This meant data engineering had to support as many pipelines as there were sources to run our business. At the same time, we had to be able to develop pipelines very quickly. Since this data is the source of truth for the business, we had to expose it very quickly, which is critical to being able to iterate on the Zillow Offers business. In addition to that, how the data is exposed was very important. Our analytic teams needed the flexibility to implement and maintain the business logic layer, and data engineering couldn’t be a bottleneck with respect to schema management or data cleansing.

    Original Architecture

    This is the original architecture designed to support Zillow Offers. Looking from left to right, our pipelines start with the collection of data. As I mentioned, we had to support both internal and external sources. For our internal sources, product teams were streaming this data via Kinesis and we’d land it in S3. For external sources, we use typical API calls to land the data. The data lands as JSON, and the first thing we do is convert it to Parquet. Depending on the requirements of the pipeline, a small amount of custom logic might be implemented, and this is the point where we would merge deltas. All of our pipelines are run using Airflow as an orchestrator, and all logic is executed as a combination of both Airflow operators and Spark jobs. Each pipeline is architected independently, as indicated by the different colored boxes on the diagram.

    In our last stage of processing, we combine data sources that have similar structure and business rules into a single combined table. This reduces the number of tables that we have to maintain. Internal datasets sent via Kinesis had the actual data stored as a JSON payload on the row. This data stays embedded as JSON throughout the pipeline instead of being exposed as a static schema. This pattern reduces the burden on data engineering to manage and evolve the schema.

    Lastly, we expose the data via Hive and Presto tables. But really, all the action happens in the Presto views. All data cleansing, data type validations, date format conversions, and other transformations are very easily updated and maintained in this layer. Schema changes are easily incorporated by using a simple JSON extract command within the query. And really, views offer enormous flexibility to our analytic teams to iterate on the ever-changing business rules.

    Data Engineering Scope

    This original architecture worked well as Zillow Offers was proved out. Data engineering was able to quickly onboard data sources, and data analytic teams had wide flexibility to manipulate the data. As I previously mentioned, the initial focus for developing Zillow Offers was on speed of delivery. We launched in April of 2018 in a single market, Phoenix, but by 2020, we had expanded to over 20 markets. That’s roughly one market a month for almost two years. This rapid growth was reflected in the number of data asks, which increased exponentially over this time. Some data asks were for onboarding new data sources, as is to be expected, but many more were related to understanding and improving the data quality as it was presented by the analytic teams. In order to maintain a high velocity in the original architecture, we did implement a few data quality checks in the pipeline itself, but most data quality transformations were handled in the views.

    That meant our analytics teams had to be involved in responding to data quality issues that potentially could have been proactively addressed by data engineering as well. At this time, I’ll hand over this presentation to Derek, who will talk about the updated requirements and our next generation architecture. Derek?

    – Thank you, Nedra. Fast forward to 2020: like Nedra said, in the two years since Zillow Offers opened its first market in Phoenix, we’ve seen just an explosion in the number of data asks from our downstream teams. And while Zillow has invested consistently in data engineering resources (the team has more than doubled in size since I joined last August), we project that this gap between headcount and the number of data asks will continue to widen. Pair that with the time spent addressing production issues increasing relative to the time spent on new development, and we determined that our existing model would be unsustainable as we scaled to keep pace with the growing Zillow Offers business. We must create a new architecture that will allow us to scale easily, increase our velocity as a team, and increase the quality of the data that we present to our downstream teams. We’ve identified three key requirements for this new architecture.

    First, we want to decrease the time that it takes to onboard new data sources to keep up with the growing demands from our stakeholder teams, all while maintaining the flexibility that allows us to support a variety of internal and external data sources. Second, we want to address data quality issues earlier in our pipelines to ensure that these issues don’t impact our stakeholders, especially in the pipelines that drive reports used by executives that make business decisions critical to the Zillow Offers business. Third, we want to shift from independent pipeline specific development to library-based development that can be extended across Zillow. The work we’re doing should enable other teams, not just the Zillow Offers data engineering team.

    This diagram shows our new architecture that addresses these three new requirements.

    New Architecture

    We use the same style of shapes and colors so that it’s easy to compare the two architectures. Looking left to right, pipelines still start with the collection of data. One key difference here is that the pipelines are picking up data from S3 that is landed from Kafka, not Kinesis. The landing of third-party data services is more or less the same. Moving right, this large dark blue box represents our orchestration layer. We now use Airflow only for orchestration and generate pipelines by parsing YAML configuration and Avro schema files.

    Moving down, the three stacked blue boxes represent individual pipelines. We’ve stacked them to represent the pipelines as entities that are now dynamically generated as opposed to independently architected. Looking into the components within the stacked blue boxes, we see how the pipelines are actually structured. Each gray parallelogram represents a distinct transformation that we’ve implemented in a library that we call Pipeler. This library is written in Scala and primarily leverages Apache Spark for data processing.

    First, we perform file type conversions, schema validations, and data validations, then land each event into its own Hive table in the valid data set layer. This is the first time that the data is surfaced in a Hive table, and it is primarily used for retaining historical data. Next, we merge the new partition with the existing data set if needed, and flatten out the nested structure of each object to make the data easier to work with. After these steps, we land each event into its own serve data set, which is the first Hive table that we expose to our downstream teams. Finally, we apply source-specific business logic and perform data auditing before landing that data into the final Hive layer that we call our data marts. This logic is still source specific and custom, but it’s now done in Spark SQL instead of Presto JSON extract.

    I know that that’s a lot to digest all at once, but we want to provide a holistic view of the new architecture. Given this context, we’ll elaborate more on the key components of this new architecture and expand on how they fit into the larger picture, how they tie back to our requirements, and how they impact quality and/or velocity.

    Establish Processing Layers

    In our new architecture, we established three distinct processing layers that now allow us to view data in various stages of the pipeline. I remember that this was a hot topic at Spark Summit here last year, so I’ll just explain how we think about these three layers. The first we call the valid data set layer. This is a faithful representation of the source data with basic schema and data validations applied, and it’s primarily used for retaining historical data. Each Hive table in this layer contains only one event, where previously we left events embedded in JSON columns. This makes it easier to enforce schema now that we’re only handling one event at a time. The second is the serve data set layer. The exact transformations here vary by source, but we focus on preparing the valid data set for easy consumption. This is the first layer that we expose to our downstream teams. The third and final layer is our data marts. We apply specific business logic to our serve data sets to present a denormalized view of the data. This layer is used to build objects to support aggregated data sets and reporting, and can combine one or more serve data sets, depending on the context of the events. Tying this back to our chart in the top right corner, this greatly improved our quality because we can now broadly apply schema validation and data quality checks across all our pipelines.

    Pipeler Library

    The second component in our new architecture is the implementation of a Spark processing library that we call Pipeler. Because pipelines, get it? (laughing) We weighed the pros and cons of using external libraries versus developing our own. Because we wanted to iterate quickly and contribute to something that could be extended across Zillow, as well as being highly customizable, we decided to build a library that implements common transformations that we saw as patterns throughout our existing pipelines. While this could be a talk on its own, we see these three improvements as the most significant.

    First, each processing block does one distinct operation and is now more efficient because we’re implementing it in Scala. This allows us to mix and match these processing blocks depending on the characteristics of the data that we’re processing. Let’s take one step as an example: our data conversion or, in this graphic, convert to Parquet step. The purpose of this block is to take data that we receive in various formats, whether that be CSV, JSON, or really any input format that we want to support, and convert it into Parquet. It’s very clear what this transformation is doing, and it’s something that we can use throughout many of our pipelines.

    Second, an operation is implemented in a single place. This enables us to standardize the implementation of an operation and write it in fewer lines of code, all while increasing the test coverage on our processing logic. As many of you know, it’s difficult to achieve 100% unit test coverage in custom scripts, especially if you’re maintaining a lot of them. In the case of our data conversion block, we can write tests to cover each input format that we accept and the different Parquet write parameters that we choose to support.

    Third, debugging pipeline failures is a simpler process. We know which operation failed and can pinpoint the transformation that the pipeline is failing on just by looking at Airflow. In addition to this, we have intermediate output between steps that can be used for debugging. So I can quickly take a look at the state of the data before the transformation, and then step through the logic to see exactly where it’s failing.

    One of the biggest changes is that data quality is now decoupled from code quality. We mitigate the risks that come with implementing similar functionality in independent pipelines and can now rigorously test each processing block separately. Taking a step back here, this was really a fundamental shift in how we thought about the role of a data engineer, in that we can spend more time building tooling and less time developing one-off pipeline logic. In doing so, we gain efficiencies at scale as we dedicate the time that we would be spending developing individual PySpark scripts to refining our library logic and writing additional test cases. Tying this back in, this component increases quality by improving the overall code quality in our repositories, and increases velocity over the long run by developing tooling upfront that we can then leverage in all of our data sources.
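As an illustration of the processing-block idea described above, here is a minimal Python sketch. It is not Pipeler code (Pipeler is written in Scala and runs on Spark). Every name here (`convert`, `READERS`, `run_blocks`, `BlockFailure`) is hypothetical, and the conversion stops at in-memory rows rather than writing Parquet. It shows one block doing one operation in a single, testable place, and a runner that keeps intermediate output so a failure can be traced to a named block.

```python
import csv
import io
import json
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, object]]


def parse_json_lines(raw: str) -> Rows:
    """Parse newline-delimited JSON into a list of row dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def parse_csv(raw: str) -> Rows:
    """Parse CSV text with a header row into a list of row dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(raw))]


# Hypothetical registry: one reader per supported input format.
READERS: Dict[str, Callable[[str], Rows]] = {
    "json": parse_json_lines,
    "csv": parse_csv,
}


def convert(raw: str, input_format: str) -> Rows:
    """The single 'convert' block: normalize any supported format to rows."""
    if input_format not in READERS:
        raise ValueError(f"unsupported input format: {input_format}")
    return READERS[input_format](raw)


class BlockFailure(RuntimeError):
    """Raised when a block fails; carries the outputs of the blocks before it."""

    def __init__(self, name: str, intermediates: dict, cause: Exception):
        super().__init__(f"block '{name}' failed: {cause}")
        self.name = name
        self.intermediates = intermediates


def run_blocks(data, blocks: List[Tuple[str, Callable]]):
    """Run named blocks in order, keeping each intermediate output."""
    intermediates = {}
    for name, block in blocks:
        try:
            data = block(data)
        except Exception as exc:
            raise BlockFailure(name, intermediates, exc) from exc
        intermediates[name] = data
    return data, intermediates
```

Because each format reader sits behind one `convert` function, a test per input format covers every pipeline that uses the block.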

    Config-driven Orchestration & Pipeline Generation

    Next, we’ll look at our config-driven orchestration layer. In our architecture, we drive the generation of our pipelines through three components. The first component is a YAML configuration file that’s structured similar to the code snippet on the left. If we think about this abstractly, a configuration file needs a list of transforms, configurations for those transforms, and information to construct the actual pipeline. Looking at the transform configurations piece, this should define all the information needed to configure the step. Most of the steps that we support are Pipeler transformations, but we can also support S3 sensors, Airflow task sensors, and any other steps that we need in our pipelines. As you can see from the sample config, common parameters to Pipeler transformations are input and output pathing, the number of file partitions to write, Hive table information, and other transformation-specific parameters. In Airflow, we need to configure the Airflow DAG object with parameters like schedule time, run frequency, and alerting information.

    The second component is a set of schema files that define how the data is structured in the three Hive layers throughout our pipeline. We use these schema files to create the actual Hive tables and to drive the schema validation step at the beginning of the pipeline. The third component is an orchestration library that parses the configuration files, constructs the Pipeler calls and other operational blocks, relates those steps together, and then constructs the Airflow DAG and tasks. The thinking here is similar to how we thought about Pipeler, in that data engineers can now spend more time building pipeline generation utilities, so that the only work that goes into setting up a new pipeline is creating the configuration file and the Avro schema files for the new data source.

    We’ve implemented this library in such a way that Pipeler calls are constructed before passing the call to the Airflow operator. We made this design decision so that the configuration parsing logic is independent of Airflow, and so that we can move to another orchestrator with minimal rework. In doing this, we increase velocity by decreasing the time it takes to stand up a new pipeline from a matter of days to a matter of hours. This makes us a lot more responsive to data onboarding requests from our downstream teams. I’ll hand the rest of the presentation off to Nedra to cover the next two architectural blocks and to wrap up the presentation with key takeaways.
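The separation described above, where configuration is parsed into fully constructed calls before anything touches the orchestrator, can be sketched as follows. This is a hedged illustration only: the config keys, paths, table names, and the `StepCall`, `build_step_calls`, and `to_dag_spec` names are all hypothetical. The config is written as a Python dict standing in for the YAML file so the sketch needs no YAML parser, and the "DAG" is a plain dict rather than a real Airflow object.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical config, shaped loosely like the YAML file described in the talk.
CONFIG = {
    "pipeline": {
        "name": "example_source",
        "schedule": "0 6 * * *",
        "alert_email": "team@example.com",
    },
    "transforms": [
        {
            "name": "convert_to_parquet",
            "input_path": "s3://bucket/raw/",
            "output_path": "s3://bucket/parquet/",
            "num_partitions": 4,
        },
        {
            "name": "validate_schema",
            "input_path": "s3://bucket/parquet/",
            "output_path": "s3://bucket/valid/",
            "hive_table": "valid.example_source",
        },
    ],
}


@dataclass(frozen=True)
class StepCall:
    """One constructed transformation call, independent of any orchestrator."""
    step_id: str
    transform: str
    args: Dict[str, object]
    upstream: Optional[str]


def build_step_calls(config: dict) -> List[StepCall]:
    """Turn the transform list into ordered calls; each step follows the previous one."""
    calls: List[StepCall] = []
    previous = None
    for index, transform in enumerate(config["transforms"]):
        args = {key: value for key, value in transform.items() if key != "name"}
        step_id = f"{index:02d}_{transform['name']}"
        calls.append(StepCall(step_id, transform["name"], args, previous))
        previous = step_id
    return calls


def to_dag_spec(config: dict, calls: List[StepCall]) -> dict:
    """Orchestrator adapter: the only part that would change with a new scheduler."""
    return {
        "dag_id": config["pipeline"]["name"],
        "schedule": config["pipeline"]["schedule"],
        "tasks": [{"task_id": c.step_id, "upstream": c.upstream} for c in calls],
    }
```

In this arrangement, swapping orchestrators would mean rewriting only the adapter function, while `build_step_calls` and the config format stay the same.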

    Data Processing vs. Business Logic

    – Thank you, Derek. Earlier I detailed how we used Presto views to manage the bulk of our data manipulation. Because data processing and business logic were commingled, the cleansing and non-business-logic code was quite repetitive among the views. In the new architecture, we broke out this common logic and migrated it to transforms, which we implemented in Pipeler. The examples provided here are the merge deltas and flatten arrays transforms, but we aren’t limited to just those. As new patterns are requested or discovered, we in turn implement those in the related transform library.
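To make the two transforms named above concrete, here is a small Python sketch of what a delta merge and a flatten step might do. It is an assumption-laden illustration, not the Pipeler implementation: the function names, the choice of a version column to pick the latest record, and the underscore naming of flattened columns are all hypothetical, and only nested objects are flattened (array handling is left out).

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def merge_deltas(existing: Iterable[Row], delta: Iterable[Row],
                 key: str, version: str) -> List[Row]:
    """Keep the latest version of each record, identified by `key`.

    On a version tie, the row seen later (the delta) wins.
    """
    latest: Dict[object, Row] = {}
    for row in list(existing) + list(delta):
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


def flatten(row: Row, prefix: str = "") -> Row:
    """Flatten nested dicts into one level with underscore-joined column names."""
    flat: Row = {}
    for name, value in row.items():
        column = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}_"))
        else:
            flat[column] = value
    return flat
```

For example, `flatten({"id": 1, "address": {"city": "Phoenix"}})` yields a single-level row with the columns `id` and `address_city`.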

    Separating this logic has significantly simplified the remaining business logic code. Now, business logic can be evaluated apart from the data processing logic, and we can apply data auditing rules. Being able to check that the data received matches the pattern we expect greatly improves our ability to monitor our data quality over time. Our typical checks include checking for duplicates, checking for rows that violate the business rules, and checking for rows that differ in a pattern that we wouldn’t expect. If our dataset passes, we promote it for use by the analytic teams.
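A minimal sketch of the audit-then-promote idea, in plain Python. The function names, the report shape, and the example rule are hypothetical and are not taken from the Zillow pipelines. It covers two of the checks mentioned, duplicates and business-rule violations, and reports a single pass/fail flag that a pipeline could use to decide whether to promote the dataset.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def find_duplicates(rows: List[Row], key: str) -> List[object]:
    """Return the key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def find_rule_violations(rows: List[Row],
                         rules: Dict[str, Callable[[Row], bool]]) -> List[Tuple[str, Row]]:
    """Return (rule name, row) for every row that fails a rule.

    Each rule is a predicate that returns True when the row is valid.
    """
    return [(name, row) for row in rows
            for name, rule in rules.items() if not rule(row)]


def audit(rows: List[Row], key: str,
          rules: Dict[str, Callable[[Row], bool]]) -> dict:
    """Run all audits; the dataset passes only if nothing is flagged."""
    report = {
        "duplicates": find_duplicates(rows, key),
        "violations": find_rule_violations(rows, rules),
    }
    report["passed"] = not report["duplicates"] and not report["violations"]
    return report
```

A caller would promote the dataset only when `audit(...)["passed"]` is true, and otherwise surface the report for investigation.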

    Creating complex data marts has also been vastly simplified. Once the serve data has been exposed, it’s ready to go: all needed data cleansing and validation has already been applied. Our serve data sets can then be combined to create simple data marts, which can then be combined further into more complex data marts. By layering this business logic, we improve the ability to audit and troubleshoot individual logic statements. Initially, our analytic teams were responsible for the maintenance and improvement of views, and the data engineering team was responsible for addressing issues that arose from those views. To reliably and proactively improve our data quality, our team had to assume ownership of this core business logic. We still support supplemental views, but all business-critical data sets have been refactored for efficiency and scalability. Business logic was refactored using Spark SQL so that our analytic teams could continue to contribute to this logic. The SQL itself is written to be run as part of the pipeline processing.

    In splitting out the data processing from the business logic, we gained improvements in data quality and development velocity by layering the business logic.

    Validating Data Early

    The volume, veracity, and variety of our data have continued to change as we have evolved the Zillow Offers business model, which means our team had to be prepared to handle those changes. The analytic teams rely on this data to make timely and accurate decisions, and a major way we improve data quality is by applying checks as early as possible to ensure the data being processed aligns with what we expect. We created two steps to handle this validation. In the schema validation step, we compare the schema of the data set, the actual fields sent, against an expected Avro schema. Derek talked about this in the config-driven orchestration slide. We handle schema evolution semi-automatically: we add columns and do simple typecasting whenever it’s needed. If we have missing columns, or there is a larger data type mismatch, our pipelines will throw an error. This is also the step where data is extracted from the JSON payload and set up to be exposed in Hive tables as columns. Exposing the schema simplifies our queries further down the pipeline and for the analytic teams.
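The schema validation behavior described above can be sketched in plain Python. This is one reading of the talk, not the actual implementation: the expected schema is a simple field-to-type dict standing in for an Avro schema, the field names are invented, and the set of casts treated as "simple" (int to float, numeric strings to numbers) is an assumption. New columns are accepted and reported, while missing columns and larger type mismatches raise an error. Null handling is left out of the sketch.

```python
import json
from typing import Dict, List, Tuple


class SchemaError(Exception):
    """Raised when a record cannot be reconciled with the expected schema."""


# Hypothetical expected schema: a simplified stand-in for an Avro schema.
EXPECTED = {"home_id": int, "list_price": float, "city": str}


def cast_simple(name: str, value: object, target: type) -> object:
    """Allow only simple casts; anything else counts as a larger mismatch."""
    if type(value) is target:
        return value
    if target is float and type(value) is int:
        return float(value)
    if target in (int, float) and type(value) is str:
        try:
            return target(value)
        except ValueError:
            pass
    raise SchemaError(
        f"field '{name}': cannot cast {type(value).__name__} to {target.__name__}"
    )


def validate_record(payload: str, expected: Dict[str, type]) -> Tuple[dict, List[str]]:
    """Extract one JSON payload into columns and check it against the schema.

    Returns the typed row plus the names of columns not yet in the schema.
    """
    record = json.loads(payload)
    missing = sorted(set(expected) - set(record))
    if missing:
        raise SchemaError(f"missing columns: {missing}")
    row = {name: cast_simple(name, record[name], target)
           for name, target in expected.items()}
    new_columns = sorted(set(record) - set(expected))
    for name in new_columns:
        row[name] = record[name]  # accepted as-is; the schema would evolve to include it
    return row, new_columns
```

Returning the new column names separately leaves room for a pipeline to log or review schema evolution instead of applying it silently.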

    In our data validation step, we perform checks that relate to the data contents. For instance, we check for NULL and NOT NULL fields, values that are not in an enum list, values that exist outside of a range, and more.
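The content checks listed above (null checks, enum membership, and range checks) could look something like the following Python sketch. The helper names and the report format are hypothetical. The checks only flag failing rows by index and never modify the data.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Check = Callable[[Row], bool]


def not_null(field: str) -> Check:
    """Valid when the field is present and not None."""
    return lambda row: row.get(field) is not None


def in_enum(field: str, allowed: Iterable[object]) -> Check:
    """Valid when the value is one of the allowed values (nulls are skipped)."""
    allowed_set = frozenset(allowed)
    return lambda row: row.get(field) is None or row[field] in allowed_set


def in_range(field: str, low: float, high: float) -> Check:
    """Valid when low <= value <= high (nulls are skipped)."""
    return lambda row: row.get(field) is None or low <= row[field] <= high


def validate(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, List[int]]:
    """Return {check name: indexes of failing rows} for every check that fails."""
    failures: Dict[str, List[int]] = {}
    for name, check in checks.items():
        bad = [index for index, row in enumerate(rows) if not check(row)]
        if bad:
            failures[name] = bad
    return failures
```

An empty result from `validate` means every check passed; a non-empty result names each failed check and the rows that tripped it.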

    Data Validation is not meant to correct data but instead detect issues that might result in downstream failures later on. One thing that our team also noticed is that we cannot defensively code against all possible scenarios. Collaboration is key to ensuring that data quality is shared equally amongst all teams. Data contracts and data standards are two ways we have formalized these collaborations. By data contracts, I mean the rules that are used to define what the pipeline is. This includes the Avro schema, the data validation rules and the data auditing rules. We even put standards in place for everything from naming to pathing to config structure. This ensures consistency across the pipelines, which reduces the effort to support and maintain them and increases our overall data quality.

    Building the next generation of Zillow Offers data pipelines has been very exciting, and there are three main takeaways we’d like to share with you. First, how data engineers think about pipeline development is of paramount importance. Our focus had to shift from creating specialized pipelines to looking holistically at all of our pipelines and determining the commonalities they share. Practically speaking, that meant data engineering had to contribute to a set of shared resources and tooling, as well as continue to understand and onboard new datasets using this framework. Secondly, data quality is very important and has to be checked at various stages of the pipeline. Detecting and alerting on data quality issues early on allows us to be proactive in our responses. Producing quality data reliably also increases the trust teams put in our data and improves the decisions the business makes. Finally, we learned that data quality is a shared responsibility among teams, and collaboration is needed by everyone to be successful. Product teams and data engineering teams need to partner together to develop the schema and check rules that conform to our standards. In conclusion, the data pipeline architecture has evolved based on the ever-changing needs of the Zillow Offers business. We’ve made changes in how we think about and process data, which has resulted in direct improvements to our data quality and development velocity. And as Zillow Offers continues to evolve, we as a team will also continue to evolve.

    Thank you for attending our presentation. It has been a pleasure to share with you both the original and the new architecture that underlies Zillow Offers. We’d like to end with a short video that reflects our mission statement: helping people unlock life’s next chapter. We hope you found our presentation very informative. Please stick around for the question and answer period that follows this presentation.

About Derek Gorthy


Derek Gorthy is a senior software engineer on Zillow’s Big Data team. He is currently focused on leveraging Apache Spark to design the next generation of pipelines for the Zillow Offers business. Previously, Derek was a senior analyst at Avanade, implementing ML applications using Spark for various companies across the technology, telecom, and retail sectors. For this work, he received the Databricks Project Partner Champion award at the 2019 Spark+AI Summit. He has a BS in Computer Science and Quantitative Finance from the University of Colorado, Boulder.

About Nedra Albrecht

Zillow Group

Nedra Albrecht is a senior data engineer at Zillow with over 20 years of experience working with data. Nedra has worked extensively with data modeling and data architecture for every major data paradigm, including transactional data processing (OLTP), data warehousing (OLAP), and now big data. She is currently focusing on architecting scalable, generic, config-driven data pipeline systems that handle the wide variety of data sources that Zillow consumes.