Daniel Imberman is a full-time Apache Airflow committer, a digital nomad, and constantly in search of the perfect bowl of ramen. Daniel received his BS/MS from UC Santa Barbara in 2015 and has worked on data platform teams at companies ranging from early-stage startups to large corporations like Apple and Bloomberg LP.
June 25, 2020 05:00 PM PT
When supporting a data science team, data engineers are tasked with building a platform that keeps a wide range of stakeholders happy. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. Collaboration between these stakeholders can be difficult, as every data science pipeline has a unique set of constraints and system requirements (compute resources, network connectivity, etc.). For these reasons, data engineers strive to give their data scientists as much flexibility as possible while maintaining an observable and resilient infrastructure. In recent years, Apache Airflow (a Python-based task orchestrator originally developed at Airbnb) has gained popularity among teams looking to spare their users from verbose and rigid YAML files. Airflow exposes a flexible, Pythonic interface that serves as a collaboration point between data engineers and data scientists: data engineers build custom operators that abstract away the details of the underlying system, and data scientists use those operators (and many more) to build a diverse range of data pipelines. In this talk, we will take an idea from a single-machine notebook, to a cross-service Spark + TensorFlow pipeline, to a canary-tested, hyperparameter-tuned, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
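To make the custom-operator idea concrete, here is a minimal sketch of what such an abstraction might look like. It is not taken from the talk: `TrainModelOperator`, its parameters, the container image name, and the job-spec layout are all hypothetical. The only Airflow pieces assumed are the `BaseOperator` class and its `execute(context)` hook; a small stand-in class is used when Airflow is not installed, so the sketch runs either way.

```python
try:
    from airflow.models.baseoperator import BaseOperator
except ImportError:
    class BaseOperator:
        """Minimal stand-in so this sketch runs without Airflow installed."""

        def __init__(self, task_id, **kwargs):
            self.task_id = task_id


class TrainModelOperator(BaseOperator):
    """Hypothetical operator written by a data engineer.

    Data scientists only choose a dataset and a learning rate; the
    infrastructure details stay inside the operator.
    """

    def __init__(self, *, dataset, learning_rate=0.01, **kwargs):
        super().__init__(**kwargs)
        self.dataset = dataset
        self.learning_rate = learning_rate

    def build_job_spec(self):
        # Infrastructure decisions (image, resources) are made here,
        # out of sight of the pipeline author. All values are placeholders.
        return {
            "image": "example.invalid/trainer:latest",
            "cpus": 4,
            "args": ["--dataset", self.dataset, "--lr", str(self.learning_rate)],
        }

    def execute(self, context):
        # A real operator would submit the spec to a cluster at this point;
        # this sketch only returns it.
        return self.build_job_spec()


# What a data scientist would write: no cluster details in sight.
task = TrainModelOperator(task_id="train_model", dataset="clicks_2020_06")
spec = task.execute(context={})
print(spec["args"])
```

In a real deployment the operator would be instantiated inside a DAG definition and Airflow would call `execute` itself; calling it directly here is only for illustration.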