Roi Teveth is a big data engineer at Nielsen Identity Engine, where he specializes in researching and developing big data infrastructure solutions using cutting-edge technologies such as Spark, Kubernetes, and Airflow. Over the past 6 months, he has been actively involved in open-source projects, specifically Airflow. In addition, Roi has an extensive system engineering background and is a CNCF Certified Kubernetes Administrator.
November 18, 2020 04:00 PM PT
At Nielsen Identity, we use Apache Spark to process tens of terabytes of data, running on AWS EMR. We started at a point where Spark was not even supported out of the box by EMR, and today we spin up clusters with thousands of nodes on a daily basis, orchestrated by Airflow. A few months ago, we set out to evaluate Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we rely heavily on Spot Instances for our clusters). To achieve those goals, we combined the open-source GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration that we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). As a result, we were able to migrate our existing Airflow DAGs from AWS EMR to K8s with minimal changes.
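To make the operator-based setup a little more concrete, here is a minimal sketch of the kind of `SparkApplication` manifest the Spark-on-K8s operator consumes, built as a plain Python dictionary. It is illustrative only and not taken from the talk: the helper name `build_spark_application`, the image, the file path, the namespace, the service account, and all resource sizes are assumptions. In an Airflow DAG, a manifest like this would typically be handed to the Kubernetes integration mentioned above; that wiring is left out here.

```python
import json


def build_spark_application(name, image, main_file, executors=2, namespace="default"):
    """Return a SparkApplication manifest as a plain dict (hypothetical helper)."""
    return {
        # API group and kind defined by the Spark-on-K8s operator's CRD.
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.0.0",  # assumed version, adjust to your image
            "restartPolicy": {"type": "Never"},
            # Resource sizes and the service account name are placeholders.
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "2g"},
        },
    }


# Hypothetical job: every value below is a placeholder.
manifest = build_spark_application(
    name="example-spark-job",
    image="example.registry/spark-py:3.0.0",
    main_file="local:///opt/spark/app/job.py",
    executors=4,
    namespace="spark-jobs",
)

# JSON is also accepted by the Kubernetes API, so no YAML library is required.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Keeping the manifest as data rather than as a hand-edited file makes it straightforward to parameterize per DAG run (for example, the executor count), which is one way a migration from EMR steps could keep existing DAG structure largely intact.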
In this talk, we'll guide you through migrating Spark workloads to K8s, including:
* Challenges with existing Spark infrastructure and the motivation to migrate to K8s
* Aspects of running Spark natively on K8s (e.g., monitoring and logging)
* Best practices for using Airflow as the orchestrator
Speakers: Itai Yaffe and Roi Teveth