At Nielsen Identity, we use Apache Spark to process tens of TBs of data, running on AWS EMR. We started at a point where Spark was not even supported out-of-the-box by EMR, and today we're spinning up clusters with thousands of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate the option of using Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we heavily rely on Spot Instances for our clusters). To achieve those goals, we combined the open-sourced GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs, with minimal changes, from AWS EMR to K8s.
In this talk, we’ll guide you through migrating Spark workloads to K8s, including:
* Challenges with existing Spark infrastructure and the motivation to migrate to K8s
* Aspects of running Spark natively on K8s (e.g. monitoring, logging, etc.)
* Best practices for using Airflow as the orchestrator
Speakers: Itai Yaffe and Roi Teveth
– Thanks, Itai. I'm going to talk with you about Spark on Kubernetes, the platform that lets us achieve the goals we just described. So let's do a quick walkthrough of what Kubernetes is. Kubernetes is an open-source platform for running and managing containerized workloads. It includes built-in controllers to support common workloads such as microservices, and it also has additional extensions, called operators, to support custom workloads such as Spark. And it's highly scalable.

Here we have a Kubernetes cluster. On this cluster we have pods; they are the applications that are running. A pod is a group of one or more containers, such as Docker containers, and pods run on the worker nodes, EC2 instances in our case. Every cluster also has a control plane, which assigns pods to nodes and provides other cluster management services. kubectl is the Kubernetes CLI; with kubectl we can see which pods are running on the cluster and also launch new workloads.

The term "operator" exists both in Airflow and in Kubernetes, but the two are quite different. In Airflow, an operator represents a single task; the operator determines what is actually executed when your DAG runs. For example, the BashOperator executes a bash command. A Kubernetes operator, on the other hand, is an extension to Kubernetes that holds the knowledge of how to manage a specific application. For example, the Postgres operator defines and manages a PostgreSQL cluster. So an Airflow operator is not the same as a Kubernetes operator.

Another cool feature of Kubernetes is cluster autoscaling. At this stage, we have a Kubernetes cluster without any pods running on it, so it doesn't have any worker nodes either. When I launch a new pod, it enters the Pending state; the cluster autoscaler is triggered to scale out, a new instance joins the cluster, and my pod is running. When I launch another pod, it enters the Pending state too; the autoscaler is triggered again, another instance joins the cluster, and now both of my pods are running. When my pods finish working, the cluster scales in again, and I'm paying only the cluster management fee, which is very cheap, in this case $60 per month.

So, Kubernetes in a nutshell: it's a platform for running and managing containerized workloads. In each cluster there is one control plane, zero to thousands of worker nodes, and zero to tens of thousands of applications running concurrently. A Kubernetes operator is not the same as an Airflow operator, and the cluster can automatically scale out and in.

Cool, so now we can start talking about Spark on Kubernetes. Since Spark version 2.3.0, Kubernetes is supported as a cluster manager. We are not paying any management fee per instance; we pay only for the instances themselves, plus the $60-per-month cluster management fee in Amazon's case, which is very cheap. This feature is still experimental, and some features are missing or incomplete; for example, dynamic resource allocation is lacking, and the external shuffle service is not implemented. How can we submit a Spark application to Kubernetes? We have two alternatives. The first one is using the spark-submit script that we all know, and the other is using the Spark-on-Kubernetes operator. Let's start with spark-submit.
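For reference, a spark-submit invocation against a Kubernetes cluster looks roughly like this (a sketch following the Spark documentation; the API server address, container image, and jar path are placeholders):

```bash
# Submit the SparkPi example to a Kubernetes cluster in cluster mode.
# The k8s:// prefix on --master points spark-submit at the K8s API server.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar
```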
So here, nothing is really out of the ordinary other than the master argument: I'm giving it the Kubernetes API server address, and spark-submit sends the SparkPi driver definition to the Kubernetes cluster. Now I have a Spark driver up and running, and the driver controls its own executors.

On the other hand, we have the Spark-on-Kubernetes operator, which is a Kubernetes operator that extends the Kubernetes API to support Spark applications natively. So now Kubernetes knows what a Spark application is. It's built by GCP as an open-source project; you can see the link below. With the Spark-on-Kubernetes operator, I write my Spark application parameters and arguments as a Kubernetes object manifest of kind SparkApplication. I can then send this manifest with kubectl, and the cluster receives it. The Spark-on-Kubernetes operator running on the cluster does the spark-submit for us, and now we have the Spark driver pod running on the cluster. And again, the driver pod controls its own executors.

Let's do a quick comparison of these two methods. spark-submit has built-in Airflow integration, but it lacks the ability to customize outputs, it doesn't offer easy access to the Spark UI, and we can't submit and view applications with kubectl. The Spark-on-Kubernetes operator doesn't yet have built-in Airflow integration, but it has the ability to customize outputs, it offers easy access to the Spark UI, and we can submit and view applications with kubectl.

So we wanted to get the advantages of the Spark-on-Kubernetes operator while still using Airflow, which meant we needed to integrate the two. We decided to take the road less traveled and build an Airflow integration for the Spark-on-Kubernetes operator, so now you can submit a Spark application via the Spark-on-Kubernetes operator from Airflow. After we built this integration, we contributed it back to the Apache Airflow open-source project, so you can use it too. A special thanks is deserved by all of the Airflow committers and community members who helped us along the way and gave us a lot of valuable mentoring.

So let's see what the Airflow Spark-on-Kubernetes integration looks like. It consists of three components. The first one is the SparkKubernetesOperator, which is an Airflow operator that sends the Spark application to the Kubernetes cluster. We also have the SparkKubernetesSensor, which pokes the Spark application's status and ends successfully if the Spark application completes successfully, or fails if the Spark application fails. And we have the KubernetesHook, which helps Airflow interact with the Kubernetes cluster.

So what have we gained by building this integration? First, we can now submit Spark applications via the Spark-on-Kubernetes operator from Airflow, and it's built into the Airflow project, so it comes with official support. Second, security: we save the Kubernetes credentials inside Airflow's connection mechanism, which is encrypted. Third, portability: we use the same template, a Kubernetes object manifest, both when running from Airflow and when running our application manually for debugging. And finally, it's Kubernetes-native: it communicates directly with the Kubernetes API. So thanks, and now back to Itai.
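For reference, a minimal SparkApplication manifest of the kind Roi describes might look like this (adapted from the examples in the spark-on-k8s-operator repository; the image, jar path, and service account are illustrative):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.0.0"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
```

You can submit it with `kubectl apply -f spark-pi.yaml` and inspect it with `kubectl get sparkapplications`.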
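And a minimal DAG using the Airflow integration might look like this (a sketch based on the example DAG shipped with the cncf.kubernetes provider package; the connection ID, namespace, and manifest file name are placeholders):

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)
from airflow.utils.dates import days_ago

with DAG(
    dag_id="spark_pi",
    schedule_interval=None,
    start_date=days_ago(1),
) as dag:
    # SparkKubernetesOperator sends the SparkApplication manifest
    # to the Kubernetes cluster.
    submit = SparkKubernetesOperator(
        task_id="spark_pi_submit",
        namespace="default",
        application_file="spark-pi.yaml",  # the manifest shown above
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,
    )

    # SparkKubernetesSensor pokes the application's status and
    # succeeds or fails along with it.
    monitor = SparkKubernetesSensor(
        task_id="spark_pi_monitor",
        namespace="default",
        application_name="{{ task_instance.xcom_pull(task_ids='spark_pi_submit')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
    )

    submit >> monitor
```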
– Thank you, Roi, for this great explanation of Spark-on-Kubernetes and of the Airflow integration you built. So, going back to the common data pipeline pattern we started with, you can see Airflow down here. And if we dig deeper into the architecture, you can see that it basically hasn't changed much. The only thing that has seriously changed is that in the data processing layer, we're now running Spark-on-Kubernetes. But something is still missing. So what is missing? We had to connect the dots and make it production-ready. Remember I told you that we wanted to gain better visibility and robustness, so let's see what we've done about that.

In terms of visibility, we have a dedicated Spark History Server for each Kubernetes namespace, which allows us to view the Spark UI for every Spark application that is running, or that has already completed. Of course, we needed metrics, so we send both Spark metrics, which are exposed via JmxSink, and system metrics, which are collected using the kube-state-metrics plugin, to our monitoring system. On top of that, we built dashboards that aggregate both the Spark and the system metrics, so we can see what's going on in our clusters in real time. We are developers, and we obviously need logs; in order to investigate issues, we collect the logs using Filebeat and send them to Elasticsearch. And nothing in a production environment is complete without alerting, so we use Airflow callbacks to emit metrics to our monitoring system, and then we can trigger alerts whenever needed based on the values of those metrics.

I also promised to talk about robustness, so here are two tips in this domain. The first one: you can actually run a Spark application across multiple availability zones. That can be very beneficial when using Spot Instances, because it allows you to increase the availability of those instances and decrease the probability of losing instances mid-execution of your Spark application. This is most suitable when your Spark applications have a low shuffle volume. The other tip is the AWS Node Termination Handler, which allows Kubernetes to gracefully handle events such as EC2 Spot interruptions, and it can slightly increase the stability of your cluster. This is open source, so you can take a look at the git repo here and install it on your cluster.

The benefits of migrating to Kubernetes are the same ones we started with: you can save about 30% of your costs, because there is no additional cost per instance like there is in EMR, and, as we saw, you can get better visibility and make your systems more robust using the tips I mentioned. As for the Airflow integration that Roi worked on and contributed, it will be available in the upcoming release of Airflow, which will be Airflow 2.0. I'm sure many of you can't wait, so we suggest you check out the backport package that's already available for Airflow 1.10.12. So basically, we've shown you that with minimal changes to your Airflow DAGs, you can boost your data pipeline like we did: save a lot of money every month, gain visibility, and make your systems more robust.
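As an aside, the Airflow-callback alerting Itai mentions can be as simple as emitting a counter from an on_failure_callback (a minimal sketch; the StatsD client, endpoint, and metric names are illustrative, not necessarily what Nielsen's monitoring stack uses):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from statsd import StatsClient  # hypothetical choice of metrics client

# Hypothetical StatsD endpoint for the monitoring system.
statsd = StatsClient(host="statsd.example.com", port=8125)

def emit_failure_metric(context):
    """on_failure_callback: emit a counter so the monitoring
    system can trigger an alert on failed tasks."""
    ti = context["task_instance"]
    statsd.incr("airflow.task_failure.%s.%s" % (ti.dag_id, ti.task_id))

dag = DAG(
    dag_id="my_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    # default_args propagates the callback to every task in the DAG.
    default_args={"on_failure_callback": emit_failure_metric},
)

process = DummyOperator(task_id="process", dag=dag)
```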
Just before we wrap things up, a few things we care about. Women in Big Data is a worldwide program that aims to inspire, connect, grow, and champion the success of women in the big data and analytics field. There are over 30 chapters and over 17,000 members worldwide, and everyone can join, regardless of gender. So we encourage you to find the chapter near you on the Women in Big Data website. We also have the Nielsen Identity tech blog, where we write about the awesome things that we do. Specifically, during this migration process, Roi and I wrote a two-part blog series about Spark Dynamic Partition Inserts and how it behaves when you run Spark on EMR versus Spark-on-Kubernetes. Cool, that's it for us. Thank you for listening. You can reach out to us on Twitter and LinkedIn, and we'll be taking questions now. Also, please don't forget to rate our session on the Data and AI Summit website. Thank you, everyone.
Itai Yaffe is a Principal Solutions Architect at Imply. Prior to Imply, Itai was a big data tech lead at Nielsen Identity, where he dealt with big data challenges using tools like Spark, Druid, Kafka, and others. He is also part of the core team of the Israeli chapter of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience in various forums in the past.
Roi Teveth is a big data engineer at Nielsen Identity Engine, where he specializes in research and development of solutions for big data infrastructure using cutting-edge technologies such as Spark, Kubernetes, and Airflow. In the past six months, he has been actively involved in open-source projects, specifically Airflow. In addition, Roi has a vast systems engineering background and is a CNCF Certified Kubernetes Administrator.