Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way


At Nielsen Identity, we use Apache Spark to process tens of TBs of data, running on AWS EMR. We started at a point where Spark was not even supported out-of-the-box by EMR, and today we're spinning up clusters with thousands of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate the option of using Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we heavily rely on Spot Instances for our clusters). To achieve those goals, we combined the open-source GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs, with minimal changes, from AWS EMR to K8s.

In this talk, we’ll guide you through migrating Spark workloads to K8s, including:
* Challenges with existing Spark infrastructure and the motivation to migrate to K8s
* Aspects of running Spark natively on K8s (e.g. monitoring, logging, etc.)
* Best practices for using Airflow as the orchestrator

Speakers: Itai Yaffe and Roi Teveth

– Hey everyone, thanks for joining us. Roi and I are very excited to be here at the Data and AI Summit Europe today. But before we begin, let me show you something. So take a look at your data pipeline. Now back to me, now back at your pipeline, now back to me. Sadly, your pipeline isn't me, but if you migrate your Airflow-based Spark jobs to Kubernetes, you can boost your data pipeline like me: you can save tens of thousands of dollars per month, gain visibility, and make your systems more robust. How exactly? This is the question we answer today. So now we can officially start. My name is Itai Yaffe. I'm a principal solutions architect at Imply, and previously I was the big data tech lead for a group at Nielsen. With me today is Roi Teveth, who is a big data developer at Nielsen Identity and a Kubernetes evangelist. You can reach us both over LinkedIn and Twitter. So let's see what you will learn today. First, you'll learn how to easily migrate your Spark jobs to Kubernetes in order to reduce costs and gain visibility and robustness, all that while using Airflow as your workflow management platform. So just a little bit about Nielsen. Nielsen is famously known for measuring the TV ratings in the US and other countries. It's essentially a data and measurement company, which means we collect data from various sources. Nielsen Identity unifies many of those proprietary data sets in order to generate a holistic view of a consumer. Of course, all the data is anonymized, and it essentially serves as a single source of truth about individuals and households. So a little bit about us in numbers, since we are a data company: we have over 10,000,000 events coming into our Kafka cluster every day. To process the data, we launch 6,000 Spark nodes every day and ingest 60 terabytes a day into our data lake, which is AWS S3 based. By now we have a few petabytes in that data lake.
And finally, we use Apache Druid as our OLAP database, and we ingest tens of terabytes of data into it. So all that massive infrastructure imposes challenges, and the main challenges are scalability, fault-tolerance, and cost efficiency. So, you can probably imagine that we have dozens of workflows running around the clock. Originally, we used AWS Data Pipeline for workflow management, but we also wanted better visibility into workflow configuration, better monitoring and statistics, and the ability to share common configuration or code between workflows. This is basically why we love Airflow: Airflow met all those requirements and more. We've been using it in production for over two years, with over 40 users across four groups. We run about a thousand data pipelines every day (data pipelines are called DAGs in Airflow), and we have about 20 automatic deployments of the DAGs every day. And thanks to the fact that Airflow is an open-source project, we were able to make six contributions to it. So don't worry if you're not familiar with Airflow; this is just the way Airflow visualizes your data pipeline. If we take a closer look at a high-level architecture, this is actually a very common data pipeline pattern for us. From left to right, you can see that everything starts with files in our data lake. Those files are processed by our Spark applications, which then write the output files to the intermediate storage, which is also S3 based. Then, periodically, we load the processed data into our OLAP database, which is Druid, as I mentioned. So you probably all know that there are a few available cluster managers for Spark: we have Mesos, we have YARN, we have Standalone, and the quite recently added Kubernetes. There are also options to run managed Spark versions on public clouds, such as AWS EMR, Databricks of course, GCP Dataproc, et cetera.
So if we focus on the data processing layer, we're actually leveraging AWS EMR to run our Spark clusters, and we use YARN as our cluster manager. So what is EMR? EMR is basically an AWS managed service to run Hadoop and Spark clusters. It allows you to reduce costs by using Spot Instances, which are basically spare compute resources you can get for a significant discount from your cloud provider. And because it's a managed service, you also pay a management fee for every instance in the cluster. So, let's see a pricing example. Say we have an EMR cluster that costs $1,000. Of that $1,000, we found that about $650 is what we pay for the machines. Of course this is based on the specific instance type, which in our example is i3.8xl, and on the current Spot pricing. This can obviously vary depending on the region, the instance type you're using, et cetera, but keeping this example, you can understand that there's still some more money to be paid, and those $350 are what we pay as the EMR fee. Now this may seem a bit high, but since EMR is a fully managed service, you also get all the benefits of it being a service. Also, running Airflow-based Spark jobs on EMR is rather easy, because EMR has official support in Airflow. So we have three components to run Spark applications from Airflow on EMR. The first one is the operator that basically creates new EMR clusters on demand. The second operator, which is called EMR Add Steps, basically adds the Spark step to the cluster, that is, it submits the Spark application. And the last one is called a sensor, and it basically checks whether the step succeeded or failed. Now, remember Airflow is an open-source project, so that allowed us to fix existing components, like this rather small fix that we did in the EMR step sensor. It also allowed us to contribute things that are not necessarily related to EMR, but rather related to our data pipelines in general.
So for example, this AWS Athena sensor and the OpenFaaS hook that my colleagues contributed. So, this was all great, but we wanted more: we wanted to further reduce our costs, we wanted to have better visibility, and we wanted to make our systems more robust. And now Roi will explain how we were able to achieve all that.
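For readers unfamiliar with these EMR components: the "step" that EMR Add Steps submits is essentially a JSON structure. Here is a minimal sketch of such a step definition, expressed as a Python structure; the job name, main class, and JAR path below are hypothetical, not from the talk.

```python
import json

# Sketch of an EMR "step" definition of the kind the Airflow EMR Add Steps
# operator submits. The application name, main class, and JAR path are
# hypothetical placeholders.
SPARK_STEPS = [
    {
        "Name": "my-spark-app",        # hypothetical job name
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command runner
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", "com.example.MyApp",    # hypothetical
                "s3://my-bucket/jars/my-app.jar",  # hypothetical
            ],
        },
    }
]

print(json.dumps(SPARK_STEPS, indent=2))
```

In a DAG, a structure like this is handed to the add-steps operator, and the step sensor then polls the resulting step until it succeeds or fails.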

– Thanks Itai. I'm going to talk with you about Spark on Kubernetes, the platform that helped us achieve the objectives we just talked about. So let's do a quick recap of what Kubernetes is. Kubernetes is an open-source platform for running and managing containerized workloads. It includes built-in controllers to support various workloads, such as microservices, it has additional extensions called operators to support custom workloads, such as Spark, and it's highly scalable. Here we have a Kubernetes cluster. On this cluster, we have the pods; they are the applications that are running. A pod is a group of one or more containers, such as Docker containers, and pods run on the worker nodes, EC2 instances in our case. In every cluster there is also a control plane. It assigns the pods to the nodes and provides other cluster management services. kubectl is the Kubernetes CLI. With kubectl we can see which pods are running on the cluster and also launch new workloads. The term "operator" exists both in Airflow and in Kubernetes, but the two are quite different. In Airflow, an operator represents a single task; the operator determines what is actually executed when your DAG runs. For example, we have the BashOperator that executes a bash command. On the other hand, a Kubernetes operator is an additional extension to Kubernetes, and it holds the knowledge of how to manage a specific application. For example, we have the Postgres operator that defines and manages a PostgreSQL cluster. So an Airflow operator is not equal to a Kubernetes operator. Another cool feature of Kubernetes is cluster auto scaling. At this stage, we have a Kubernetes cluster without any pods running on it, so it doesn't have worker nodes either. When I launch a new pod, it starts in the pending state. The cluster then triggers auto scaling, a new instance joins the cluster, and my pod is running.
When I launch another pod, it gets the pending state, too. The cluster triggers auto scaling again, another instance joins the cluster, and my two pods are running. When my pods finish working, the cluster scales in again, and now I'm paying only for the cluster management fee, which is very cheap; in this case it's $60 per month. So, Kubernetes in a nutshell: it's a platform for running and managing containerized workloads; each cluster has one control plane, zero to thousands of worker nodes, and zero to tens of thousands of applications running concurrently; a Kubernetes operator is not equal to an Airflow operator; and the cluster can automatically scale out and in. Cool, so now we can start talking about Spark on Kubernetes. Since Spark version 2.3.0, Kubernetes is supported as a cluster manager. We are not paying any management fee per instance; we're paying only for the instances themselves, plus $60 per month for the cluster management fee (in Amazon's case), which is very cheap. This feature is still experimental, and some capabilities are missing or incomplete: for example, dynamic resource allocation is lacking, and the external shuffle service is not implemented. How can we submit a Spark application to Kubernetes? We have two alternatives. The first one is using the spark-submit script that we all know, and the other alternative is using the Spark-On-Kubernetes operator. Let's start with spark-submit. Here, nothing is really out of the ordinary other than the master argument: I'm giving it the Kubernetes API server address, and spark-submit sends the SparkPi application definition to the Kubernetes cluster. Now I have the Spark driver up and running, and the driver controls its own executors. On the other hand, we have the Spark-On-Kubernetes operator, which is a Kubernetes operator, and it extends the Kubernetes API to support Spark applications natively.
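To make the spark-submit alternative concrete, here is a sketch of the kind of invocation described above, built as a Python argument list; the API server address, container image, and example jar path are placeholders, not values from the talk.

```python
# Sketch of a spark-submit command for running SparkPi on Kubernetes.
# The API server address, image, and jar path are placeholders.
api_server = "k8s://https://my-k8s-api-server:443"  # placeholder address

cmd = [
    "spark-submit",
    "--master", api_server,           # the Kubernetes API server, not YARN
    "--deploy-mode", "cluster",       # the driver itself runs as a pod
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.container.image=my-registry/spark:2.4.4",  # placeholder
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",      # placeholder
]

print(" ".join(cmd))
```

The `k8s://` prefix on the master URL is what tells spark-submit to target Kubernetes instead of YARN or Mesos.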
So now Kubernetes knows what a Spark application is. The operator is built by GCP as an open-source project; you can see the link below. With the Spark-On-Kubernetes operator, I write my Spark application's parameters and arguments as a Kubernetes object manifest, with the type SparkApplication. Now I can send this manifest with kubectl, and the cluster gets it. We have the Spark-On-Kubernetes operator running on the cluster, doing the spark-submit for us. And now we have the Spark driver pod running on the cluster, and again, the driver pod controls its own executors. Let's do a quick comparison of these two methods. spark-submit has built-in Airflow integration, but it lacks the ability to customize the pods, it doesn't offer easy access to the Spark UI, and we can't submit and view applications with kubectl. The Spark-On-Kubernetes operator didn't have built-in Airflow integration at the time, but it has the ability to customize the pods, it offers easy access to the Spark UI, and we can submit and view applications with kubectl. So we wanted to have the advantages of the Spark-On-Kubernetes operator together with Airflow, which meant we needed to integrate the two. We decided to take the road less traveled and build an Airflow integration with the Spark-On-Kubernetes operator, so now you can submit a Spark application via the Spark-On-Kubernetes operator from Airflow. After we built this integration, we contributed it back to the Airflow open-source project, so you can use it too. A special thanks is deserved by all of the Airflow committers and community members who helped us along the way and gave us a lot of valuable mentoring. So let's see what the Airflow Spark-On-Kubernetes integration looks like. It consists of three components. The first one is the SparkKubernetesOperator, which is an Airflow operator that sends the Spark application to the Kubernetes cluster.
We also have the SparkKubernetesSensor, which pokes the Spark application's status and ends successfully if the Spark application completes, or fails if the Spark application fails. And we have the KubernetesHook, which helps Airflow interact with the Kubernetes cluster. So what have we gained by building this integration? First, we can now submit Spark applications via the Spark-On-Kubernetes operator from Airflow, and it's built into the Airflow project, so it has official support. The second thing is security: we save the Kubernetes credentials inside the Airflow connection mechanism, which is encrypted. Also portability: we use the same Kubernetes object template both when running from Airflow and when running our application manually for debugging. And it's Kubernetes native: it communicates directly with the Kubernetes API. So thanks, and now back to Itai.
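To illustrate the "same template" point, here is a minimal sketch of a SparkApplication manifest of the kind both kubectl and the SparkKubernetesOperator consume, expressed as a Python dict; the image, jar path, namespace, and resource sizes are placeholders, not the values used at Nielsen.

```python
import json

# Minimal sketch of a SparkApplication object for the GCP Spark-On-Kubernetes
# operator. Image, jar path, namespace, and resource sizes are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "my-registry/spark:2.4.4",  # placeholder
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",
        "sparkVersion": "2.4.4",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "1g"},
    },
}

print(json.dumps(spark_app, indent=2))
```

Serialized to YAML, the same structure is what you apply with kubectl when debugging manually, or reference from the Airflow SparkKubernetesOperator when running in production.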

– Thank you, Roi, for this great explanation of Spark-On-Kubernetes and of the Airflow integration you built. So, going back to the common data pipeline pattern we started with, you can see Airflow down here. And if we dig deeper into the architecture, you can see that it basically hasn't changed much. The only thing that has seriously changed is that in our data processing layer, we're now running Spark-On-Kubernetes. But something is still missing. So what is missing? We had to connect the dots and make it production-ready. Remember I told you that we wanted to gain better visibility and robustness, so let's see what we've done about that. In terms of visibility, we have a dedicated Spark History Server for each Kubernetes namespace, which allows us to view the Spark UI for every running Spark application, or for applications that have already completed. Of course we needed metrics, so we send both Spark metrics, which are exposed via the JmxSink, and system metrics, which are collected using the kube-state-metrics plugin, to our monitoring system. On top of that, we built dashboards that aggregate both the Spark and the system metrics, so we can see what's going on in our clusters in real time. We are developers, and we obviously need logs; in order to investigate them, we collect the logs using Filebeat and send them to Elasticsearch. And nothing in a production environment is complete without alerting, so we use Airflow callbacks to emit metrics to our monitoring system, and then we can trigger alerts whenever needed based on the values of those metrics. I also promised to talk about robustness, so here are two tips in this domain. The first one is that you can actually run a Spark application across multiple availability zones.
That can be very beneficial when using Spot Instances, because it allows you to increase the availability of those instances and decrease the probability of losing instances mid-execution of your Spark application. This is very suitable when your Spark applications have low shuffle volume. The other tip is the AWS node termination handler. It allows Kubernetes to gracefully handle events such as EC2 Spot interruptions, and it can slightly increase the stability of your cluster. It's open source, so you can take a look at the Git repo here and install it on your cluster. So the benefits of migrating to Kubernetes are the ones we started with: you can save about 30% of your costs, because there is no additional cost per instance like we have in EMR, and we saw that we can get better visibility and make our systems more robust using the tips I mentioned. As for the Airflow integration that Roi built and contributed, it will be available in the upcoming release of Airflow, which will be Airflow 2.0. Now, I'm sure many of you can't wait, so we suggest you check out the backport package that's already available for Airflow 1.10.12. So basically, we've shown you that with minimal changes to your Airflow DAGs, you can boost your data pipeline like we did: you can save a lot of money every month, gain visibility, and make your systems more robust. Just before we wrap things up, a few things we care about. Women in Big Data is a worldwide program that aims to inspire, connect, grow, and champion the success of women in the big data and analytics field. There are over 30 chapters and over 17,000 members worldwide, and everyone can join regardless of gender, so we encourage you to find the chapter near you on the Women in Big Data website. We also have the Nielsen Identity Engine tech blog, where we write about the awesome things that we do.
Specifically, during this migration process, Roi and I wrote a two-part blog series about Spark Dynamic Partition Inserts and how it behaves when you run Spark on EMR versus Spark-On-Kubernetes. Cool, that's it for us. We thank you for your attention, you can reach out to us on Twitter and LinkedIn, and we'll be taking questions now. Also, please don't forget to rate our session on the Data and AI Summit website. Thank you everyone.

About Itai Yaffe


Itai Yaffe is a Principal Solutions Architect at Imply. Prior to Imply, Itai was a big data tech lead at Nielsen Identity, where he dealt with big data challenges using tools like Spark, Druid, Kafka, and others. He is also part of the core team of the Israeli chapter of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience in various forums in the past.

About Roi Teveth


Roi Teveth is a big data engineer at Nielsen Identity Engine, where he specializes in researching and developing solutions for big data infrastructure using cutting-edge technologies such as Spark, Kubernetes, and Airflow. In the past six months, he has been actively involved in open-source projects, specifically Airflow. In addition, Roi has a vast systems engineering background and is a CNCF-certified Kubernetes administrator.