Hybrid Apache Spark Architecture with YARN and Kubernetes

May 28, 2021 10:30 AM (PT)

Download Slides

Lyft is on the mission to improve people’s lives with the world’s best transportation. Starting 2019, Lyft has been running both Batch ETL and ML spark workloads primarily on Kubernetes with the Apache Spark on k8s operator. However, with the increasing scale of workloads in frequency and resource requirements, we started hitting numerous reliability issues related to IP allocation, container images, IAM role assignment, and Kubernetes Control Plane.

To continue supporting growing Spark usage with Lyft, the team came up with a hybrid architecture optimized for containerized and non-containerized workload based on Kubernetes and YARN. In this talk, we will also cover a dynamic runtime controller that helps with per environment config overrides and easy switchover between resource managers.

In this session watch:
Catalin Toda, Software Engineer, Lyft Inc.
Rohit Menon, Software Engineer, Lyft Inc.



Rohit Menon: Is your organization planning to use Spark on Kubernetes? Or planning to migrate workloads from YARN to Kubernetes? Are you wondering what could be the issues that you would run into if you run Spark on Kubernetes at scale?
Hello, everyone. My name is Rohit and I’m a software engineer on the data infrastructure team at Lyft. In our presentation today, my colleague Catalin and I will be presenting our experiences running Spark on Kubernetes at Lyft over the last couple of years, the challenges that we have seen with the K8s model, and how we believe a hybrid model that brings in the advantages of both K8s and YARN could be beneficial.
Now, let me briefly walk you over our agenda today. First, I’ll be giving you a bit of context about Spark and Lyft. So who are our users of Spark? How is it used? Next, I’ll be sharing a bit of evolution of our Spark architecture over a couple of years, and how we kind of arrived on Spark on Kubernetes today. Next, I’ll be sharing the major challenges that we see with the K8s model. At this point, my colleague, Catalin will be taking over and sharing the hybrid approach that we have come up with, with Kubernetes and YARN. Next, we’ll be covering some points about some internal tooling that we’ve built for image management, conflict management, a bit about Spark Operator and how we use it. And finally, we’ll be sharing our progress and future plans.
So what’s Spark used for at Lyft? So Lyft is mostly a Python shop. And a lot of this goes back to the extensive use of Python in our micro services. Of course, over time, this is things and we’ve adopted a lot of [inaudible]. But what it means is that our infrastructures team doesn’t really support Python well, in terms of monitoring, logging, development, environment, and whatnot. And as a result, we see a lot of adoption for Pyspark. We do have certain use cases in Scala. But those are mostly very specific. And with the need to extensively use cases within Spark.
In terms of where we run, entire Lyft infrastructure is biggest in AWS. And this is true for data platform too. And we use S3 as our permanent storage. For our users, we support an interactive development environment. And when I say users, these are data scientists, data analysts and data engineers. And this is basically based on Jupyter and Spark on Kubernetes. And in this case, we use the Spark on Kubernetes client mode.
Another major use case for us is the ML batch use cases. So what this means is basically a lot of data prep for model training, feature service, things like that. And some of the example projects are like pricing, routing, mapping.
Moving on, another major use case is around data engineering and ETL. And this starts right from event ingestion, the pipeline, which in this data from our streaming sources, to our data warehouse with low latency. In fact, event ingestion is one of our biggest Spark workloads. And I’ll get into more details of how this actually makes an impact in our architecture design.
After event ingestion, we have numerous downstream ETLs, which are also using Spark, Pyspark and Spark SQL. And finally, there are some certain specific use cases like financial data sets, sales, compliance related which use Spark because of reasons around permission management, which is quite flexible with the Spark model that we have. And finally, a use case that’s coming up for us is migrating our Hive workloads to Spark SQL.
So now, let me share a bit of history about how we got today to Spark on Kubernetes and where we were in 2018. So in 2018, Lyft was mostly running on a third party service, the third party service, primarily supported YARN. All of our Hive and Spark workloads used to run on YARN clusters. Specifically for Spark. The way the third party service kind of enabled onboarding is that every major use case per customer would have their own YARN cluster.
This was immediately a concern. Basically, what it meant is that every customer had to maintain their own bootstrap scripts for their clusters. These bootstrap scripts would include things like Python dependencies, or native dependencies at certain times, or they might need a special version of Python, or a special version of Spark at times. So this was a lot of variables, which made it a bit difficult for teams to use Spark. Also, from the infrastructure team perspective, we did see an explosion in the number of clusters that we had to support. Of course, this adds to the management overhead for an infra team and also, it gets difficult to provide a high quality of service for our customers. And finally, another limitation that the third party service also had was that these clusters were based on EC2 machines. And if you wanted a customer angle for your use case, what it meant is you had to create an own cluster for yourself.
So again, all of these new ones, there was a lot of friction to adopt Spark, at Lyft. And we really didn’t see a lot of adoption in 2018. Moving on, comes 2019. And things start to move towards Kubernetes for us. There are a few reasons and I’ll highlight the top three reasons why this happened. So Lyft infrastructure team basically blessed Kubernetes as the way forward for stateless services. What this meant is that infrastructure team like data platform now had the opportunity to utilize Kubernetes to run stateful workloads. Maybe the whole challenges are different in terms of stateful versus stateless. But still, we had an effort in supporting Kubernetes deployment. So this was a game changer to help us onboard onto Kubernetes.
The second major point was around Google open source to Spark on Kubernetes Operator. And this started gaining maturity, and was really working quite well in open source. And the third reason was around flight. So flight is an open source container native scheduler. And this really took off it Lyft four ML use cases.
So along with the flight model, and Google open source Spark on Kubernetes Operator, what it meant is that in this new world, when a customer wanted to onboard to Spark, they would be having their own project, which is within Lyft map to a single one is to one GitHub repo, you would have your own Docker image with your own dependencies. And this was a much easier model as compared to bootstrap scripts, which are not really well tested. And finally, along with support from the infra team, we could also have per job IAM roles.
So if you look at the state, where we were at the end of 2019, and comparing this with 2018, a lot of the challenges that we saw with maintenance, and IAM roles, and all of that were solved pretty easily by the Kubernetes model.
So moving on, this is a high level architecture diagram of a Spark on Kubernetes architecture. So this is a pretty standard deployment on the Lyft. Mostly you can see it’s per clients. For us, this is mostly flight. And we also have books from other clients to flight like for example, from airflow, you could check off a Spark job on flight. This would then interact with the Kubernetes clusters, create a custom resource definition for a Spark job. The Spark Operator is listening on these CRDs and then will kick off a Spark job, which is, in this case, the Spark driver and the executors all running on Kubernetes.
Come to 2020. This is when it started getting interesting for us in terms of scale. So a couple of things happened in open source, we started seeing more and more velocity towards Spark on Kubernetes features that were missing as compared to YARN model like dynamic location, shuffle service, all of these proposals are out and yet we were quite convinced that we have elegant solutions for this by end of the year, or like early 2021. So we definitely felt like a good idea to invest in Spark on Kubernetes. Also, another thing happened, which is internal to Lyft, but basically we moved from the third party service to running our Hadoop clusters natively on Kubernetes. And we ended up moving all of our high end workloads to the YARN cluster that we run natively.
For these young clusters, and based on our requirements, we also did a lot of feature work to support auto-scaling based on resource manager load. So we had a solid YARN infrastructure ready and we had a couple of options for the Spark on YARN workloads. Should we move these to automated YARN infrastructure? Or should we move this to Spark on Kubernetes? Looking at our experience based on Spark and Kubernetes, how it worked well, and how the community was moving, we decided, “Let’s move this over, all the Spark workloads and let’s centralize on Spark and Kubernetes. As we migrated and as we started on building more and more workloads, we started in hitting numerous limits and challenges.
Initially, I spoke about an example use case around event ingestion. And this is the biggest workloads that we have. It is extremely spiky, it runs thousands of jobs every day. And this model doesn’t really work well in Kubernetes, when you have sudden Spark can require a lot of ports or containers to come up. And some of these could be extremely short-lived. So we had to start thinking of certain Band Aid solutions to make it work. We had to group jobs, so that there are less concurrent jobs running on these clusters.
Also, we saw issues with certain cases where if a user would run for a particular user, or a particular team who run a lot of jobs, how it could impact other jobs by adding a lot of load on the control plane. And we ended up adding new Kubernetes clusters. So we kind of went from multiple YARN clusters to Kubernetes clusters, not a one-to-one mapping. But still, we were again, in a model where we were running multiple Kubernetes clusters.
To add a bit more color to the scale for Spark today at Lyft, we are in the order of thousands of concurrent jobs. We have our workload spread across five Kubernetes clusters. At peak, we see around 20,000 executors. We are in one AWS region, but we use five zones within the region.
So now let me talk to you about some of the major challenges with the K8s model. And before I start, I wanted to add an external note that some of the challenges are specific to Lyft infrastructure. And some of those are specific to also just Vanilla Kubernetes. And these could be important considerations for your organization. Also, in terms of where the Kubernetes support is. And because there’s always a thought of whether it’s suitable for stateful workloads versus stateless. And you might need to optimize for different things. So this is just an extra mode, while we go over this challenges, and why we described the solution, what do we think works well for us?
So let me first cover the IP address shortage problem. So we are aware that Kubernetes, in its recent versions, has now started supporting Wellstar and IPv6. Lyft internally runs its own Kubernetes clusters and doesn’t use a managed service. Out of the box, it runs with a Kubernetes version that supports IPv4.
We constantly run into challenges around IP availability. And what the impact of this is basically drivers and executed delay during startup. If you see on the right is a graph, which shows the number of containers that run into this scenario, where they basically have to retry to get an IP. And if we look at how frequently this happens, we see order of thousands of ports that turn into the switch, could either be a driver or an executed port. So this is the first challenge that adds to the delay.
The next is around IAM. So in this new world where everything is running on Kubernetes, we were able to support per job IAM role. But what this meant is there was obviously an explosion of IAM roles. And also, this IAM role assignment obviously comes with a cost. And this cost is not only on the Kubernetes control plane, but we also add extensive load against the AWS server. And we would see throttling, and we would see some stability issues. So this is another reason where we will see delays. And in fact, we had to come up with a custom solution where our ports would have a startup container which waits on IAM assignment. So this again added to the delay.
The third point is around infrastructure issues. So this goes back to the discussion around stateful and stateless workloads. So because Lyft hand flow was mostly tuned towards stateless workloads, there was some considerations that needed to be changed when we were developing stateful workloads on these clusters. For example, using operators on this cluster does add a lot of load on achieving the impact of a bad Kubernetes node on a stateless workload is very different from a stateful workload. And we had to handle it with retries. And how do we better triage these issues while they happen?
So moving on, the next set of challenges was around image management. And this is a challenge which might not be obvious when you actually adopt Spark on Kubernetes. But as you start scaling our workloads, it is a very important consideration to have.
So the model for us, along with slide and with Lyft and for what we had is basically we had per project images. And when I say project, it’s either a customer or a use case or a team. These images get registered for different environments. What this means is that we supported images for a dev environment, staging production, so you can have different code in different environments.
And then, again, this was an explosion of the number of images. And with this came additional deliveries. How does this work? So Kubernetes, does cache images. But if your job starts running on a new node, the image is not going to be available. So now you have to wait for the image to be available. Also, these images contain code dependencies park. So it’s not like the really small image. And the other factor is also, if there is a new release of the image, a new shaft that comes out. So now that is not going to be available. So this does, again, play a big factor in the overall delay that we see for job runtimes.
On the right, we have the P95 delay per project. And what we would see is that in certain cases, we would see totally addition of execution time by more than an hour just because of delays with images, IPs, IAM roles. And this, of course, leads to wasted resources.
Another challenge is around the release module. So some of this is specific to Lyft, but could be, again, relevant to how you do image management. For us, we have infra managed base images. And an example of this, even not going to image with like security batches. On top of this, the data platform team managers and Spark image, the Spark image comes with the base image plus Spark code base, plus internal tooling, which we’re going to discuss later. And then comes customer images, which depend on our stock images, which contain their code, their custom dependencies, et cetera.
So within Lyft, we have an expiry for images of 30 days. And what this means is that there is a possibility that some of your changes don’t actually land in production up to 30 days. So this really would slow down infrastructure team, in terms of how fast can be land our changes in production. Of course, you could work with the customers to see, but that is a lot of hand holding. And this goes not only to the internal tools that we build, and make it part of the images, but also changes to this barcode reach.
So this is another challenge that comes with the image model. Obviously, using images, has its own pros, it’s repeatable. But also, this does become an additional overhead of managing and reusing images for them.
Some more challenges that we saw. So the first one is around creative scheduling. We are aware that there are certain projects like Qualcomm that are trying to address this and we have not explored them enough. The way we approach this problem along again working with the flight team is basically we have one namespace per project. And then we have a quota namespace. So as soon as you hit the quota, the job start retrying, getting delayed. Also, these quotas are started. It’s not like if a particular namespace is underutilized, the other could kind of do it. So that’s another issue in terms of overall usage and user experience.
The next point is around priority. It is very well possible that there are a lot of small jobs, blocking a high priority job, or jobs within the same niche business competing with each other, and not completing listing, producing content. So these are issues around K8s scheduling. If you see on the right, we have a graph for Spark Submit Container Failures. And this is basically an indication of the number of containers that fail during submission because of quota limits.
Moving on control plane limits. So this is mostly around like, let’s say, the limits around how much the Kubernetes cluster itself can scale in terms of containers, for example. So in our current setup, we see that we hit the ceiling. And obviously already and start seeing quite a lot of delays when we get about 15,000 containers. What this means is that you have to really think through the scale of your Kubernetes cluster, and then decide how many Kubernetes clusters you’re going to need for your jobs.
Also, like we mentioned before, shortening containers is not very typical of a Kubernetes workload. A Kubernetes pod is quite expensive as compared to a JVM process. A Kubernetes pod has basically, a lot of overheads in terms of IP, IAM, things like that. So that is again, an important consideration.
And finally, a major factor that actually made us question a lot of this was around Hive depreciation. So at Lyft Hive is still majority of our compute, and we are planning to consolidate on technologies like Spark, and Presto. And today, at scale, what we see is about 5000 jobs a week for Hive. So if we would move these workloads over to Sparks SQL, what we are expecting is about a 6X increase in our overall workload. And based on the challenges that we mentioned earlier, we don’t see the Kubernetes model, basically working for us at this scale.
Another factor is around the way we have designed this model using Spark Operator and how does it not basically support interactive usage? We are actively looking into projects like Spark server to make this possible, but this is also an additional consideration of how the startup time for a job is very different in the Kubernetes model where you are impacted by so many different continents.
So this was mostly all the major challenges that we see with the creators model. Now, I would like to hand it over to my colleague, Catalin, to walk you over The Hybrid Approach with YARN and Kubernetes.

Catalin Toda: Hello, my name is Catalin Toda, and I’m a software engineer in Lyft working on Spark and Hive infrastructure. Previously, I worked as a production engineer in Facebook supporting the Spark workloads. And before that, in Hortonworks.
I will start discussing how we come up with a hybrid approach, and then describe the implementation details. Finally, I will present our progress so far. And the key takeaways.
When I say hybrid approach, I’m referring to using multiple resource managers, such as YARN or Kubernetes, at the same time, and routing dynamically between them. What we noticed in our Kubernetes approach was that we were paying a high cost of starting the executors for all our workload. The executors would come up late, the jobs would be delayed, customers not having visibility into why that happens.
The first thought was, how do we stop paying the high cost of bringing executors up for the workload that does not have special requirements. Before figuring that out, we had to separate our workload by workload type. In the past, we kept assuming all our workload is either non-containerized in 2018, and 2019, or all of it is containerized in 2020. In fact, when we analyze the current and future workload, we saw that how the vast majority is non-containerized. Using a containerized approach when we saw scale issues, determiners to think about the hybrid approach of running non-containerized workloads in YARN.
The non-containerized workload definition for us is SQL, Scala, and Pyspark using standard list of dependencies, such as by arrow or pandas. Pyspark with complex dependencies or machine learning workload fits very well the containerized workload, which will continue to run in Kubernetes. This ensures that the scale for Kubernetes workload is small enough to not cause issues to our infrastructure.
The other reason for using the hybrid model was to not disrupt the existing Kubernetes workload nor to engage in costly migration that would lead to customer dissatisfaction in the short-term. The cost of the solution was small considering we already had Hadoop deployment for YARN.
Now, let’s review the architecture of the hybrid model. Previously, flight was our only entry point. The existing non-containerized workload would keep using the same path as before. Flight would submit the new CRD to the Spark Operator, which would end up creating the Spark driver pod, which uses Kubernetes control plane to create the executor pods.
For the containerized workload, we have added a new gateway as seen in the diagram. The gateway would be able to submit and watch the execution of the Spark CRD in the same way that flight does. To accommodate the YARN execution, we chose to keep that Spark driver still running as a Kubernetes pod. However, unlike the Kubernetes model, where each customer has their own namespace, in this YARN mode, all the drivers share the same namespace. The driver would still have memory and CPU limits as before, preventing noisy neighbor problems.
In Spark pod, before the execution of the driver, the master will be changed from Kubernetes to YARN, and the Spark configuration will be amended for the execution environment to be as the application expects. The executors to be allocated in YARN, the shuffle service would be used and the application would complete successfully. The actual rewriting of the Spark configuration would be performed by another tool, which I’ll describe later in the slides.
During the execution of the entire workflow, the Spark Operator will still be able to monitor the status of the Spark driver, as it used to be before and report it back to flight or to the gateway. This workflow will not be applied at the runtime for the containerized workload running in Kubernetes.
In the next slide, I will discuss the advantages of the hybrid model. While the containerized workload, we do not see a big difference when running in Kubernetes. Most of the improvements are specific to non-containerized workload with executors run in YARN. The benefits are fast executor startup time. Creating a new process inside of the node manager is very fast. We do not need an IP address nor images to be downloaded on the local node. The YARN distributed cash helps with distributing the YARN jars and Pyspark modules very efficiently.
Using fair scheduler in YARN is better than the default namespace quota resource management. For example, in the Kubernetes model, if the namespace is over quota, not even the drivers are admitted. This leads to poor visibility over a job that keeps retrying to get the driver report over a long period of time.
In hybrid solution, all the queuing happens in YARN and the resource allocation model is much cleaner. If the query is wrong, for example, we know the answer right away rather than waiting for the minutes to get a driver report. The Python workload without dependencies would also enjoy the same benefits highlighted above. In addition to that, the iteration speed was improved. While for containerized workload, a Docker image has to be created on each code change. Using YARN, the Delta can be packaged automatically and the job can be submitted right away.
There has been a lot of discussion about the shuffle service and using shuffle managers. But on the current Kubernetes implementation, lost shuffle files cost stages to be retried, which adds up to the wasted compute. We feel using an external shuffle service is still an upgrade compared with the executor serving shuffle files. However, we appreciate the effort of the open source community that has been done in Spark three one to improve this aspect.
Moving on. Now, let’s talk about how each of the component used to make the hybrid model work. We have talked a bit about the Spark Operator before. We have been using the multi-version branch from the beginning. With a driver running in cluster mode, we noticed that the overhead of this pod constitutes about five to 10% to the resources wasted in the cluster.
Additionally, the job latency is increased because of the impact of non cash images, and that is much higher. The image will have to be downloaded once for the driver in cluster mode. Then once for the driver in client mode. To reduce the job latency and job wastage, we have invested in running the driver directly in client mode for both YARN and Kubernetes execution plans. This change will be contributed back to the multi-version branch. Keeping the Spark Operator otherwise unchanged, allowed keeping the compatibility with flight and with other open source tools. In this way, we can upgrade the Spark Operator seamlessly based on open source changes. As the Spark Operator does not control the final resource manager, we can now switch between Kubernetes and YARN at runtime.
In this slide, I will talk about a new tool that we have created to assist the routing between resource managers and Spark configuration management. I will describe the design of the new tool and its functionality. If you remember in our previous slides, we were talking about containerized workloads, having their own image and releasing at their own pace. This could lead up to one month release time for an image. If for example, we push the bad change to the Spark wrapper. This will force us to deal with this change for up to a month while this change is still in production.
Of course, the other way would be to reach out to all the customers and have them upgrade to the latest image which would lead to poor customer satisfaction and wasted efforts from our team. The decision to separate the deployments of this tool from the image itself was made. As a result, we came up with a two-stage design, where the first stage would execute the second stage at runtime. The stage one would be part of the base image and deployed at the customer base. It would be invoked after the Spark Summit, and before the JVM process running the driver.
The functionality of stage one would be simple and remain unchanged over time. Figure out the location of the stage two in S3, download the stage two from S3, run and lock the zip code of the stage two. The stage two would contain most of the functionality that our team would want to incorporate into it. The Spark wrapper would be released when needed and ensure it is tested correctly. The main functionality would be related to figuring out the Spark configuration for the specific job, running the Spark driver, capturing logs and metrics and sending them back to Lyft infrastructure for dashboards, alerts and so on. It integrates very well with our in-house tools and services.
More in detail, few of the things we found useful being managed by the Spark wrapper, manipulation of configuration for environments. For example, the staging environment would have different configuration than the production environment. Dynamic management of configuration. While some of those settings can be static, and defined in configuration repository, we found much bigger flexibility controlling them at runtime. This makes also maintaining and changing the configuration much easier. The downside is that those configuration option need to be well tested before releasing them using integration tests. As discussed before, changing between resource managers, or roll out Spark conflicts incremental, capture logs of the driver and metrics and send them back to Lyft infrastructure. Push application metrics to the data warehouse. The benefits of this tool are twofold, running alongside the driver in the same container and managing the configuration dynamically and keeping the Spark Operator very clean from Lyft specific logic.
Moving on, and let’s talk about image hierarchy and distribution. As highlighted before, image management requires very careful consideration. This is one of the main things that your organization needs to evaluate. Few criteria that we think are relevant. Is the Spark code released often? How the releases are reverted. How does the infra test their images, and how the customer test their images.
Our base image contains the Spark binary in-house matrix manipulation and the Spark wrapper. The deployment of the base image would ensure that all the workflows and scenarios in which the customer are using Spark are well tested. The aim is to have a base image for each minor version that is being used to make sure that switching between versions is tested well by our customers.
The containerized workload or Kubernetes native, will extend the image and add their own code and files to it. That image will be tested also by the customer. And going through all the deployment steps until reaching production. The non-containerized workload will use the base image created by our team. This image is being maintained and updated by us while the customer maintains only their code. Using multiple versions of this image allows us to carry a new Spark version just to a subset of customers. This allows the non-containerized workload to get much faster Spark code updates and helps us battle test it before reaching the containerized workloads. In this way, the chances of reverting the code is smaller.
We strongly suggest for your organization to figure out what the image three will look like. And how do you plan to propagate image updates. Again, a large number of images will slow both the data infrastructure and the customers maintaining it and cause caching issues on the Kubernetes cluster.
Next, I will describe our progress into implementing the hybrid solution. To recap a bit the non-containerized workload that runs in YARN, we said that where the driver pod runs in Kubernetes. Sharing the namespace with all other drivers. Only one image is used. We are piloting this workload to a different Kubernetes cluster, then the containerized workload. In normal circumstances, the driver report started in less than one second, since the pod is created by the Spark Operator. Occasionally, the image is being downloaded from the Container Registry, but the image size is one. We use the YARN fair scheduler, and customer base cues. The resource management happens in YARN and is transparent from Spark infrastructure point of view.
Running a large subset of jobs in YARN led to a reduction in Spark scaling Kubernetes caused by the fact that the executors are no longer running there. Requesting IP addresses for drivers no longer causes ports to wait or fail. For us, in aggregate every job allocated 20 executors in Kubernetes native mode with dynamic allocation enabled. We do not have yet good production data to see what concurrency we can expect in YARN with the same settings. To overcome the IAM role issues, We switch all our use cases to use AWS web identity provider. For YARN because the driver is the Kubernetes pod, we can use the web identity hook as it is. The improvements that were done, such as Spark Operator using client mode, or a web identity provider benefited both the containerized and non-containerized workloads.
Using the same path for both Kubernetes and YARN allows us to not engage in migration for containerized workloads. We looked into Docker on the admin as well. But the thought of another migration prevented us from moving our workloads there. We may reevaluate our decision as part of the next iteration.
A utility was also created to aid customers to share best practices across different use cases. From the infra side, we used the library to ensure that all workloads have a similar experience when it comes to all Python dependencies. As mentioned before, this is important as three environments running Spark jobs, the Jupyter environment, containerized and non-containerized workloads.
Future plans. The focus of the data infrastructure team in Lyft in the next period is Hive deprecation in favor of Spark and Presto. This will bring several scaling challenges, but we are confident that the migration will benefit greatly from the Spark ecosystem. Furthermore, we are looking into evaluating data lake technologies such as data lake and Eisbaer.
Lastly, we expect large organic growth in the next period for Spark. We will evaluate how our infrastructure will fare considering those challenges. We are confident that our current hybrid solution will prevent scaling challenges until the next iteration.
Conclusion. The key takeaways of our journey are workload type analyzes. Your organization needs to evaluate what is the dominant Spark use case. For us, SQL, Java and Pyspark with our dependencies are the dominant part of our workload. If this is true for your organization as well, we strongly suggest looking into a YARN-based solution. If you already have a YARN environment, we believe that this is the best way at this point to not run into Kubernetes scaling issues. If your workload is mainly ML or workload with complex set of dependencies, then Kubernetes execution mode makes the most sense.
Docker images are probably the best way to abstract the customer requirements away from the infra. In this case, we strongly suggest evaluating flight as an orchestration solution, as its Kubernetes native. In Lyft, having an already existing YARN infrastructure made the hybrid model an easy choice, that may not be the case for your organization. If you think about going ahead with a Kubernetes model, have a plan for the issues we are seeing currently.
IP address challenges, we do believe IPv6 will solve this issue. While there are a few solution based on IPv4, it all depends on the scale your organization runs Spark at.
Control plane challenges. Ultimately, short running containers is not really a good model for Kubernetes due to the large overhead on creating containers. IAM images, IAM roles and so on. This can lead to boottstrapping new nodes that will be released shortly after increasing the infrastructure costs. Combined with a maximum limit of objects of around 15k that we’ll be using safely in production. The only solution is to have a gateway that is able to distribute the workload across multiple Kubernetes clusters. As the Spark workload can be long-running, the routing would need careful consideration.
Image management. Ultimately, reducing the number of images is going to go a long way into getting the drivers and executors start faster. An additional issue encountered by us is on the Spark version upgrades. The number of customer images required to be updated is high and time-consuming. The overhead for the users maintaining the images is also high. As mentioned earlier, this is one of the most important things to consider when designing your next generation Spark infrastructure.
Quota. While this mechanism can be used for production runs, and the model is different than the YARN fair scheduler, we recommend looking into additional solution that can solve this problem in Kubernetes, such as Volcano. We had great success with AWS web identity provider for our workloads. We highly recommend looking into it for your workloads as well. For YARN, the driver report running in Kubernetes allowed us to use it out of the box.
If you are passionate about those topics like we are, please reach out to using the email addresses below. Thank you for watching this presentation. Now, we are ready to take your questions. Thank you.

Catalin Toda

Catalin Toda is a Software engineering working on the Data Platform team at Lyft. Catalin has been working on defining the Spark infrastructure Lyft uses to support an increasing number of jobs rangin...
Read more

Rohit Menon

Rohit Menon is a Software Engineer on the Data Platform team at Lyft. Rohit's primary area of focus is building and scaling out the Spark and Hive Infrastructure for ETL and Machine learning use cases...
Read more