Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator


Using a live coding demonstration, attendees will learn how to deploy Scala Spark jobs onto any Kubernetes environment using Helm, and how to make their deployments more scalable with less need for custom configuration, resulting in boilerplate-free, highly flexible and stress-free deployments.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi and welcome, my name is Tom Lous, I’m a freelance data engineer, currently contracted at Shell in Rotterdam, the Netherlands. And I’ll be talking to you about deploying Apache Spark jobs on Kubernetes with Helm and the SparkOperator. It’s quite a mouthful, but I’ll explain to you how I stopped worrying about deployments and increased my productivity. So why do we even want to run these Spark jobs on Kubernetes in the first place? There are good alternatives already baked into the Spark ecosystem: the most common one is running it on a Hadoop cluster with YARN, and you can also run it standalone, or even use one of the cloud solutions like Dataproc or Amazon EMR. Why do we even want to run it on Kubernetes?

It actually seems like a pretty bad idea to begin with, right? Because it’s not a ready-to-deploy platform: you have to develop a lot of scripting, a lot of configuration, additional modules, you need image registries, operators, and there’s a lot more DevOps involved than just running your Spark jobs on a normal cluster.

So before I answer that question, let’s take a step back. As a data engineer, I’m really focused on building data-driven solutions using the Spark ecosystem. My main focus is building these sometimes simple ETL jobs and sometimes more complex machine learning jobs, but when I have created that final artifact, that piece of software, I really don’t care a lot about where I run it. I just want it to run predictably and in a scheduled fashion. And if I want to look at the results or the logs, I just want to be able to do it. So I don’t really care about the ecosystem that much.

But there is a problem with that, because there is a challenge in all these ecosystems: you have to be aware of what Spark version is currently available on the system. Do we run Spark 2.4 or 2.4.5, or even 3.0, or maybe something old like 1.6? These are all things you have to take into account. Also, which other libraries, which Hadoop version, and can we run Scala, can we run Python? And even if you have decided on the base platform versioning, this will change continuously, because there’s a team upgrading and updating its platforms to newer versions continuously. Also, you have to take into account how you want to run the Spark jobs. Because just to run something on Hadoop, you maybe need some bash script to run the Spark job, and you have to inject that with secrets and keys and locations, and where do you even store the JAR file? All these pieces reduce your very stable and very finely crafted piece of software into a big pile of technical debt. So is there a solution?

Is there a solution for this? Well, yes, there are ready-to-go platforms, and I think Databricks Delta is one of those solutions we could use, and there are other solutions you could use out of the box that have all these components nicely packed and all the schedulers for you. But wouldn’t it be nice if there was a way to just bundle all your dependencies and configuration into one system and deploy it on the go? Well, yes, of course: it’s Kubernetes. In quote-unquote ordinary software development it’s already widely used, and it’s a very good way to pack all your dependencies into small images and deploy them on your Kubernetes cluster. So are we data engineers just late to the party? Yes and no. It’s one thing to just deploy some microservices, but a whole other thing to deploy a distributed processing engine on Kubernetes. There’s a lot of configuration involved and a lot of things you have to take into account, like: how do I deploy my executors, how do I communicate with my executors, how big are they actually gonna be, how many CPUs, how much memory? So there’s a lot you have to configure to make this work. And instead of explaining how you can do this, I want to show you this in a demo. So to deploy Spark jobs in the demo, we’re gonna need a Kubernetes cluster, and for this purpose I’m gonna use Minikube. It’s just an ordinary Kubernetes cluster, and the nice thing is it’s running on your local machine, so it’s easy to try this out yourself. Now we’re actually gonna create the Kubernetes cluster. We use the minikube start command with the kubeadm bootstrapper, and we give it a bit more CPU and memory than the defaults, because we’re actually gonna run a Spark job on this Kubernetes cluster
and we want to give the JVMs some room. The other interesting part is the insecure-registry option, which actually allows us to push and pull images from the Minikube registry for use in the Kubernetes cluster. The only thing we still have to do is enable this,

so when it’s done, we’ll enable the registry addon,

and yep, here we go. We enabled the registry, and we can actually check if everything is up and running; we see our Kubernetes cluster is up and we’re ready to go.
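The exact flags aren’t spelled out on screen, but a minimal sketch of the Minikube setup described above might look like this (the CPU/memory values and registry address are assumptions, not taken from the talk):

```shell
# Start a local cluster with extra resources for the Spark JVMs
# (4 CPUs / 8 GB are assumptions; pick values that fit your machine)
minikube start \
  --bootstrapper=kubeadm \
  --cpus=4 \
  --memory=8g \
  --insecure-registry="localhost:5000"

# Enable the in-cluster image registry addon
minikube addons enable registry

# Verify everything is up and running
kubectl get pods --all-namespaces
```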

So now we’ve seen how to set up our basic Kubernetes cluster, and now we actually want to build a data solution. So I’m gonna show you how to build a basic Spark solution. It’s not the interesting part of this talk at all, but it will be running on the Kubernetes cluster in a minute. Now we come to the least interesting part of this presentation: the application itself. We just want an application that is doing some busy work, so we can actually see the Spark cluster in action. For this I’m using the MovieLens 20-million-record dataset; it has movies like Toy Story and Jumanji, and ratings for each of the movies. What we’re actually gonna do in this BasicSparkJob is create a SparkSession, define an inputPath to read the movie files from and an outputPath for the target parquet file, where we’re gonna generate the average ratings. We read the movie dataset from movies.csv, we read the ratings dataset from ratings.csv, and then we do a join with a broadcast, a groupBy, an aggregation and some repartitioning, to really make Spark work on this tiny cluster. At the end we just do a count and write all the data into a parquet file, where we get the average rating for each movie.

So to actually run this job, we need to define, or build, a build.sbt. We’re gonna run it on sparkVersion 2.4.5, and we don’t need a lot of dependencies; the only dependencies are spark-core and spark-sql,

and as you see, we say provided, because we’re not gonna bundle all the Spark libraries in this project; we’re actually gonna use an external base image where we put them.
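The dependency setup just described can be sketched like this (the project name and Scala version below are assumptions; only the sparkVersion and the two provided dependencies come from the talk):

```shell
# Sketch of the build.sbt described above, written out via a heredoc
cat > build.sbt <<'EOF'
name := "basic-spark-job"
version := "0.1.0"
scalaVersion := "2.12.11"

val sparkVersion = "2.4.5"

// "provided": Spark itself comes from the base image, not the fat JAR
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
EOF
```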

So going into the demo immediately: I decided to put the base image also in this repository, but normally you would store it outside.

We have this Dockerfile, and just to speed up the process, we’re gonna immediately start building this Docker image, because it will take some time, and I’ll go over it with you.

So for this Docker image, we’re gonna use the base image from the SparkOperator project. You can use any image, but the nice thing about the SparkOperator images is that they come with some scripts that make it easy for the SparkOperator to deploy your Spark job, so that’s a good base to start from. But I actually want to use not the Spark Scala 2.11 libraries, but the Spark Scala 2.12 libraries. And for example, I wanted to use Hadoop version 3.2 instead of the bundled Hadoop 2.7. So this gives you the flexibility to install and configure all the dependencies you want in your base image, and you can reuse this base image for other jobs as well.

Another interesting part is that you can also configure your dependencies out of the box. There is this little bug currently in the Kubernetes client for which you have to install these additional JARs as patches, but in a future version you can just remove this, and your older jobs keep running on the older version of this Docker image while your newer jobs can run the newer versions. And the entrypoint script is kept as-is, which can be used by the SparkOperator, but we’ll get to that.
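A rough sketch of the kind of base-image Dockerfile described above (the tag, URL and file layout are assumptions, not the exact file from the talk):

```shell
cat > Dockerfile <<'EOF'
# Base image from the spark-on-k8s-operator project: ships the entrypoint
# scripts the SparkOperator relies on (tag is an assumption)
FROM gcr.io/spark-operator/spark:v2.4.5

# Example of swapping in your own Spark build (Scala 2.12 / Hadoop 3.2):
# download a distribution and replace the bundled jars.
# The ADD line below is illustrative only.
# ADD https://archive.apache.org/dist/spark/... /opt/spark

# Patch jars for the Kubernetes client bug mentioned in the talk
# COPY patches/*.jar /opt/spark/jars/

# Keep the operator-compatible entrypoint untouched
ENTRYPOINT ["/opt/entrypoint.sh"]
EOF
```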

As you can see in the meantime, we actually… The script I called is nothing fancy, it just prints some information about building this Docker image,

which is happening in the background, and at the end it gets pushed to the registry, which in our case is the Minikube registry. So at the end of this we should see that our spark-runner image, the 0.1 version, is now available in the Kubernetes registry. So that’s great: we have our base image, we have our application, and now we just have to build our application and put it into the base image. And you can actually do that automatically, because you can use the Docker plugin for SBT,

it’s sbt-docker. And when you have that, you can configure your entire Docker image how you want: you can say what the name should be and other information like that, but the most important part is defining your Dockerfile. In our case it’s pretty straightforward, because we’re just gonna use this spark-runner base image and we add our artifact. And what is our artifact? It’s actually the fat JAR that we build from the project. So in our case there are not gonna be a lot of extra libraries, because most are marked provided and there are some others, but you could, for instance,

add Postgres libraries into this fat JAR, or some other third-party libraries that are not gonna be part of your base image. So running this, I’m gonna also

create a Spark image, and I’ll show you what’s happening in this script. It actually does nothing more than calling sbt docker, but it passes in the image registry information from Minikube, so it pushes to the right place, and it tags based on the version that’s available in SBT and pushes it immediately. Normally you would see these scripts as part of your CI/CD pipeline, but for now we’re gonna run this from this small bash script against Minikube. You see it’s pretty fast: the compilation happens pretty fast, and now it’s pushing, and you can see that our image is now also available in the Kubernetes registry.
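The wrapper script described here reduces to a couple of lines (the docker-env step and script shape are assumptions; `dockerBuildAndPush` is the sbt-docker task that builds the image and pushes it in one go):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Point the local docker CLI at Minikube's Docker daemon so the
# image lands where the cluster can pull it
eval "$(minikube docker-env)"

# Build the fat JAR, assemble the Docker image defined in build.sbt,
# and push it to the registry in one step
sbt dockerBuildAndPush
```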

Now this is the intro to the last piece of the demo. For us to actually deploy these images on the Kubernetes cluster, we need a way to deploy them as workloads, and the easiest way is adding them as just another workload in Kubernetes; and for that, people have created the SparkOperator. So we’re gonna install the SparkOperator right now, but before we can do that, we actually have to create some namespaces. The first one we’re gonna create is the spark-operator namespace, where the SparkOperator itself will live, and the other one is gonna be spark-apps,

where we can actually deploy our Spark workloads. The reason we keep these separated is because we’re gonna give the SparkOperator some elevated privileges to create and destroy pods in the spark-apps namespace. Technically it’s not necessary, but it’s best practice, I would say. So we’re gonna create the ServiceAccount, called spark, so it’s not very original, and we’re gonna give this service account some elevated privileges

using a ClusterRole, and we call this one the spark-operator-role. We give it edit privileges in the namespace, from the spark-operator. So the roles have been created. Our next step is actually to install the SparkOperator in the spark-operator namespace. For this we need the incubator repo, because it’s not yet released as

part of the mainstream charts. So we update to get the latest version, or we already had them apparently, and now we can actually install

the SparkOperator. We give it the name spark, again not very original. But we use skip-crds; I think there was a bug in this version, I don’t know if it’s still present, but this way it still works. We install it in the spark-operator namespace and we enable webhooks, so we can connect with the SparkOperator from outside. And we make sure that the SparkOperator will deploy all its applications in the spark-apps namespace. The log level is just for debug purposes.
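Taken together, the namespace, service-account and operator-install steps described above might look roughly like this (resource names and chart values are approximations of what is shown in the demo, and the incubator repo URL reflects the 2020-era Helm charts; verify the value names against the chart you install):

```shell
# Namespaces: one for the operator itself, one for the Spark workloads
kubectl create namespace spark-operator
kubectl create namespace spark-apps

# Service account the Spark driver pods will use
kubectl create serviceaccount spark --namespace spark-apps

# Elevated privileges so pods can be created/destroyed in spark-apps
kubectl create rolebinding spark-operator-role \
  --clusterrole=edit \
  --serviceaccount=spark-apps:spark \
  --namespace=spark-apps

# Install the operator from the incubator repo
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com
helm repo update
helm install spark incubator/sparkoperator \
  --namespace spark-operator \
  --set sparkJobNamespace=spark-apps \
  --set enableWebhook=true \
  --set logLevel=2 \
  --skip-crds
```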

We should be seeing the SparkOperator come up,

and it’s already done. You see the webhook init job for the SparkOperator has completed, and the SparkOperator is now up and running. So now we switch to actually building the Spark application deployment; for this we need Helm charts. Well, that’s the easiest way, at least. The most important thing is the SparkApplication template that you want to deploy. The moment you deploy this in Kubernetes, the SparkOperator will trigger and deploy the cluster based on the specification you provide here. As you can see, there’s a lot of conditional logic here, and the reason is that we keep this template as generic as possible and fill the required fields with the information that is present in the Chart and values files that are combined into one Helm chart.

So going back, you can see we have good defaults, for instance, but if you specify pullPolicies or pullSecrets, or even a mainClass or application file, it will get picked up and rendered into the template, and the SparkOperator recognizes the spec and uses it to deploy the cluster. Also important is, for the driver, how many cores does it have, how much memory, and the same for the executors, and of course which image is gonna be used. A lot of the information for this comes from two different files, actually in our case three different files. You start with the Chart, and the Chart has some meta information, but we just give the name, we give the version, some description. And the values file gives you the more detailed configuration: which Spark version to use, which image to use, where the JAR is located in the base image, and which main class should be run. But as you can see, a lot of this information already exists within our project, because these are all configuration values we define in our build with SBT. And if you noticed, you see that these files are actually generated by the build, by SBT. So if we go into this one, you see I created this small function, and it actually does nothing more than creating these two files, as you can see, based on the information that’s already present. And the advantage is, because we call this function every time we build a Docker image, it will render the correct Chart and the correct values based on the current image. So you don’t have to keep track of updating the same version in both your chart and your SBT, and the main class name is still the correct one, and the JAR is in the right location, because these are actually coming from SBT, which generates them. So this is a pretty big advantage, but the only thing we haven’t defined in these generated files is how to run it:
how many CPUs, how much memory. And this is the interesting part: we actually created a values-minikube file where, for this specific environment, we can configure this. You can see we define our image registry as localhost:5000, which is our Minikube registry; we use the spark ServiceAccount, which we just created; and we point to some volumes that we still have to mount, but these are very specific to my Minikube environment. Also not a lot of cores or memory for the Spark job, because it’s just a small cluster. But you can imagine that for a production environment or cloud environment, you would increase these values, have the arguments come from some Kubernetes Secrets, and have an outside image registry. So you can, in theory, define these files for each environment and have the CI/CD pipeline bundle the correct environment with the right values and have your chart repository charts ready.
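A sketch of what such an environment-specific values file might contain (all keys and values below are assumptions modeled on the talk, not the actual file from the repository):

```yaml
# values-minikube.yaml (illustrative only)
imageRegistry: "localhost:5000"
image: "transform-movie-ratings:0.1.0"

serviceAccount: spark

driver:
  cores: 1
  memory: "1g"
executor:
  instances: 2
  cores: 1
  memory: "1g"

volumes:
  - name: input-data
    hostPath: /input-data
  - name: output-data
    hostPath: /output-data
```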

So this is pretty cool: these values are generated, these values were manually entered, and now we have this Helm chart. But how do we create a deployment out of this?

We could do helm package and store it somewhere, but actually you want to have a Helm repository, a Helm registry so to say, that you can push this to. Some image registries offer these out of the box; I know for a fact that Azure ACR actually has the ability to store normal images, but also Helm charts. Unfortunately the image registry in Minikube doesn’t, so we actually need to run a basic Helm ChartMuseum. And that’s pretty cool, because that’s actually a normal Docker image that you can just run. And it actually has some API endpoints to retrieve your charts

and some API endpoints to push your charts. So it’s fairly straightforward: I just have to make sure that we are using the Minikube environment, and we can just do docker run,

and this will download this version of ChartMuseum, I think this is the latest, and expose ChartMuseum on port 8080. And I’m storing it locally, but you should do something more permanent when you actually deploy something like this, or host it outside the cluster. And we can actually see that

at the moment we have this chart repository running with no entries.
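Running ChartMuseum locally is a single docker run (the flags below are a sketch with local storage; for anything real, back it with persistent storage):

```shell
# Start ChartMuseum with local storage, exposed on port 8080
docker run -d --name chartmuseum -p 8080:8080 \
  -e STORAGE=local -e STORAGE_LOCAL_ROOTDIR=/charts \
  -v "$(pwd)/charts:/charts" \
  chartmuseum/chartmuseum:latest

# List the charts it knows about (empty at first)
curl http://localhost:8080/api/charts
```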

So the final step has come: we have dockerized our image, and our next step in the CI/CD pipeline would be to bundle this.

So we’re gonna publish the chart and I’ll show you what’s happening in the background.

That’s pretty quick.

So if we run this command again, you see the chart has been uploaded. What actually happens is, as we said before, we combine the Minikube values with the base values. There is no good way to do this using Helm commands at the moment, so this is some makeshift code to make it happen, but in the end it’s nothing more than adding this ChartMuseum repo

and pushing the combined chart for Minikube to the ChartMuseum. And you can imagine, if you run this in a CI/CD pipeline, you would, based on the environment, use different values and push them to the correct ChartMuseum.
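Under the hood, publishing to ChartMuseum is just a POST of the packaged chart. The chart name and version below are assumptions, and the environment-values merge the talk’s script performs first is omitted here:

```shell
# Package the chart directory into a versioned .tgz
helm package transform-movie-ratings

# ChartMuseum accepts packaged charts on its API
curl --data-binary "@transform-movie-ratings-0.1.0.tgz" \
  http://localhost:8080/api/charts
```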

We can actually check Helm now; I think the ChartMuseum should be part of it right now. So if we do a helm repo update,

we should have all the… We should be able to deploy.

So now we have all components in the Kubernetes cluster: we have images in the image registry, we have a chart in the ChartMuseum, and now we just want to deploy our application, and this is what I’m gonna show you next. But before we deploy, we have to do one more thing. As you might remember, we have these two mount points, input-data and output-data, that are not pointing to anything right now. What would be useful is to use the minikube mount command to point the local dataset, the ml-25m directory, to input-data, and I’ll use the opportunity to keep it active in the background, and another minikube mount

to some local parquet directory for the output data. It should be empty right now, but in the end there should be some data present. So the Helm chart is updated, the images are updated, so the only thing we have to do is install this Helm chart. I’m gonna use the upgrade command, because it allows me to run this command continuously every time I have a new version. We call it movie-transform, I’m gonna use the latest chart of transform-movie-ratings, I’m gonna run it in spark-apps, and I’m gonna install it. So it’s installing right now, and we should be able to see something happening.
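The mount and install steps might look like this (local paths and the chart/release names are assumptions based on what the talk mentions):

```shell
# Point the two hostPath volumes at local directories; keep both
# mount commands running in the background
minikube mount "$(pwd)/ml-25m:/input-data" &
minikube mount "$(pwd)/output:/output-data" &

# Add the local ChartMuseum as a Helm repo and deploy the job;
# `upgrade --install` lets us rerun this for every new version
helm repo add chartmuseum http://localhost:8080
helm repo update
helm upgrade movie-transform chartmuseum/transform-movie-ratings \
  --namespace spark-apps \
  --install
```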

So it’s already running. Normally it starts the driver first, and that will trigger the executors to get instantiated, so we should see this pretty fast. To see what’s happening in the background, we can do some port forwarding to look at the Spark UI,

and you can see this Spark job running on top of our Kubernetes cluster; that’s pretty awesome. It’s doing the count at the moment. If you look at the executors, you actually see the two;

this is the driver that started the two executors, and the driver is actually not doing anything, of course; there’s one active task at the moment. As you remember, the executors only have one gig of memory and one CPU core, so they’re not doing anything fast. You can see the count, and here you see the movie dataset, almost 30,000 records, getting loaded, and the 20 million reviews getting processed, because it does a join and some aggregation, and in the end it will get stored.
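The port-forwarding step mentioned above could look like this (the release and pod names are illustrative; the operator derives the driver pod name from your SparkApplication):

```shell
# Find the driver pod, then forward the Spark UI port to localhost
kubectl get pods -n spark-apps
kubectl port-forward -n spark-apps movie-transform-driver 4040:4040
# then browse to http://localhost:4040
```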

Let’s see what the progress is of our application;

it’s still running,

probably almost done by now.

So we should be getting some data in okay.

You see, Oh, it’s already done, so check again,

yeah, it’s completed. We can actually inspect the logs of the driver if we want to. And you can see there’s a lot of debug output from the entrypoint of our base image, but here our spark-submit actually starts, and here’s the first output: starting the Spark UI, reading, and now writing 26,744 records,

and it’s done, and it stops the SparkContext. So this is pretty cool. So let’s actually see if it worked as expected:

we can look at the parquet output, too;

we do a row count on these movie ratings, and we get, as you see, 26,744 as expected. And we can actually look at some of the

ratings. And you see

some movie has an average rating of 2.5 based on two ratings, and Hope Springs has an average rating of 3.25 with 136 ratings. So this is not the interesting part, of course, but you see we actually deployed a Spark job on Kubernetes right now. So that is pretty cool.
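To sanity-check the output locally, something like this works (assuming a local spark-shell install and that the output parquet was mounted to ./output; both are assumptions, not shown in the talk):

```shell
# Count the rows in the generated parquet with a throwaway spark-shell
echo 'println(spark.read.parquet("output").count)' | spark-shell
```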

Success! Everything works as expected, so that’s pretty cool. It seemed like a lot of work to begin with, but in the end we created a very easy and robust way to deploy our Spark job using Helm charts. And you can imagine that the moment we have this all in place and up and running in a correct CI/CD pipeline, it will be very easy to just make minor changes to your environments, change your base libraries, and just deploy new versions of your Spark application without even having to worry about what the underlying platform is. Now, we’ve seen how to deploy this; we’ve deployed manually. So what are the next steps? You could imagine building a scheduler on top of this: we could use the Kubernetes scheduler or Airflow or some other way to deploy these jobs in a timed fashion. Actually, you can run Airflow pretty nicely on Kubernetes as well, which we are doing at Shell. You can also think about upgrading your Kubernetes clusters to use autoscaling node pools, so you can scale your clusters up pretty big and scale them down when the resources are not needed. And there’s a lot of other things you can do to improve this Spark job in this way; we haven’t even touched monitoring or logging or alerting, but those are all minor steps once you have this deployed already. Here are some links about the things I talked about, so there are links to the SparkOperator and Helm. Some of the code that is being used in the demo is already available on GitHub, so you can read it, use it and try it yourself.

About Tom Lous

GraphIQ, hired by Shell

Tom is a freelance data and machine learning engineer hired by companies like eBay, VodafoneZiggo and Shell to tackle big data challenges.

He has been building Spark applications for the last couple of years in a variety of environments, but his latest focus is on running everything in kubernetes.

Besides Spark and Kubernetes, Airflow, Scala, Kafka, Cassandra and Hadoop are his favorite tools of the trade.