Cost Efficiency Strategies for Managed Apache Spark Service


Today, with cloud-native on the rise, the conversation about infrastructure costs has spread from R&D directors to every person in R&D: “How much does a VM cost?”, “Can we use that managed service? How much will it cost us with our workload?”, “I need a stronger machine with more GPUs; how do we make it happen within the budget?” Sound familiar?

When deciding on a big data/data lake strategy for a product, one of the main chapters is cost management. On top of the budget for hiring technical people, we need to prepare a strategy for services and infrastructure costs. That includes the provider we want to work with, the different plan tiers they offer, the system needs, the R&D needs, and each service’s pros and cons.

When Apache Spark is the primary workload in our big data/data lake strategy, we need to think about whether we want to manage it ourselves or work with a managed solution such as Databricks.

Azure Databricks is a fully managed service rooted in Apache Spark, Delta Lake, and MLflow OSS. When discussing the best ways to work with Apache Spark, performance and tuning come into play. Databricks provides us with a managed, optimized Apache Spark environment ~50 times faster than OSS Apache Spark. But we need to consider the costs carefully.

Join this session to learn about the resources consumed with Azure Databricks, the various tiers, how to calculate and predict cost, data engineering and data science needs, cost efficiency strategies, and cost management best practices.

Speaker: Adi Polak


– Hi everyone, and welcome to our session. Today we’re going to talk about cost efficiency strategies for managed Apache Spark services. My name is Adi Polak and I work for Microsoft in the Azure engineering org. You can find me on various social media channels: you can ping me on Twitter or on LinkedIn, my messages are always open there. And if you’re interested in reading what I write, I publish on different platforms like Medium, InfoQ, DZone, and many more.
All right, so what are we going to learn today? We’ll start with a little bit of motivation, to understand how cost is integrated into everything that we do today. We’ll go over a set of tools. We’re going to look into the cost model of Azure Databricks, which is pretty similar to how the Databricks cost model on AWS works, so you can take a lot of information from that as well. We’re going to look into different cost optimization strategies, and then we’re going to wrap up with a few tips that I have for you.
Alright, so let’s start with a little bit of motivation. Why should I, or should you, care about cost? With the pandemic and the great lockdown that we are currently living through, we definitely see an acceleration of digitization. And in order to do it fast, a lot of companies are adopting the cloud, so a lot of companies are also abandoning what used to be on-prem. And that brings a set of challenges in the shape of cost. When we used to work on-prem, our engineering manager, or R&D director, or CEO or CTO would usually order all the hardware a year in advance. They would predict how much capacity the different services the company was building would need, order the hardware, and make sure there was enough space for it in advance. And today we already know that we can’t wait a whole year to get new machines. We want to deploy with the click of a button, and that’s absolutely possible with the cloud.
So how does it all work when we start engineering and creating a new service? Usually it starts on the business side: there is some business idea that someone developed. It can be the product manager, who constantly works with different customers and with engineering. It can be the CTO. It can be one function of the CEO, to understand the market and see where the gaps exist. Usually that person will go back to R&D, and they will start a prioritization process. They will discuss which services are available, whether this is the real priority that we want to develop, who the people are that can develop that set of capabilities, and how to enable everyone to be successful in doing that.
After that we go into design and building. This is where we, the software developers, come into play: we design the high-level architecture, we write the code, and we plan all the different tests that we’re going to do. This all goes into the HLD, the high-level design document that we create. The high-level design document usually consists of the requirements for the specific service: what it should look like, what the logic inside is, what it gives to the customer, and how we enable it to run fast. Then the different features that exist in it, and the architecture, more specifically how the different components talk and what the use cases are; maybe our software architecture will be there as well. We’ll need test plans, because we need to understand how we’re going to test that service. Another good point is security: do we need security, and what level of security do we need to integrate when we’re designing it? What about deployment? Are we working in a CI/CD kind of world where we need to deploy fast and make sure we have fast deployment cycles?
What about monitoring and audit trails: how do we know that our service in production actually works the way we expect it to work? Another point is maintenance. We developed it, that’s great, we have our first version in production. How do we maintain that version? Do we need to create more versions of it, and how do we enable that? And last but not least, and today we see it more and more, there is the cost. The responsibility for the cost of different services has actually shifted slightly from the engineering director to us, the software developers. The reason is that on the cloud there are multiple services we can work with. So they give us the responsibility and say, “Hey, you know what, here is a set of capabilities that you can work with. Maybe you can go and see what is relevant for you and what can be a good use.”
But you probably ask, “Why should I, as a person who works for a company, care about the cost? I’m a software developer, I didn’t go to business school or finance school or anything like that.” Well, there is a set of things you should consider. First of all, it’s good to understand how a budget works: the P&L, profit and loss, and how it all operates. The more you advance in your career, the more you’ll be responsible for a budget. Understanding the cost will help you influence technical decisions. Once you know how much a service costs, you can influence different parties. When you sit there in the meeting room and show your design, you can show how you tested out different features, not only on how good they are at delivering what they do, but also on how much they cost, and whether it makes sense to pay slightly more or slightly less for a service. You’re going to see that all eyes are on you when you start having this conversation around cost, how we can cut costs, and what we can do to improve our services as well.
And the last part: we definitely see a sharp rise in cultivating the idea of financial accountability through the whole R&D lifecycle. A lot of developers already get the responsibility of estimating how much their service is going to cost in the cloud, how much is going to be deployed, what workload it is expected to have, what the different services and infrastructure are, and how the cost is split. And this brings me to the high-level P&L, profit and loss. This is the bigger picture, and our section really goes into the cost of goods. How much do I pay for hosting my services? How much do I pay for third-party vendors? This is the piece of the cake, as I like to call it, that we are going to focus on today.
Now that we understand the motivation, let’s go into the tools. In the cloud we have many, many services that we can use. We have Amazon EMR, which gives us an HDFS environment. We have Cloud Dataproc by GCP, which gives us another environment to deploy open source Apache Spark. We have Azure HDInsight from the Azure cloud, which gives us the opportunity to deploy Apache Spark. And then we have the Databricks environment, which gives us the opportunity to work with a managed Apache Spark that is optimized. So it’s not the open source Spark, but an optimized solution that Databricks gives us.
So how do Apache Spark and the cloud computing delivery model work together? In the cloud computing delivery model we have IaaS, Infrastructure as a Service. This is our VMs, when we want to deploy things on VMs. It mostly includes Kubernetes as well, because Kubernetes helps manage the different containers we want to deploy on the IaaS. There is PaaS, Platform as a Service: things like HDInsight, EMR, and Cloud Dataproc. And then there is SaaS, Software as a Service. This is what Databricks gives us.
It gives us a unified platform to run managed Apache Spark workloads together with other workloads. This is something we need to keep in mind when we think about what kind of Spark exactly we work with in the cloud. Again, it depends on the level and the delivery model that we decide to work with.
There are more tools to help us predict the cost that we’re going to pay. We have the pricing calculators that all the big cloud providers give us: the Azure pricing calculator, the AWS pricing calculator, and the GCP (Google Cloud Platform) pricing calculator. These tools help us predict how much we are supposed to pay depending on the different workloads. So this is one of the tools you can use to understand what the workloads cost for your specific solution.
Then we need to be aware of how we organize our resources, how we control and report, and how we attribute costs. We have the billing report, and for that we have Azure Cost Management, which we’re going to see in a second. It helps us really understand where the money goes and what we are paying for. We need to be aware that it’s up to us to organize the different resources and subscriptions so that we can query the cost management tool in an efficient way. This is an example of what Azure Cost Management looks like. You can see that although it’s Azure Cost Management and it runs on Azure, your AWS workloads are right there at the top, where it says Elastic AWS, and then there’s HDInsight as well. So this is an example of how I can have a hybrid solution and manage all my costs from Azure Cost Management. We can also see how much it costs per region, so if I’m deploying to multiple regions I have the visibility to understand the different regions.
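To make the idea of breaking costs down concrete, here is a minimal sketch of grouping billing records by an attribute such as region. The records, field names, and amounts are hypothetical and only for illustration; they are not the schema of an Azure Cost Management export.

```python
from collections import defaultdict

# Hypothetical billing records. The field names and amounts are illustrative
# only; they are not the schema of an Azure Cost Management export.
RECORDS = [
    {"region": "westeurope", "resource_group": "rg-etl", "service": "Azure Databricks", "cost": 120.0},
    {"region": "westeurope", "resource_group": "rg-bi", "service": "HDInsight", "cost": 30.0},
    {"region": "eastus", "resource_group": "rg-etl", "service": "Azure Databricks", "cost": 80.0},
]


def cost_by(records, key):
    """Sum the cost of all records that share the same value for `key`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost"]
    return dict(totals)


print(cost_by(RECORDS, "region"))
print(cost_by(RECORDS, "resource_group"))
```

The same function works for any attribute the records carry, which is why it pays to decide up front how resources are named and grouped.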
I can also break it down by the different subscriptions and resource groups that I’m deploying to. So be mindful of how we structure things and which resource groups we assign, so that we can query them later on. That is going to help us understand where the money went, what the cost was, and how we can break it down.
The last piece around tools is to understand the billing models that most clouds offer. We have pay as you go: basically I sign up for specific services and I pay for the amount that I consume. The second one is the enterprise agreement. Let’s say I passed through the different tests, I already know I want to build my architecture with specific services, and I already know what the different workloads are. Then I can actually cut costs a little bit more, because I can come to the cloud provider and tell them, “Hey, you know what, I want to have this deal for one year, two years, three years.” This is the enterprise agreement. Specifically for Azure, you can see deals for one year and three years; they’re already built into the website, so they’re easy to find, and you already know how much you’re going to pay.
Now we’re going to look specifically into the Azure Databricks cost model. Like I said in the beginning, it’s very similar to how the AWS cost model works. When we look at the different Spark workloads that we’re running, we basically have two main options to compare. We need to understand whether we’re going with IaaS, deploying onto our own VMs or Kubernetes or some kind of HDFS, or whether we want to go with the managed Spark service that is more optimized, and this is Databricks. When we consider these two, we need to understand a few points that are not necessarily directly related to cost. The first one is the team knowledge and the team size. Say I’m going with IaaS, or with a managed HDFS solution, a PaaS service.
Do I need a bigger team? Do I need more expertise in the team? Do I need engineers and DevOps who are experts specifically on Kubernetes? Do I need an optimization expert? Many times we’ll see performance engineers, and that’s a role that is heavily used, especially when we have an on-prem solution. We need them to understand what’s happening in the cluster and how we can optimize the different workloads so we can actually fit more services in there. So do we need these experts? We need to understand how to run the machines and how they operate, how we deploy different containers onto them, and how we manage all of them. We need to understand the network and also the storage.
On the other side we have the managed solution, where we need to understand Spark. We need to know how to work with Spark, but this might be a different level of expertise. The team might not need to be experts on optimization. They should know optimization and how Spark internals work, especially if they work with a high load of data, but do we really need the people who know how to tweak every little thing? And what is the size of the team that we need? Do we need a big team? Do we need a lot of DevOps? Do we need a lot of data engineers? Or do we need the people who can actually write the logic and not deal with the infrastructure? Still, with Azure Databricks as the managed solution, we need to understand a little bit about how VMs work to be able to better utilize them. We need to understand the network, because usually you don’t work only with Databricks; you might connect it to other solutions and services available on the cloud. Storage is still important to know, because when we work with high-scale compute it means we have high-scale storage as well.
And this introduces another new thing called the DBU, the Databricks Unit, and we need to understand how it works in our cost model. Looking at Databricks at a high level, the different services being consumed are: the DBUs, the Databricks Units; the virtual machines that we decide to work with; the public IP address for the environment; the storage that we work with; a managed disk, if we decide to work with one; and bandwidth, because if we have communication between different regions we might get a bandwidth cost as well.
That all brings us to the plan tiers. So how does it work in Azure? In Azure Databricks we have two tiers, and in AWS as well, by the way. One is Standard, the second is Premium. Let’s take a deep look into what each one gives us. We have the Standard plan features. These are all the standard features that we can consume, like managed MLflow, managed Delta Lake, interactive clusters if we want them, job scheduling, and a lot of others that we get directly with the Standard version. Then we have the Premium plan features. We already get the Standard plan features, and on top of that we get more, like optimized autoscaling of compute and different security features. I took the liberty of trying to make sense of the extra features that you get when you work with Premium. There are the ones that touch specifically on performance, and that’s the optimization of the compute workloads. There are the ones that touch specifically on security, and that’s the role-based access control that we might need. If we have notebooks or jobs or tables or clusters or anything else that we want to make sure is locked to specific users, this is where we will probably need the Premium tier.
And the last one is monitoring. It’s true, we have the Spark history server, but maybe we need extra audit logs to analyze and process. For all the features listed here, we need to understand whether we need them or can live without them. So basically, whether we want to go with Premium or with Standard.
After understanding that, we need to think about the Databricks Units that exist there. You saw them right there in the table, and you saw that they provide different features at different levels. Now it’s time to understand what they actually give us. We have Data Engineering Light, Data Engineering, and Data Analytics. Data Engineering Light gives me open source Apache Spark out of the box. What I’m getting is scheduled JAR, Python, and spark-submit jobs, and this is basically the only thing that I’m getting. What does Databricks Light not support? It does not support Delta Lake out of the box, it does not support autopilot features like autoscaling, and it doesn’t support notebooks or connections to different data source tools and BI tools. The way we configure it is that when we start a cluster, under the runtime environment we can pick the Light runtime, Databricks Light, and start our cluster with it. That tells Databricks to use that runtime instead of the Data Engineering one.
These decisions influence the way we look at cost, because we have Standard, we have Premium, and then we have the different DBUs, Databricks Units, that we can consume. You can see that Engineering Light costs slightly less than Engineering, and Analytics costs slightly more than Engineering. What Analytics gives us is the interactive approach.
So when we want to work with different notebooks and we’re doing different workloads, we would like to look into the Analytics one. Let’s look at some workloads and what we would want to go with. Let’s say we’re working with a scheduled job. This will usually be the task of a data engineer to create, so we know Data Engineering or Data Engineering Light can work well for us. Let’s say we have an on-demand job. This might be the work of a data engineer, or of a BI person who now needs to create a report, or maybe even an analytics person who needs the report as well. Another kind of workload is the exploratory, more interactive one. Let’s say I want to run a specific query on the data because I want to check out some customer and I want to understand it right now. This is where I need all the power to make sure my query returns as fast as it can. The people who are usually going to use this are the BI folks and the machine learning folks, because they want to explore the data, they want to get results fast, and they want to understand what they’re building for.
Now that you understand a little bit about the different workloads, how the DBU works, and how the different costs work, let’s take a look at the VMs and how they work together with DBUs. When we work in the cloud, we have multiple VM types to decide on. We have general purpose, memory optimized, storage optimized, and compute optimized, and there’s a bigger list that you can check out in the blog post in the link. Each of them costs us a different number of DBUs. For the general purpose one, for one machine that runs for one hour, I’m going to pay one and a half DBUs. For the compute optimized one, I will pay two DBUs per hour.
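As a small illustration of how VM type feeds into DBU consumption, the sketch below multiplies the two per-hour figures mentioned in the talk (1.5 DBUs for general purpose, 2 for compute optimized) by the number of machines and hours. The function and dictionary names are made up for this example, and real DBU rates vary by specific instance size, so treat the numbers as placeholders.

```python
# DBUs consumed per VM per hour, using the two figures quoted in the talk.
# Real rates depend on the exact instance size; these are placeholders.
DBU_PER_VM_HOUR = {
    "general_purpose": 1.5,
    "compute_optimized": 2.0,
}


def dbus_consumed(vm_family, vm_count, hours):
    """Total DBUs consumed by `vm_count` machines of one family over `hours`."""
    return DBU_PER_VM_HOUR[vm_family] * vm_count * hours


# 400 general purpose VMs for one hour.
print(dbus_consumed("general_purpose", 400, 1))

# A smaller compute optimized cluster for the same hour.
print(dbus_consumed("compute_optimized", 20, 1))
```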
So we need to understand that, and it’s something we take into our equation when we work with different VMs. Once we know the VMs that we want to work with, we need to put that in there as well, and that’s going to impact our cost calculation. Let’s break down one scenario. Let’s say I have 400 VMs. The duration is one hour; I want to know how much it costs me for one hour. I need four cores in my VM, and I’m going with a general purpose one. I want to see how much it all costs me. So I put together the equation: the number of VMs times how much each one costs me per hour, plus the price for the runtime that I’m taking, Engineering or Engineering Light, times the DBUs that I consume, because I know I might need different types of DBUs. I also want to get the performance factor in there, so I calculate the performance factor. From the calculation I create this graph, so I can try to see where my sweet spot is. I know that when I work with the Engineering DBU there’s a good chance of getting better performance, because it’s not just open source Spark out of the box; it went through multiple optimizations. I want to see what exactly that means. If my performance factor is, let’s say, 10 percent, then I know it still costs me a little bit more to use Engineering. But if it’s already 15 percent, then I’ve hit the sweet spot, and I know that from here, if I get better performance, the cost is going to be lower, or I can get more jobs into the hour that I already paid for. So this is something to consider when you’re deciding on the different runtimes and how you want to work with the DBUs.
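The equation and the sweet spot described above can be sketched roughly as follows. This is a minimal sketch, not the exact calculation behind the graph in the slides: the VM price and the per-DBU prices are hypothetical placeholders, and the performance factor is assumed to mean the fraction of runtime that the optimized runtime saves.

```python
def cluster_cost(vm_count, hours, vm_price, dbu_per_vm_hour, dbu_price):
    """Cost = VMs x hours x (VM price per hour + DBUs per VM-hour x price per DBU)."""
    return vm_count * hours * (vm_price + dbu_per_vm_hour * dbu_price)


def break_even_speedup(baseline_cost, optimized_cost):
    """Fraction of runtime the pricier runtime must save to match the baseline cost."""
    return 1 - baseline_cost / optimized_cost


# Hypothetical prices in dollars; look up real ones in the pricing calculator.
VM_PRICE = 0.40               # per general purpose VM per hour
DBU_PRICE_LIGHT = 0.07        # per DBU, Engineering Light
DBU_PRICE_ENGINEERING = 0.15  # per DBU, Engineering

# The scenario from the talk: 400 general purpose VMs (1.5 DBU each) for one hour.
light = cluster_cost(400, 1, VM_PRICE, 1.5, DBU_PRICE_LIGHT)
engineering = cluster_cost(400, 1, VM_PRICE, 1.5, DBU_PRICE_ENGINEERING)

print(round(light, 2), round(engineering, 2))            # 202.0 250.0
print(round(break_even_speedup(light, engineering), 3))  # 0.192
```

With these placeholder prices the break-even lands at roughly 19 percent rather than the 15 percent in the talk, which only reflects the made-up inputs. Swapping in Premium tier prices gives the comparison for that tier.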
The same applies if you want to compare an optimized solution with a non-optimized one, even outside of Databricks. Let’s say I want to work with AWS EMR or HDInsight. When I know it’s not optimized, I want to see how long my job takes on those machines versus in the Databricks environment, because that will also go into my cost calculation.
The previous example was for the Standard tier. Now let’s look at the Premium tier. It’s the same calculation again, with the same number of machines; the cost is slightly different because I’m looking at Premium. Now I can see that my sweet spot is somewhere around a 45% improvement. So I can take that into consideration and say, “You know what? If I get more than a 45% improvement in the runtime, Premium makes sense for me. If not, then not.” But we already know that Premium gives us more optimized compute, so it has a good chance of actually giving us this improvement.
We also know we can pay as we go or pay in advance, for one year or three years. We can pay in advance for the VMs, and we can also pay in advance for the DBUs that we work with. We also need to understand which runtime and framework we’ll be using, and the optimizations that Databricks gives us. We already know there are multiple optimizations in Delta Lake. We know there are different optimizations in PySpark, in Pandas UDFs. Those are not necessarily only in Databricks; they can also be in the open source. But one sweet thing to consider is the Photon engine, which is basically an engine written in low-level code that helps us run everything faster, and it exists only in the Databricks environment. The third thing to consider and understand is the file system storage.
When we use DBUtils directly from the Databricks environment, it uses something called RA-GRS, read-access geo-redundant storage, which is more expensive. So you should consider configuring your own storage there instead of using the default one.
Let’s go quickly through the tips, and this is also going to be the wrap-up for the session today. We already understand that we need to somehow manage our spending. We can decide whether to track it per subscription, per management group, or per resource group, and whether we should enable some quota alerts as well. Most of the clouds let us create these quota alerts, which really help us understand what’s going on in terms of costs. We know that we should enable autoscaling: scaling machines up and down automatically can really help us fit more workloads onto the machines that we already have. And the last thing is to think about our VMs and how we work with them, because sometimes you would like to pay for compute optimized VMs just because it might mean that instead of 400 VMs we’re going to be good with 20 VMs, for example, for the workloads that we have.
Alright, thank you so much for listening today. If you have any specific questions that you’re not comfortable asking in the chat, you can always reach out to me on Twitter or LinkedIn and ask them in private. I hope you learned something new today, and I hope you take a deeper look into the different cost models and how you can optimize the different workloads for you and for your team.

About Adi Polak


Adi Polak is a Sr. Software Engineer and Developer Advocate in the Azure Engineering organization at Microsoft. Her work focuses on distributed systems, big data analysis, and machine learning pipelines. In her advocacy work, she brings her vast industry research and engineering experience to bear in educating and helping teams design, architect, and build cost-effective software and infrastructure solutions that emphasize scalability, team expertise, and business goals. Adi is a frequent presenter at worldwide industry conferences and an O'Reilly course instructor. When Adi isn't building machine learning pipelines or thinking up new software architecture, you can find her hiking and camping in nature.