Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged, providing cost, performance, and power-consumption advantages. Field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well known for high-performance, highly regular operations such as graphics processing and dense-matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to make informed decisions about whether and how to accelerate.
This talk characterizes the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache Spark SQL via services on the Amazon Web Services (AWS) cloud. These solutions aim to provide Spark users with high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of Spark SQL operations they accelerate well, how they accelerate them, and the implications for the user's application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
Roop Ganguly: Hello everyone, and thank you for attending. My name is Roop Ganguly. I'm a chief solution architect at Bigstream Solutions. I'm going to be talking today about hardware-based acceleration for Apache Spark. Before we get started, I just want to outline the objectives of the session, and they're really twofold. One is to bring hardware acceleration to everyone's attention and let folks know that it's here, it's usable, and it's definitely worth trying for your Spark applications. The second is that, as you get deeper into it, it matters which acceleration architecture you choose, because there are hardware implications that bubble all the way up to your Spark application. So that's what we're talking about today.
So let's start with why we're here in the first place, and in one word, it's performance. I'm not going to belabor the point. Data engineers, data scientists (I'm raising my hand; I was a data scientist until about five years ago), we all want performance. We want to use bigger data sets, longer look-back in our analytics, more sets of data. And this need is ever-growing. That's what we're seeing, both in our market research and, concurrently, with our customers. The issue is that CPU-based approaches to meeting these performance demands are not quite hitting the mark. There are a number of reasons for this, and we'll talk about those. I think one of the primary ones is that Moore's law has actually been slowing down for over five years now, so CPUs are not getting faster as fast as they used to. Consequently, what we're seeing out there in the user base is customers missing their SLAs, having a tough time meeting their deadlines, and getting cut off from data sources they want to use. So this is an important issue, maybe one of the primary issues, in terms of moving forward with Spark analytics.
So that said, let's start with what the approach is today. As we all know, we all run our Spark applications on CPU clusters. That's the most prevalent way to [inaudible], and Spark does a fantastic job of distributing computation and distributing data, as we all know. So it's become very, very popular; there's a huge adoption rate. And what happens when we want more performance? Well, the most common way we address that is with cluster scale-up or cluster scale-out. Scale-up refers to making the individual nodes in the CPU cluster more powerful: adding more cores, adding more memory. Scale-out refers to adding more servers to our clusters. They can obviously be used in conjunction, but there are two primary drawbacks to this.
One is that it's costly. These servers are not cheap; they're actually getting scarcer these days, from what I'm seeing in the literature. Even on the cloud, in managed environments, Azure and AWS charge premium prices for larger nodes and larger clusters in a lot of cases. The other problem we see is that scaling is typically sublinear in terms of performance improvement, meaning that if you get ten times the hardware, ten times the servers, you don't get even close to ten times the performance. There are a number of technical reasons for that, including I/O scaling, network scaling, and L2 cache contention; I won't get into all those details. So we've come up with other solutions in today's environment. One category is what I call code optimization: basically writing better Spark code, right? Putting a filter before your joins, avoiding redundant table scans. I've also loosely included caching approaches in that category.
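The sub-linear scaling point can be sketched with Amdahl's law. This is an illustrative model, not a measurement: the 90% parallel fraction below is an assumption chosen for the example, standing in for the real I/O, network, and cache bottlenecks.

```python
# Amdahl's-law sketch of sub-linear cluster scale-out.
# The parallel fraction (0.90) is a hypothetical value for illustration.
def speedup(n_servers, parallel_fraction):
    """Ideal speedup on n_servers when only part of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_servers)

# Ten times the servers yields nowhere near ten times the performance:
print(round(speedup(10, 0.90), 2))  # ~5.26x, not 10x
```

Even this optimistic model ignores coordination overheads, so real clusters often do worse.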
So there are ways that, if you draw your data from certain locations, certain URIs, the system will automatically cache that data for further use, for locality. And then the other, probably most familiar, one is Spark configuration optimization, right? The number of executors, number of cores per executor, executor memory, driver memory: all those things can be played with to optimize the performance of your application. Another sort of CPU-based approach I'd like to talk about briefly is software-based acceleration. So what is that? That's basically implementing Spark tasks under the hood with native code. It runs on CPU instances, so there's no need for additional hardware. Bigstream has developed software-based acceleration. It's seamless; it requires zero Spark user code change.
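The configuration knobs just mentioned are usually passed on the spark-submit command line or set in spark-defaults.conf. A minimal sketch follows; the values are illustrative rather than recommendations, and `my_app.py` is a hypothetical application.

```shell
# Illustrative only: executor/core/memory counts depend on your cluster.
spark-submit \
  --num-executors 4 \
  --executor-cores 7 \
  --executor-memory 24g \
  --driver-memory 8g \
  my_app.py
```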
But all of these approaches really complement the scaling of CPU clusters. I don't want to make it seem like acceleration is a replacement for CPU clustering; in fact, it's complementary to CPU clusters. So software-based acceleration is a way to accelerate without the advanced hardware we're talking about today. I just wanted to show a very quick result. This is a set of business intelligence queries called TPC-DS that we ran on standard Spark 3.0, and then with the software-based acceleration provided by Bigstream. A quick plug: that is available today on the AWS Marketplace; if you search for Bigstream, you'll find it there. What we're looking at here are the speedup numbers: how much faster does this acceleration make your Spark code run? You can see the parameters of these runs. This is a four-node cluster with standard CSV data. And you see we can get some pretty impressive performance improvement with software alone. So even without the specialized hardware we'll be talking about today, acceleration is a viable way to get performance.
But our subject today is taking performance to the next level, and that's where hardware acceleration comes in. Hardware acceleration, in a nutshell, involves running our analytics on what I'm calling programmable or specialized hardware: field-programmable gate arrays (FPGAs), graphics processing units (GPUs), and, as another approach, application-specific integrated circuits (ASICs). All of these hardware choices are designed for efficient execution of specialized code, meaning that the hardware adapts to the actual application. Contrast that with a general-purpose CPU, which is programmable but doesn't morph to adapt to your specific application, right? So these are all interesting hardware approaches. ASICs typically support domain-specific workloads, hence the name. I'm not going to be talking much more about ASICs, but they are in use for analytics, and particularly AI, today. The two leading technologies we'll be talking about are FPGAs and GPUs. They provide another level of flexibility as programmable hardware, right?
And they provide the efficiency that we're talking about, the efficient execution. The issue, and why isn't everybody using this type of hardware for Spark right now, is that they don't natively connect to any big data platforms, right? So middleware is needed to make that connection. Spark as downloaded from the Apache site runs on your CPU cluster, right? That's not necessarily true of these hardware [inaudible]. But if we can bridge that gap, both of these approaches can provide performance and significant power and cost advantages, and actually, again, complement CPU scale-up and scale-out to make it more cost-effective, more power-effective. So that's the goal.
The other thing I'd like to bring up today is that the hardware acceleration market is trending. This is happening; people are using this. Some evidence I want to present is some interesting research done by Ark Invest, a recent article that they wrote. What they're saying is that over the next decade, the accelerator market in terms of hardware is going to grow by many, many factors; in fact, on top of that, it's actually going to exceed expenditure on CPUs. That's their prediction. And as you can see in the last bullet there, this is driven by us: by big data analytics, AI, data scientists, et cetera. But I want to be clear that this research deals with hardware expenditure, right? It's tracking the hardware expenditure in terms of the market. Nothing is said here about the software that makes this stuff usable, and that's really what I'm going to be talking about today. But the point of this graph is that the accelerator market is taking off, and it's a trend we need to watch and take advantage of.
So I'm going to give a high-level architectural comparison of these two technologies, GPU and FPGA. Again, this is at a thousand-foot level. I'm not a hardware engineer; again, I'm a data scientist, but I think that's the correct level to take a look at this and start thinking about acceleration for analytics. Looking at the top picture, of the GPU, basically what you can think of is that it's a number of capable compute lanes populated by what are essentially more specialized CPUs. They support a reduced set of operations, but they're very, very efficient, and you have a very high number of them packed onto the hardware platform, right? The benefit of that is that you can get a very high degree of data-level parallelism. If you have data that you can compartmentalize into these different lanes, then you can get a ton of parallelism and a ton of performance out of that architecture.
The other interesting thing is that, similar to CPUs, they're programmed via an instruction set, so it's a more familiar programming model in that sense than some other hardware accelerator platforms. The challenge with GPUs is that they use a SIMD, single instruction stream, multiple data streams, approach to computation. Each of these compute lanes is executing the exact same instruction stream, and so branch divergence, meaning ifs, conditional jumps, things like that, can be very costly and can cause inefficiencies. We'll get into that a little bit in the next slide. The other thing is that, from some experimentation we've done and from reading the literature, power consumption can be kind of high for this highly parallel platform for some types of analytics. In the lower picture, you see FPGAs. FPGAs are basically configurable at the logic-gate level.
So I'm talking your OR gates, NAND gates, XOR gates; basically, that's how you program an FPGA. In that sense, there is no instruction set architecture for an FPGA; it's really at the logic level. That flexibility gives it certain advantages, which is basically that the logic can be reconfigured per operation and really maximize efficiency. FPGAs can be very, very well tailored to the operations that you're doing, and that also results in lower power consumption per computation. The other thing is that branches and irregular, heterogeneous parallelism can be leveraged on the FPGA. Another advantage they tend to have is very high on-chip bandwidth between the compute elements that are defined. But again, this higher degree of flexibility comes at a cost: you really have to understand FPGA architecture and think about your application in terms of logic gates to take advantage of it. So that's kind of the interesting piece with the FPGA.
So I mentioned irregular computations, and this is just an example where the two architectures behave differently. What I want to do is show how the architecture affects the application, right? That's all this example is for. If you look at the application code in blue, it's a piece of analytics written in, let's say, Scala for my Spark application. It's a retail application, and it says: if the shirt size is equal to large, then do the code in green; otherwise, do the code in red. That's represented by the green and red arrows for the computation in the legend. In the GPU case, because of the SIMD, single instruction stream, architecture that GPUs inherently have, we have to split the branches of the if statement into two separate execution streams that have to be executed at two separate times, right?
Only one instruction stream can be executing on the GPU at any given time, and so you see that the entire computation is split into two epochs. Whereas on the FPGA side, because we can configure the compute lanes to do what's called MIMD, multiple instruction streams, multiple data streams, we can execute both branches simultaneously if we have enough compute lanes. And that's another issue where the parallelism in the two cases varies. But all I want to illustrate here is that there are architectural differences that bubble all the way up to the application level, and that can affect your performance. These are the kinds of things to watch for. We don't need to learn how to program this hardware, right, as data scientists? What we need to do is understand the architecture and how it bubbles up into our application. What is our application doing? Which applications are going to run better on which acceleration architecture? That's really what I'm trying to point out with this example.
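A toy cost model makes the two-epoch point concrete. On SIMD hardware, lanes taking different sides of a branch are serialized, so a divergent branch costs roughly the sum of both paths; MIMD lanes each run only their own path concurrently, so the slower path dominates. The cycle counts below are made up purely for illustration.

```python
GREEN_COST, RED_COST = 4, 6  # hypothetical cycles for each branch path

def simd_cycles():
    # Two epochs: every lane steps through the green path, then the red path
    # (inactive lanes are masked off), so the costs add.
    return GREEN_COST + RED_COST

def mimd_cycles():
    # Lanes configured per path run simultaneously; the slower path dominates.
    return max(GREEN_COST, RED_COST)

print(simd_cycles(), mimd_cycles())  # 10 vs 6 cycles for the same work
```

The gap grows with the number of distinct branch paths, which is why highly conditional code tends to favor the MIMD-style configuration.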
So here's a high-level set of hypotheses and observations from someone with five years of experience in the field. I've used this technology and done a lot of studying of it; this is one engineer's opinion. Basically, in terms of SQL analytics and ML operations, what I hypothesize is that scan is typically going to be better on the FPGA, because of MIMD versus SIMD: there are a lot of if statements in scanning data, and also in decompression, which typically involves bit-level operations. Your SQL operations, things like join, aggregate, even project, are going to vary depending on the irregularity of the operation. The more regular it is, the better it's probably going to perform on the GPU versus the FPGA. For ML training, I think we all know this: the GPU is pretty ubiquitous and very widely used for training.
And the reason for that is that training typically involves very regular matrix operations, things like matrix multiply and tensor product. The other point is that for training you typically need floating-point, high-precision, multi-word values to do those computations, and that's amenable to the instruction set architecture that GPUs have. So that's where it sees benefit. Inference is an interesting one. You would think that with training being GPU-specific, maybe inference is as well, which could be true, but it really depends on the precision you're using. There are inference algorithms, to my understanding, that use things like vectors of bits, logic ones and zeros. In those particular cases, because of the FPGA's ability to conform to bit-level operations, it may have an advantage in terms of performance for inference. What I'm trying to get at here is that this is a taxonomy. We need to think about these kinds of issues when we're accelerating, and then map that back to our understanding of our own application to make it work.
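As a concrete instance of the bit-level point: in binarized inference, weights and activations restricted to +1/-1 can be packed into machine words, and a dot product reduces to XNOR plus popcount, exactly the kind of gate-level operation an FPGA does cheaply. This sketch is illustrative only; the encoding convention is an assumption, not any particular library's.

```python
def bin_dot(a_bits, b_bits, n):
    """Dot product of two {+1,-1} vectors of length n, each encoded as an
    n-bit integer where a 1 bit means +1 and a 0 bit means -1."""
    agree = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")  # XNOR + popcount
    return 2 * agree - n  # agreements contribute +1, disagreements -1

# [+1,-1,+1,+1] . [+1,+1,+1,-1] = 1 - 1 + 1 - 1 = 0
print(bin_dot(0b1011, 0b1110, 4))  # 0
```

A float version of the same dot product needs n multiply-adds on wide values, which is where the GPU's floating-point units pay off instead.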
So, like I said, this stuff is available today for data scientists to use, and there are two technologies I'm going to be presenting some results about. One is a GPU-based acceleration technology; the other is FPGA-based. They're both available today on AWS, but I want to be clear that this is not a head-to-head performance comparison. This is really about getting the audience to understand the level of performance that can be gleaned from these acceleration technologies. To get into the experimental setup, really from a user's perspective: we ran four-node worker clusters on AWS, with eight vCPUs per worker and one executor per worker, using seven cores and leaving one core for the OS. The baseline in all cases is Spark 3.0.1 run on the CPUs of the worker cluster.
This is, again, the benchmark suite called TPC-DS. For those who aren't familiar, it's a set of business intelligence queries that really covers the gamut of SQL operations. There are about 104 of them; we chose 90 of them for various reasons. But I want to be clear that these are the standard TPC-DS queries, coming right from the TPC website, tpc.org. I won't name names, but I've seen other studies that use modified versions of these, or proprietary queries. We really want to do it from the user's perspective, so we used the standard TPC-DS code, and also standard TPC-DS data in CSV format. However, it's gzipped, which is usually the way people store compressed data. The other thing is that the data is all coming from the identical AWS S3 bucket. Again, this is not a head-to-head comparison, because of differences in the run environment, and we'll get into that.
The GPU technology is called RAPIDS. It's provided by NVIDIA, and you can allocate a RAPIDS cluster via AWS EMR, Elastic MapReduce, which is the [inaudible] service. The cluster is comprised of g4dn.2xlarge instances, a standard GPU instance type with a single GPU per node; you can see its characteristics in terms of cores and memory. What we did is basically optimize the Spark configurations as recommended by the NVIDIA literature for RAPIDS. We did a similar thing with the Bigstream FPGA-based Spark acceleration. Now, F1 instances, which are the FPGA instances in AWS, are not yet available in EMR. We're actually working on that with AWS, but we were able to allocate them using a Bigstream-provided script in much the same way; it works almost the same as EMR in terms of ease of use. You can see the characteristics of the F1 instance: obviously it's got a slower CPU and more memory. We again optimized the Spark configuration, but I just want to emphasize that we're running on two different instance types, so there are no real direct comparisons to be made between the two technologies. We do, however, get a sampling of the performance that's possible.
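For reference, the RAPIDS plugin is enabled through Spark configuration. This is a minimal sketch: the property names follow NVIDIA's public documentation, but the values are illustrative and `my_query.py` is a hypothetical application; consult the RAPIDS Accelerator docs for your version.

```shell
# Illustrative configs; exact settings come from NVIDIA's RAPIDS documentation.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  my_query.py
```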
So this is a graph showing speedups, again comparing running Spark 3.0.1 on the CPU versus including the GPU via the RAPIDS interface. You can see that you do get speedup, on average about 1.9x. I won't speak to the details of how RAPIDS works, but I can say that we used the standard TPC-DS code, there was zero code change involved, and you see a spectrum of performance with the RAPIDS solution.
We ran the exact same experiment, but on F1 FPGA nodes for Bigstream, and you see a higher level of speedup. Again, this is not a head-to-head comparison, but it gives the flavor of the acceleration you can see with this solution. I can speak to this because I know there's a spectrum of operators that we have implemented on the FPGA for acceleration, so you see a spectrum of performance levels for the different queries: some queries make more use of the accelerated operators than others, and that's why you see the spectrum. But the overall average speedup you get over standard Spark is around 3.5x, 3.66x. And by the way, I want to add that there's no slowdown on any of the queries. With Bigstream, you get default Spark behavior if something can't be accelerated.
I just want to bring out how all this is possible. How can we accelerate with zero code change on this complex hardware? How can data scientists access this? This is the way the Bigstream solution works. The basic idea is that it's, again, software middleware that automates the process of accelerating Spark. The way it does that, really briefly, is that it leverages the Spark physical plan, which we call the dataflow. We're able to intercept that dataflow, and it contains all the information about the application being run in Spark, right? It's got the schema, concurrency information, operator information, the whole shebang. So if we intercept that dataflow, we're able to generate accelerated code for this kind of hardware. That includes CPUs, right? We saw the software-based acceleration; that's just native C++.
It includes FPGAs, where we saw those results. There's another platform called SmartSSD, which is computational storage, meaning there's an FPGA attached directly to the storage device. We're working with Samsung on that product and taking it to market as we speak. And then the roadmap also includes GPUs. So really, the software is a framework for automating that acceleration process onto this hardware. The key thing is that, as you've seen, we've seen up to as high as 10x acceleration end to end. But I think equally important for us data engineers and data scientists is the stuff in the upper right, which is zero code change, right? We don't want to disrupt our work processes to do this acceleration, and that's really what these hardware acceleration approaches provide us.
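The per-operator idea behind intercepting the physical plan can be sketched in miniature: walk a plan tree and route each operator to an accelerated implementation when one exists, otherwise fall back to default Spark, which is also why unaccelerated queries see no slowdown. The operator set and plan shape below are hypothetical, not Bigstream's actual internals.

```python
# Hypothetical set of operators with accelerated implementations.
ACCELERATED_OPS = {"Scan", "Filter", "Project", "HashAggregate"}

def assign_targets(plan):
    """Map each operator in an (op, children) tree to 'fpga' or 'cpu-fallback'."""
    op, children = plan
    targets = {op: "fpga" if op in ACCELERATED_OPS else "cpu-fallback"}
    for child in children:
        targets.update(assign_targets(child))
    return targets

# A toy physical plan: an aggregate over a join of a filtered scan and a scan.
plan = ("HashAggregate",
        [("SortMergeJoin",
          [("Filter", [("Scan", [])]),
           ("Scan", [])])])
print(assign_targets(plan))
```

The same traversal could emit FPGA, GPU, or native-C++ code per operator, which is how one middleware layer can target several hardware back ends.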
So, in summary: hardware acceleration is here. We believe it's taking off, especially for Spark analytics. It's available today, on the cloud and also on premises, and it offers zero code change, so no disruption to your work processes. The way you want to think about it is that it provides a next level of performance, but it really enhances the traditional Spark optimizations we talked about. Any optimization you do, any optimization Spark does, that Catalyst, the optimizer, does, acceleration will just ride on top of, because it's accelerating on a per-task basis in general. As for the use cases for this performance, I don't think I have to go into much detail here: basically, some folks want the highest-performing analytics, right? They have their infrastructure; they want the highest-performing analytics on it.
There's the ability, which we already mentioned, to leverage more data. When I was a data scientist, I used to worry about how much look-back I had in my historical analytics; this can really help with that. You can ingest more data sources, and larger data sources, so it really can open up the range of data you can use. And it's really about overcoming cluster scaling limitations. This can be used in conjunction with cluster scaling to give you the maximum performance for the buck, because it has cost and power advantages over simply buying servers or expanding your AWS cloud footprint in the CPU realm. That can result in total-cost-of-operations savings in many, many cases.
So thank you. That's my talk, and I appreciate your attention. If you have any questions or want further discussion beyond the question-and-answer session, please feel free to contact me out of band, by email or via LinkedIn, and I'm happy to talk about this technology, which I'm very, very excited about. Thank you.
Roop Ganguly is Chief Solution Architect at Bigstream Solutions Inc., managing all the customer and partner relationships from both a bizdev and technical standpoint. He holds a PhD in Electrical Engine...