Turbocharge Spark with Samsung SmartSSD® CSDs powered by Xilinx

May 26, 2021 04:25 PM (PT)

Computational storage devices like the Samsung SmartSSD® with Xilinx FPGAs offer an attractive vehicle to accelerate algorithms such as compression, video transcoding, and search. Today, Samsung, Xilinx, and Bigstream have further pioneered an approach to seamlessly accelerate the Spark platform on these devices, yielding an impressive 2x-8x performance gain with zero user code changes.

Attendees will learn about the technical approach taken by Bigstream middleware to accelerate Apache Spark on the platform. Finally, we present performance results for the solution using the TPC-DS benchmark suite and discuss the Total Cost of Ownership (TCO) savings this level of performance can offer the Spark user.

In this session watch:
Seong Kim, PhD, Sr. Director, Xilinx
Steve Tuohy, Head of Marketing, Bigstream



Seong Kim: Welcome to our presentation, Turbocharge Spark with Xilinx, Bigstream, and Samsung SmartSSD. My name is Seong Kim, and I’m senior director of the data center systems architecture team at Xilinx. I invited Steve Tuohy, VP of marketing from Bigstream Solutions. Today’s agenda is shown below: an introduction to computational storage and SmartSSD. Then we’ll talk about what Spark acceleration is, how it gets accelerated, and what the performance results are, followed by SmartSSD technical details, and we will conclude with a brief TCO analysis.
I’d like to briefly talk about the industry around computational storage. First, there has been tremendous momentum within the standards bodies to support a common interface to computational storage. The SNIA Compute, Memory, and Storage Initiative is driving the development of common use cases and terminology, and also driving alignment on a standardized interface. At the same time, computational storage products have hit the market with growing customer demand. In fact, we are happy to announce that the SmartSSD, the joint solution from Xilinx and Samsung, has passed its qualification requirements and is available as a mass-market computational storage drive. I will talk about what SmartSSD is, how it operates, and what kind of acceleration can be done in a later section of the presentation.
Before we delve into how Spark acceleration works, I’d like to introduce the common use cases for SmartSSD and the key value that each use case can provide. The first use case is transparent compression. SmartSSD can provide compression and decompression functions using the acceleration logic inside the SmartSSD, which gives you lower storage costs or, in other words, more storage capacity for a given dollar amount. The second use case is video file transcoding. We also call this faster-than-real-time video transcoding. By utilizing video encode logic in the SmartSSD, you can write video files to the SSD directly after encoding, without sending the actual video file to the x86. This can increase video transcoding performance by offloading the compute-heavy video encoding process from the x86.
And you can reduce the latency involved in data exchange over the PCIe interface by enabling peer-to-peer technology. Another use case is search in storage. The typical use case is cybersecurity: terabytes of log data stored on the SSD can be scanned by the acceleration logic in the SmartSSD, which returns only the data of interest back to the host or the application. And last on the list, today’s focus, is big data analytics. This specific use case is based on Bigstream’s software and hardware acceleration technology. They have implemented FPGA-based middleware to accelerate Apache Spark on the SmartSSD platform without user code changes. Here, I’d like to hand it over to Steve, who can give you details around Bigstream’s Spark acceleration solutions.

Steve Tuohy: Thanks, Seong. Thank you to Xilinx for the partnership and for including us in this talk. Thanks to Databricks for hosting the Data + AI Summit. As Seong said, my name is Steve Tuohy and I’m representing Bigstream here today. I want to talk a bit about what Bigstream offers as a pioneer in big data acceleration, and also talk a little more about what we mean by that. There’s a lot of excitement in this space of acceleration for Apache Spark and for big data platforms overall. Let’s talk a little bit about what that means and how it’s relevant to the audience at Data + AI Summit.
Bigstream has been at this for about five years and our focus is Apache Spark. We have a platform that will cover other platforms beyond Spark, but today the focus is Spark, and what we’ve really focused on is ensuring that the Spark user has zero code change. It’s seamless in their usage of running Spark jobs with Bigstream. And basically, at our core is making Spark jobs run faster. We have a product in the market today that includes both software-driven acceleration and hardware-driven acceleration, and we’re agnostic to cloud or on-premises deployments.
So I want to talk both about what acceleration is and what it isn’t. Spark is inherently fast, people will tell us, so what do you mean you’re accelerating this? Are you a different distribution than EMR or HDInsight or Databricks? No. With acceleration, we stand on the shoulders of the latest version of Spark and add acceleration to that. We’ll also get the question: Spark is meant to scale up and scale out, bigger nodes, more nodes, so if I want to go twice as fast, can I just add twice as many nodes, or nodes that are twice as powerful? Yes and no. That is an expensive proposition, both in terms of money, doubling your spend, and in terms of communication overhead. Our studies and others have illustrated that scaling is not linear as you add nodes. So this chart on the screen right now is not a straight line.
Just one test here shows that as we add more nodes to a cluster with Spark alone, the performance gains are nowhere near linear; there’s quite a fall-off. Things keep getting faster, but nowhere near in proportion to the number of nodes being added. The other question that will come up is: isn’t acceleration, with Spark or any platform, just a matter of waiting for the next version or upgrade, and that’s where I get my acceleration? We all really get excited about Data + AI Summit and learning about all the developments in Spark that are covered in the other talks, around ease of use of the product and functionality, and speed gains are certainly part of that. So you definitely will get the fastest instance on Spark 3.0 or 3.1 as you move forward. That, again, is not what we’re talking about with acceleration.
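The fall-off in scaling described here can be illustrated with a toy model. The sketch below is not the test shown in the talk; it assumes an Amdahl-style split between serial and parallel work plus a per-node communication cost. The function name `speedup` and both parameter values are hypothetical, chosen only to show the shape of the curve.

```python
def speedup(nodes, serial_fraction=0.05, overhead_per_node=0.002):
    """Toy model of cluster speedup relative to a single node.

    serial_fraction: share of the job that cannot be parallelized (assumed).
    overhead_per_node: extra coordination cost per added node (assumed).
    """
    runtime = (
        serial_fraction
        + (1.0 - serial_fraction) / nodes
        + overhead_per_node * (nodes - 1)
    )
    return 1.0 / runtime


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        # Compare the modelled speedup with the ideal, linear one.
        print(f"{n:>2} nodes: {speedup(n):.2f}x (ideal {n}x)")
```

With these assumed values the modelled speedup keeps growing as nodes are added but falls further and further behind the ideal line, which is the behavior described above.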
Bigstream and acceleration work on top of that, taking each task or operation of a Spark workload, or of big data tools in general, and identifying which of those can be accelerated. I’m going to illustrate with an example here. It’s the pandemic, I’m at home like millions of others, and I decided, why not work in the backyard? We built a shed and now we’re finishing it. So I thought, well, what’s the project here? For one person, myself, who’s not that handy, maybe it could take two full days to finish this. It’s some combination of putting up drywall, nailing that in, painting it, and sanding it. There’s a specific sequence to it, but this is my project plan. Now, I said, I can get this done faster with a little help from my friends or family, in some socially distanced, pandemic-specific way.
And if we have a plan and we organize this, we can make this run a lot faster. So this is like going from a non-distributed fashion, pre-Spark, pre-Hadoop, and shifting to Spark. We’re taking this two-day process and now we’re going to accomplish it in six hours. I’m simplifying, but the key additions here are that we have an organized plan and we’re distributing it across multiple workers. That’s the analogy to Spark, which generates a data flow in the form of a physical plan and then gets the benefits of parallel computing, distributed computing, and multiple nodes in a cluster. Okay, so acceleration, again, builds off what Spark has to offer. It takes that plan and it takes those workers across a cluster. And what it does, again, is examine specific operations and tasks in that plan and ask, is there a way to make this faster?
Is there a way to make this scan or this filter faster? Back to my shed: is there a way to do the painting a bit faster? Is there a way to do the nailing faster? I make the analogy here to a paint sprayer or a nail gun. So Spark is bringing together an organized team, whereas acceleration on top of that takes individual tasks and operators and accelerates them. Seong introduced the SmartSSD. That’s one of several accelerators that Bigstream incorporates, and now we’re going to dive a little deeper into how that comes to life. To pull it back into the realm of the specific accelerators and Spark, and to summarize, there are several inherent limitations to the maximum you can get out of Spark, which we’ve talked about a bit.
So there are the diminishing returns to cluster scaling. There’s Moore’s law: basically, the CPU is depended on for all the processing in most Spark jobs. And the CPU is great. We all use them all the time for just about everything. They are generalists, but there are better tools for a specific job: for a scan, for specific parallel processing, for machine learning. And we can incorporate those with some help from tools like Bigstream. The final limit I’ve got there in the lower left is that Spark, like many applications, is written in a higher-level language, so it’s generalizable across all functions, but that leaves some efficiency on the table for specific operations that can be accelerated with software. So this opens the door for the acceleration space, the space that Bigstream is a leader in, and where we’re seeing a lot of excitement and action.
We are a middleware connecting platforms like Spark with advanced computing options, be it hardware or software. It builds off of Spark’s physical plan, and we’ll look at that a little more closely in a bit, targeting specific operations to accelerate. Core to this, particularly on the hardware side, is programmable hardware: field-programmable gate arrays (FPGAs), SmartSSDs and other forms of computational storage, GPUs, and even ASICs, which are kind of their own category in a way. And on the non-hardware side, Bigstream delivers acceleration with C++ accelerators, taking it below the level of Java to, again, find specific tasks and operations that can be performed more effectively with Bigstream.
The SmartSSD is one of the hardware implementations of acceleration. As with all the hardware accelerators, Bigstream will examine the physical plan, going through it task by task, operation by operation, and determine which of these will benefit from software acceleration or whatever hardware acceleration is available in your cluster. For a specific operation, it might determine that the SmartSSD is going to run this faster than the CPU. So you’re freeing up the CPU, sending less to it, so it becomes less of a bottleneck. That works with any of our accelerators. What’s unique about the SmartSSD, though, is that it actually limits what gets to the CPU and reduces some of that traffic and overhead that we talked about before, in terms of adding more nodes to a cluster.
And I’m illustrating this below with two timelines. The first one is running without acceleration and without a SmartSSD. Here you see data move off of storage, off of an SSD, as a transfer step, and that’s in blue. In red, there is a filter and decompression that happens on the CPU, and then further SQL processing in that teal color. The SmartSSD not only accelerates the filter and decompression and accelerates the SQL processing, but it does that filter and decompression in storage, on the SmartSSD.
And so that red piece is shorter because it’s being done more efficiently. But the big piece is also that blue in the middle: a much smaller volume of data makes its way to the CPU and therefore puts much less pressure on the overall cluster. Even that part is more efficient with the transfer, and then the SQL processing at the tail end is also accelerated in a different form. So each of these three chunks is shorter, and the sequencing of these reduces the bandwidth challenges there. The other key point I want to make here on the SmartSSD is that the main computations that get offloaded from the CPU by the SmartSSD are decompression and filtering, which we mentioned in that example, but also projection, data format processing, and selected SQL operations as well.
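The effect on the transfer step can be sketched in a few lines. This is a minimal illustration only, assuming compressed, newline-delimited JSON records and a selective filter; the function names and the sample data are hypothetical, and ordinary Python stands in for the FPGA logic on the drive. It counts how many bytes would have to reach the CPU in each of the two timelines.

```python
import json
import zlib


def make_block(n=10_000):
    """Build one compressed block of JSON rows (sample data, assumed)."""
    rows = [{"id": i, "status": "error" if i % 1000 == 0 else "ok"}
            for i in range(n)]
    text = "\n".join(json.dumps(r) for r in rows)
    return zlib.compress(text.encode())


def decode_and_filter(block, predicate):
    """Decompress, parse, and filter one block."""
    lines = zlib.decompress(block).decode().splitlines()
    return [row for row in map(json.loads, lines) if predicate(row)]


def host_path(block, predicate):
    """Unaccelerated timeline: the whole block is sent to the CPU first."""
    bytes_to_cpu = len(block)
    return decode_and_filter(block, predicate), bytes_to_cpu


def in_storage_path(block, predicate):
    """Accelerated timeline: decode and filter happen next to the flash,
    and only the matching rows are sent to the CPU."""
    matches = decode_and_filter(block, predicate)
    payload = "\n".join(json.dumps(r) for r in matches).encode()
    return matches, len(payload)


if __name__ == "__main__":
    block = make_block()
    is_error = lambda row: row["status"] == "error"
    _, host_bytes = host_path(block, is_error)
    _, device_bytes = in_storage_path(block, is_error)
    print(f"bytes to CPU, host path:       {host_bytes}")
    print(f"bytes to CPU, in-storage path: {device_bytes}")
```

Both paths return the same rows; the difference is only in how much data crosses from storage to the CPU, which is the blue segment in the timelines described above. How large the reduction is depends on how selective the filter is.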
So let’s take a look at some of the results that we have recently with a SmartSSD and a Spark cluster. These are across the hundred or so TPC-DS queries, so the decision support part of this. These are ad hoc analytics, and right here across the board, you’ll first notice that we have at least doubled almost every single one of these queries. I think there’s one or two of them, maybe three, that are just shy of 2X acceleration. On average across the board, this is 4.6X acceleration, and as high as 7X, or 6.6X for the top 10. So these are not applicable only to cherry-picked queries. Even though Bigstream’s approach is to focus on certain operators, those that we’ve identified as most prevalent in your average Spark job, this has wide applicability across different types of, in this case, TPC-DS queries.
Our customers actually see some of the biggest performance gains not necessarily in the ad hoc analytics type of workloads, but in some of the ETL and ingest big data pipelines into a data lake, Delta Lake, or data warehouse, where they really are facing SLAs, and we can help them meet those pretty impressively. These queries, as you see in the note there, are on JSON data with compression, using three SmartSSDs in this cluster. Again, the acceleration is built operation by operation, specific to a data format, and so we build up that library over time.
So if this is so great and the results are so compelling, why doesn’t everyone just throw FPGAs, GPUs, and SmartSSDs right into their Spark cluster, if I’m a Spark user? Well, it’s not that simple. In fact, FPGA programming and GPU programming are pretty intricate. And the fact of the matter is that this audience at Data + AI Summit, the Spark user, tends to live up at the top of this diagram. They’re focused on making better ML and AI models and optimizing data pipelines. They would tend, and you would tend, not to really want to worry about the plumbing, the underlying hardware and compute infrastructure. And vice versa: those who spend their time building a data center or optimizing a public cloud environment aren’t as close to the data platform. And so we describe this as the programming gap between these two different worlds, which makes it difficult to incorporate really the best tool from a computing perspective into an environment like Spark.
So that’s where an organization and software like Bigstream come in, and where we’ve devoted the last five years to building expertise to make this seamless to the Spark user. You’ve heard me say it a couple of times, but zero code change. I break up our innovation into these three buckets here on the right. With that first one, the main goal is to ensure that the Spark user doesn’t have to think about this, doesn’t have to change any of their Spark code or make heavy configuration changes. In, say, EMR, you can be up and running in minutes adding Bigstream to your Spark clusters. This middle piece, again, I’ve touched on, and we’ll go a little deeper. Bigstream examines Spark’s physical plan. Spark and other big data platforms all have a data flow they generate, and Bigstream, in an automatic fashion, evaluates each component, each operator, to determine which can benefit from the acceleration that comes from software or from the advanced compute that we’ve talked about, like a SmartSSD.
Then the bottom piece really is a third category, which is the programming: the programming of the FPGA, the SmartSSD, even that C++ native language that we bring in for software-driven acceleration. This is where we focus on integrating them for seamless access to the hardware acceleration. Some other organizations are active here too. Nvidia is making a lot of noise in acceleration themselves, obviously with a focus on GPUs, in the last six months or so with RAPIDS. They’re integrated, so they’d kind of be the bottom two pieces together here. But this is getting a lot of attention in a lot of spaces, not just with Spark. I’m going to talk a bit about the numbers on here, but other examples of big data acceleration include Google’s TPU. That’s them declaring that the CPU is not the best tool for their ML and AI frameworks, and so they can actually justify building a dedicated chip, an ASIC, in the form of a TPU.
Amazon announced late last year that it’s integrating FPGAs into the Redshift data warehouse so that they can accelerate their own tool. Facebook has some internal pieces with ASICs as well. A couple of other examples that I have there on the bottom are rENIAC, which does so with Cassandra, and TigerGraph, pulling in FPGAs with graph databases. To come back to the visual on the screen, this investment firm identifies accelerators, the hardware side of this, as becoming a $41 billion industry over the next 10 years, actually exceeding CPU sales at $27 billion. That would be in the lower right there, and it’s driven heavily by this space, by big data analytics and artificial intelligence in particular. So we’re seeing a lot of excitement here.
Bigstream has put its emphasis on FPGAs, SmartSSDs, and software acceleration, based on our assessment of the longest-running components of Spark and where we felt we could have the biggest impact. My colleague Roop Ganguly is giving a talk tomorrow, Thursday morning, here at the Data + AI Summit, evaluating different approaches, GPUs and FPGAs in particular, and some of the pros and cons of those. So be sure to check that one out before you leave the show.
I want to shift back over to how it works. This is a somewhat simplified diagram of how Spark works without Bigstream, and I want to go through this to emphasize how we can deliver this seamlessly. We are not getting in there and mucking up Spark. So there’s the Spark driver and those writing Spark code over on the left. Ultimately that gets taken, and Catalyst will generate a logical plan, or rather several: an unresolved, a resolved, and an optimized logical plan. I’ve got that narrowed into one step.
And then ultimately there is a physical plan that gets sent off as Java bytecode to a cluster of, usually, CPU nodes. Here’s where Bigstream comes in, and what the world looks like when Bigstream and a Xilinx FPGA or a Samsung Xilinx SmartSSD come into play. Bigstream will evaluate the physical plan through data flow analysis and generate a new, enriched physical plan, looking at our hardware accelerator templates. And again, for each task it will evaluate: can this be accelerated with software acceleration? Can this be accelerated with hardware acceleration? And so it pushes the plan off to the cluster that now has these accelerators built in. So again, this does not touch the inner workings of Spark, and so it is a seamless execution. And going all the way back to the top, the innovation of each Spark version really is in all the pieces that aren’t in blue here. So we take what Spark accomplishes and then send things off to the best compute approach for the right task.
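The per-operator decision described here can be pictured as a walk over a plan tree. The sketch below is purely illustrative and is not Bigstream's implementation or Spark's API: the `PlanNode` class, the operator names, the two template sets, and the `enrich` function are all hypothetical stand-ins for "look at each operator and tag it with the best available backend".

```python
from dataclasses import dataclass, field

# Hypothetical template libraries: which operators have an accelerated version.
HARDWARE_TEMPLATES = {"scan", "decompress", "filter"}
SOFTWARE_TEMPLATES = {"project", "hash_aggregate"}


@dataclass
class PlanNode:
    op: str
    children: list = field(default_factory=list)
    backend: str = "cpu"  # default: run on the CPU as generated by the planner


def enrich(node, hardware_available=True):
    """Return a copy of the plan with each operator tagged with a backend.

    The input plan is left untouched, mirroring the idea that the original
    planner output is not modified.
    """
    children = [enrich(child, hardware_available) for child in node.children]
    if hardware_available and node.op in HARDWARE_TEMPLATES:
        backend = "hardware"
    elif node.op in SOFTWARE_TEMPLATES:
        backend = "native_software"
    else:
        backend = "cpu"  # no template: fall back to the unaccelerated path
    return PlanNode(node.op, children, backend)


def describe(node, depth=0):
    """Render the plan as indented lines of 'operator -> backend'."""
    lines = ["  " * depth + f"{node.op} -> {node.backend}"]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


if __name__ == "__main__":
    plan = PlanNode("sort", [PlanNode("hash_aggregate",
                                      [PlanNode("filter", [PlanNode("scan")])])])
    print("\n".join(describe(enrich(plan))))
```

The fallback branch is the important design point in this sketch: any operator without a matching template simply stays on the CPU, so a plan is never rejected for containing something that cannot be accelerated.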
Just before I hand this back over to Seong, I’m going to show this same view where the SmartSSD is profiled. Just one last point I want to make on the SmartSSD in particular: at the storage layer, there will typically be multiple SmartSSDs in the cluster. The data will be partitioned across those, and so as it’s running the computations on the SmartSSD, which offloads the CPU, it’s doing so across the partitioned dataset to really have that big impact on overall acceleration. So thanks for your time, and I’m going to hand it back over to Seong to show how the SmartSSD really works in a Spark cluster.

Seong Kim: Thanks, Steve. As mentioned earlier, let me introduce what SmartSSD is, how it operates, and what kind of acceleration can be done. From the outside it looks just like a regular standard NVMe SSD. Inside, there are 4 terabytes of Samsung V-NAND, and it also has an FPGA that can provide acceleration functions and 4 gigabytes of DDR memory for the FPGA to use. The next slide shows how that acceleration is processed. The x86 sends the data processing command to the SmartSSD. The FPGA reads data from the flash directly attached to the FPGA and then processes it locally. Then only the processed result is returned to the x86. So basically no CPU intervention is required, and the returned result is the data of interest only. This picture on the right shows the detail I just mentioned in the previous slide. The picture on the bottom shows how a data pipeline can be designed with acceleration functions such as decryption, decompression, parsing, filtering, and aggregation.
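A pipeline built from such functions can be sketched in software as a chain of composable stages. The example below is an assumption-laden illustration, not SmartSSD code: each stage is an ordinary Python generator standing in for an acceleration function on the FPGA, the decryption stage is left out for brevity, and all names (`stitch`, `make_filter`, `make_sum`) and the sample records are hypothetical.

```python
import json
import zlib
from functools import reduce


def decompress(chunks):
    """Stage: turn compressed chunks into raw bytes."""
    for chunk in chunks:
        yield zlib.decompress(chunk)


def parse(blobs):
    """Stage: turn raw bytes into records (newline-delimited JSON assumed)."""
    for blob in blobs:
        for line in blob.decode().splitlines():
            yield json.loads(line)


def make_filter(predicate):
    """Build a stage that keeps only records matching the predicate."""
    def stage(records):
        return (r for r in records if predicate(r))
    return stage


def make_sum(field_name):
    """Build a stage that aggregates one numeric field into a single value."""
    def stage(records):
        yield sum(r[field_name] for r in records)
    return stage


def stitch(*stages):
    """Chain stages left to right into one pipeline function."""
    def pipeline(source):
        return reduce(lambda stream, stage: stage(stream), stages, source)
    return pipeline


chunks = [
    zlib.compress(b'{"region": "eu", "sales": 3}\n{"region": "us", "sales": 5}'),
    zlib.compress(b'{"region": "eu", "sales": 4}'),
]
pipeline = stitch(
    decompress,
    parse,
    make_filter(lambda r: r["region"] == "eu"),
    make_sum("sales"),
)
result = list(pipeline(chunks))
print(result)  # only the aggregated value leaves the pipeline
```

Because every stage takes a stream and returns a stream, the same building blocks can be rearranged or dropped, for example omitting the aggregation stage to return the filtered records themselves.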
And we can create a customized data flow by stitching these functions together in any order, in any fashion. We are in the process of making SmartSSD available in the cloud. We also have partners who can provide turnkey solutions. For example, Bigstream provides the Spark acceleration, CTX provides a video transcoding solution, [inaudible] provides cybersecurity solutions, and [inaudible] provides data compression and decompression and also some regex functions as part of the acceleration solutions. This chart shows the SmartSSD technical spec. As mentioned before, the SmartSSD is a U.2 form factor, and it comes with a PCIe Gen3 x4 interface and 4 terabytes of enterprise-class SSD.
Okay. I’d like to conclude our presentation with this brief total cost of ownership analysis. To calculate this TCO, we did a lot of additional calculation, analysis, and performance measurement. The performance result used here is what Steve presented in the previous section: we are able to achieve around 4.3X acceleration with the SmartSSD. The tested configuration here is an Intel Platinum-class CPU and three SmartSSDs per individual server. So what is shown here is how much you have to spend to achieve 1,000 queries per hour with the CPU-only solution versus the CPU plus SmartSSD with Bigstream acceleration. As you can see, you can save more than 60% of the TCO, and this is the total cost of ownership over three years. So this shows you the key value of acceleration, how much savings you can make. Thank you for listening to our presentation. If you have any additional questions, please reach out to us. Thank you.
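The shape of such a TCO comparison can be sketched as simple arithmetic. Only the 1,000 queries per hour target, the 4.3X speedup, and the three-year horizon come from the talk; the per-server throughput and every cost figure below are made-up placeholders, so the printed savings figure illustrates the method and is not a reproduction of the result on the slide.

```python
import math


def servers_needed(target_qph, qph_per_server):
    """Smallest whole number of servers that reaches the target throughput."""
    return math.ceil(target_qph / qph_per_server)


def tco(servers, server_cost, yearly_opex, accel_cost=0.0, years=3):
    """Purchase cost plus operating cost over `years`, summed over servers."""
    return servers * (server_cost + accel_cost + years * yearly_opex)


TARGET_QPH = 1000  # from the talk
SPEEDUP = 4.3      # from the talk

# Placeholders below are hypothetical, not figures from the talk.
BASE_QPH_PER_SERVER = 50  # CPU-only throughput of one server
SERVER_COST = 10_000      # purchase price per server
YEARLY_OPEX = 2_000       # power, cooling, and space per server per year
ACCEL_COST = 6_000        # SmartSSDs plus software per server

cpu_servers = servers_needed(TARGET_QPH, BASE_QPH_PER_SERVER)
acc_servers = servers_needed(TARGET_QPH, BASE_QPH_PER_SERVER * SPEEDUP)

cpu_tco = tco(cpu_servers, SERVER_COST, YEARLY_OPEX)
acc_tco = tco(acc_servers, SERVER_COST, YEARLY_OPEX, ACCEL_COST)
savings = 1 - acc_tco / cpu_tco

print(f"CPU-only:    {cpu_servers} servers, 3-year TCO {cpu_tco:,.0f}")
print(f"Accelerated: {acc_servers} servers, 3-year TCO {acc_tco:,.0f}")
print(f"Savings:     {savings:.0%}")
```

The saving comes from needing fewer servers for the same throughput; each accelerated server costs more, so the result is sensitive to the placeholder prices and should be recomputed with real quotes.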

Seong Kim


Seong Kim, Ph.D., leads Xilinx’s system architecture team with a focus on machine learning, video, video analytics, database, SmartNIC, network security, and storage. Prior to his tenure at Xilinx, ...

Steve Tuohy


Steve runs the marketing organization at Bigstream and also works extensively on broader go-to-market efforts. Steve has had leadership roles in the data management industry at Alation and Cloudera as...