Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators

Download Slides

In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application. Current Spark SQL also leverages WholeStageCodegen to improve performance by generating runtime Java functions to eliminate virtual function calls and CPU registers for intermediate data. In this session, we would like to describe how FPGA can help typical Spark SQL workloads to reduce high CPU utilization and release more CPU power by leveraging a new WholeStageCodegen to generate runtime function calls to process data with FPGA. We also use Apache Arrow to hold Columnar Batch data inside native memory and manage its memory reference inside Spark, as well leverage the Apache Arrow Gandiva for just-in-time (JIT) compiling and Columnar Batch data evaluation.

To enable FPGA support in Spark SQL, operators process multiple rows in one function call, and one batch process function can process more data with fewer time. Which is to say, leveraging FPGA accelerator, we can move the CPU-intensive functions such as data aggregation, sorting or data group-by and large data sets to use FPGA IPs and reserve CPU resource for some mission critical or complicate tasks and limit the data moving as less as possible, which can improve the performance dramatically and enable Spark SQL to drive its performance to a next level. Finally, we will use micro-benchmarks and a real-world workload as use cases to explain how and when to use these FPGA IPs to optimize your Big Data applications and identify a typical Spark SQL workload profiling, highlight the hotspots during data aggregation, sorting and group-by, and figure out which function costs higher CPU utilization.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi, everyone, this is Weiting Chen from Intel. Today is my honor to be here to share with you about our experience for big data and the FPGA in the equation.

Accelerating SPARK SQL to 50X Performance with Apache Arrow based FPGA accelerators

So Intel has already co-work with Wasai Technology company for several years to provide the FPGA based solution for big data. So today is also my honor to invite Calvin, the CEO of Wasai Technology company, to join with us and also to share about their FPGA experiences from Spark.

So this is today’s agenda, we would like to highlight about the current problem and the issues in Spark. And those are how we can use our solution to address loads problem and also to solve loads issues by using our solution, and the Wasai company will help us to show more data about the performers, and also about how FPGA can support Spark to let Spark run well.

So, this is our background, I don’t want to go there, it’s just a quick reference to help you understand what Calvin and I are doing, and also what we have done.

So go back to see our motivation for the solution, as you can see, there are two major reasons. The first one is about how we can use, just unleash the power from the CPU by using SIMD instruction set support, just like in CPU, there are SSE or AVX2 support for SIMD instruction. So, we would like to do is about to provide, to enable load technology into the Spark. And for the second reason is about, there’re so many different accelerators in the world like FPGA or GPU, and we would like to provide a heterogeneous platform, then all the accelerator can run with CPU together and also to speed up the performance. And then we have seen, there are so many new features and Spark 3.0 can help to support this. And this is why we would like to leverage the power in Spark 3.0 and also to create a heterogeneous platform in the world.

So, when we think about those solutions for FPGA and the Spark, there are so many issues in code and the Spark. When you think about the, all the, almost all the accelerators they are based on columnar best data processing. So, there will have overhead about to transfer the data from row fast format to columnar best the format. So, you have to consider about this overhead in Spark, and this is why we would like to provide AVX support in Spark since AVX is also based on column based data processing. So, we would like to provide the columnar-to-columnar and the end-to-end data processing workload to handle the data just to use a columnar format in Spark, and also when we consider about how to integrate with load accelerators, there’re some new features in Spark 3.0, can help us to achieve this goal, but besides those, we have some other problems still need to solve just like how to coordinate FPGA and CPU by using a better way and also to reduce the data copying or serialization overhead when we copy the data from hostile memory into the device memory, and there’re also some performance issue when we transfer the data by using the DMA engine.

So our solution is actually based on loads problems we would like to solve and finally, we figure out we must use Apache Arrow and some new features in Spark 3.0 to create a plugin with recorded Intel OAP Native SQL Engine plugging, and by using this plugging, we can support Spark with AVX support and also to integrate with some other accelerators, and also we collaborated with Wasai Technology company into working on plugging and also to combine with Leon solution with our own plugging together to create a heterogenous platform.

So, our solution is actually to target for end-to-end columnar-to-columnar data processing. Just like I said, by using this way, it can avoid the overhead for columnar-to-row or row-to-row columnar issues in code and the Spark. And also either can just leverage AVX support and also to run faster by using a columnar faster data processing. We’re also working with Wasai to integrate FPGA support. So, by using this way, it can also offload sounds the best civic operators by using FPGA and also can let Spark running better.


So, in our own solution, one of the key is actually to leverage the power of Apache Arrow. In Apache Arrow, you not just to leverage the Apache Arrow, but train the different framework, but also you can use Apache Arrow to transfer the data between hoster and the device. And furthermore, if your accelerator is also using arrow format to the computer, then you can also get the benefits to reduce the data transfer from arrow to a custom format when you would like to compute it in your accelerator. So by using Apache Arrow, it can get unified in memory format as a data format, and then to just the computer data by using whatever you want.

And we also leverage some new features in Spark 3.0. Actually, there are so many new features in 3.0 and whether you would like to highlight these two features. The first one is about public API for extended Columnar processing support.


So (murmurs) used the least features to integrate with their own rapid API and also to leverage GPU support by using this feature. In our case, we also use the same way to support Native SQL Engine plugging, and also to communicate with our own plugging engine and also to transfer the data format by using arrow format, then to use an arrow computer to run the entire SQL engine. And also then the SQL command can run by using a columnar way, and also to leverage the AVX support.

And for the second feature is about accelerator aware, task label scheduling. It’s also a very important part when we consider to leverages GPU support by using these features. So we’re also under discussing with Wasai Technology company to get, FPGA can also support this feature in the future. So, in the future, we can see, in Spark 3.0, we can create a very good platform for heterogenous system and also to allocate CPU and the other operators by using Spark view in technology.

Intel Optimized Analytics Package(OAP): Native SQL Engin An End-to-End SPARK Columnar based data processing with Intel AVX support

So in this Tiguan, it is a framework about our Intel, our OAP Native SQL Engine plugging. So you can see there are three major features. The first one is about arrow best data source processing, so when you read the data from a pocket file, you can just use it on JSON API to access data and also to transfer it to arrow best, he format in the memory. And the for second part is about arrow best data processing. So we just rewrite the Master SQL operators by using a native way and also leverage every access support. So this way can help us to handle the data when your data has already in the memory with arrow format and also to process the data, to compute the data by using Native SQL Engine and also to get the support from AVX. For third features is about columnar our best Shuffle. This is also leverage arrow best columnar shuffle, so there are two major benefits. The first one is about to avoid the serialization issue by using arrow best the format. And then the second benefit is about, we can leverage AVX support to full loads compression codec and then the compression codec can run them faster by using arrow format and AVX support. And we are also working with Wasai to integrate their own FPGA solution into our plugging. So, eventually it can be a very good heterogeneous platform by using this plugging with Spark.

So this is an example to show you about the details when we compare with Columnar and Spark, and also with our own plugging. So, in current Spark since, it can only support arrow best data processing, so when you read a parquet file, you must transfer the data by using a format, we call it Columnar vector, and then you may need to transfer the format to internal role, then the Spark can process your data row by row. After that, you may need to transfer the data again to Columnar vector or format, then to leverage columnar writer to lighten the data for better performance. So in our plugging, we just leverage Apache Arrow as a single format, and then when you read the data and also to process those operators, you just have to copy the data when you’re running a file scan and also use Apache Arrow format to process the data by using a Columnar Vector operator, like the join in or group by. After that, you can also transfer the data to third party FPGA accelerator by using arrow best for the batch format. So, if your accelerator is also use arrow best format, then you can also reduce the data transfer again when the data in your kernel. So, after that, you just need to come back to the data with the arrow return the batch format and then the CPU can handle the rest of the operators which your FPGA which cannot support. So this is, entire idea is actually based on Apache Arrow and also to create a very simple data transfer system when we are running those data processing.

And this Apache is actually to highlight about our OAP Native SQL Engine plugging. So I just did this on the features and also most of our operator has already implement and we are keep going to, then it can complete and also to, to achieve our goal just to get the end-to-end Columnar-to-columnar platform. So this plugging is actually totally open source. So we would like to share with you about this link on the top of the page and it will indulge you to go to the website and also to try it by yourself. So, in the next patch, I would like to introduce Calvin Hung from Wasai Technology company, and also to share with us about how FPGA can help our Spark better. Thank you. – Thank you so much Weiting. Hello, everybody, I’m Calvin Hung, CEO and co-founder of Wasai Technology. We are based in Taipei Taiwan. Glad to have this chance to talk about what we have been working on Spark and Hadoop acceleration for this FPGA for more than four years. Weiting has already been talking through Apache Arrow and probably dealt on Spark acceleration. So I’m going to start with just a quick overview of what has been done with FPGA accelerators on Spark. The issues encountered before Apache Arrow comes into picture, why and how we are trying to solve that with Apache Arrow and FGPA accelerators together, and some of the things that are part of it.

I’m not going to go into too much technological detail about how we used to implement arrow due to the time limitation, but it will come to you come to us if you are interested, and some of the things that we have learned from in terms of all the performers can benefit when you are working with arrow, and some of the accelerators we came up with. So hopefully, it will be valuable to you.

From us and the Spark communities experience, we learned a lot about why SQL is not good enough to deal with big data on the CPU intensive context, and the GPU and FPGA are brought into this context. FPGA essentially is more energy efficient, reconfigurable and can accelerate this specific test is cross inserted. Many means have been introduced to tackle the FPGA challenge, however, even with the state of the art compiler technology for Java, or C like as it was existing, or open CL2AR, memory and data streaming issues are still too complicated to be taken care of. Under the context of modern always memory native system, usually rocks, needs kernels. In order to offload jobs from JVM to FPGA, data are read into JVM before set as strength to any barrier to entry code. Overhead is expensive in memory, and its management is brought into this picture to improve efficiency. Arrow can help to hold data in native memory from the scratch and then at the very beginning, improving the manager here and reducing overheads.


Here is what we view as a first experiment with Sparks SQL, FPGA and arrow together. We use arrow to read parquet data into native memory and user data in native memory deposits back and forth with native data and FPGA. This is a simple and monetarists arrow of Spark SQL to work with FPGAs through arrow.

As you can see, there is not much JVM involved here and thus reduce the overhead of JVM itself and Java interface, the JVM to native part.

And furthermore, we can still improve efficiency with better design architecture.


A normal Spark SQL query will go through these stages to decide the physical execution plan. At first, we’ll spend some time to pick out the architecture of Spark SQL and find out that the best way to get Apache involved with Spark is to make Apache involved in code generation execution.


We leverage (mumbles) architecture by modifying current Spark SQL framework to offload tests from CPUs to FPGAs. It’s almost a common purchase for FPGA accelerators with Spark SQL now, the code for FPGA execution is to be generated with physical plan at pump generating stage. So at pump execution state, the data collected previously will be processed by FPGA through the code generated by code generation state.


We make a small benchmark. Our benchmark is a simplified TPC-DS Q55 by removing sold operation. We want to see if arrow can still help if the query is maths on complicated, and also it’s secure intensive. We tested it with TPC-DS data set, as several different scales and obtain a 33% improvement in performance comparing the one and with resolve arrow involved. You can see that the speed up is getting greater speeder data state level. With testing, and there’re tools in YARN static CPU server is wanting to take a working code.


We also provide libraries in each system there for different application levels to get involved with our execution framework and FPGA. You can leverage different libraries depends on different levels of usage, you would like to involve FPGA and modify your application code. Usually the lower layer library can get you the better efficiency, despite more application chance.

In addition to Spark SQL acceleration with FPGA, we also provide several other FPGA accelerators for interest Spark operations such as group by key, for by key, or RTP. Generally start with teams solve for Spark and Hadoop, several compression codecs, visual coding and file format causes the importance of file forming policies, also they must Apache Arrow which a certain way we do is with different level of performance as you can see on the table. The most of your intensive operation is, the better performance boost we can get.

The key takeaway today is to introduce end-to-end columnar data processing to avoid extra overhead for columnar-to-row or row-to-columnar as we mentioned. And native file support is AVX and AV2X acceleration. Also see where our engine overhead can be greatly reduced with Apache Arrow.

Last but not least, FPGA can also help to accelerate Spark in all kinds of CPU intensive operations.


More features will be created in OAP Native SQL Engine by Intel in the community, in the coming modules with open source of it. In integration of OAP Native SQL Engine and FPGA accelerators, are also underway in Wasai and Intel together, hopefully to be done in a couple months.


We encourage you to try OAP Native SQL Engine for Spark from GitHub, and also please come to us if you have more interest with our SparK FPGA acceleration solution.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Calvin Hung

WASAI Technology, Inc.

Calvin is the co-founder, CEO and CTO of WASAI Technology, which is specialized in FPGA-based datacenter accelerations for Apache Spark, Apache Hadoop and Genomics Analysis applications. He has more than 15 years of experience in software and hardware architecture co-design and performance optimization across a broad range of platforms, including distributed systems, embedded systems and enterprise datacenter applications.

About Weiting Chen

Intel Corporation

Weiting is a senior software engineer at Intel Software. He has worked for Big Data and Cloud Solutions including Spark, Hadoop, OpenStack, and Kubernetes for more than 5 years. He has also worked for big data and Intel architecture technologies research including CPU, GPU, and FPGA. One major responsibility for him is to research & optimize Big Data technology and enable global customers to use Big Data with Intel solutions. Weiting is working on next-generation big data technologies on Intel x86 platform.