RAPIDS is a set of open source libraries enabling GPU-aware scheduling and memory representation for analytics and AI. Apache Spark 3.0 uses RAPIDS for GPU computing to accelerate various jobs, including SQL and DataFrame workloads. With compute accelerated by massive parallelism on GPUs, data access needs to be accelerated as well, and that is what Alluxio enables for compute in any cloud. In this talk, you will learn how to use Alluxio and Spark with the RAPIDS Accelerator on NVIDIA GPUs without any application changes.
Adit Madan: Hello everyone. Welcome to the talk: Advancing GPU Analytics for Apache Spark with RAPIDS and Alluxio. My name is Adit Madan. I’m a product manager at Alluxio, where I’ve been for the last five years. I’m also a member of the project management committee of the Alluxio open source project. I’m very excited to present this talk with Dong from NVIDIA, without whom this collaboration wouldn’t have been possible. For those of you who are not familiar with Alluxio, it is an open source project which came out of the same research lab at the University of California, Berkeley, as Apache Spark. There are companies using Alluxio across different sectors, as you can see on this screen: especially technology, financial services, telecommunications, and media, but also genomics and pharmaceuticals.
Now, there are a few trends which justify the whole premise of the talk today. When we look at the evolution of the big data ecosystem, we started with co-location of compute and storage, with one compute framework, which was Hadoop, and one storage system, which was HDFS. Since then, we’ve come a long way, and the first step in that journey was a disaggregated model in which compute and storage were not located on the same set of nodes. We’ve also transitioned from HDFS to new, cheaper, and more modern object stores. At the same time, new compute frameworks have come into the picture, each of which does a good job for a specific task. There has also been improvement in processor technology, including CPUs and GPUs. Now, when we look at the acceleration provided by GPUs specifically, we’ve come a long way. This speeds up the computation phase of an application, but data access becomes a problem if not addressed.
With the advancements in GPU computing, GPUs drive higher I/O traffic compared to CPUs, and that’s what we’re going to focus on today: the data access path and how it affects end-to-end compute jobs. Alluxio is a data orchestration system. It’s a layer which sits between different compute frameworks and different storage systems. You can see different compute frameworks at the top of this picture and different storage systems at the bottom. Alluxio, as the layer which sits in the middle, manages the movement of data across these disparate systems. Now, this movement itself can be access-based. For example, when a compute framework at the top accesses a storage system at the bottom, data is moved on demand from the storage system to the compute system, and only the data that was accessed is moved. Data movement can also be policy-based, where data is moved across storage systems based on policies such as access time.
A modern data analytics or AI pipeline typically consists of a series of steps: Ingest, ETL, Analytics, or Model Training. Different frameworks can also be used at the same time for different stages of the pipeline. Alluxio sits between the compute across these different stages and the storage systems. Now, when a Spark job in the Ingest phase writes to Alluxio, the same data is cached in Alluxio and available for future stages in the data pipeline. Today, we’re going to talk about a stack with NVIDIA GPUs for compute acceleration and Alluxio for I/O acceleration.
The solution we are going to talk about is applicable across different scenarios. As you can see in this picture, there are two scenarios to keep in mind. The first is an all-in-cloud scenario, shown on the left-hand side, which could be managed Hadoop or Kubernetes in a single cloud: Spark running on EC2 or similar cloud instances and accessing storage in cloud object stores. Alluxio is typically deployed on the same instances as Spark, as a separate process which caches and manages data accessed across different storage systems, such as cloud object stores. The same solution is also valid on premises, where you have enterprise Hadoop or Kubernetes offerings, with Spark and Alluxio accessing on-premises object stores, as you can see in the picture on the right-hand side. There are a few key takeaways from the talk today that I would like you to keep in mind.
One is that Alluxio unifies different data lakes. If you have multiple storage systems in a single cloud, or across a private data center and a public cloud, Alluxio is a unifying layer which makes all data available in a single unified namespace. The second is that Alluxio manages efficient representation of data and caching for fast data access. At the same time, the solution is applicable across different environments, as you’ll see later in the talk. With that, I’ll pass it on to Dong, who’s going to talk about our integration and our collaboration with NVIDIA.
Dong Meng: Thank you, Adit, for the great introduction. Let me introduce myself a little bit. My name is Dong. I’m a Solutions Architect at NVIDIA. I specialize in accelerator performance tuning for machine learning, deep learning, and data processing workflows. To give more background on RAPIDS for Spark: its development is a joint community effort led by the NVIDIA engineering team, and there are two major Spark features contributed. The first is GPU scheduling. As we know, Spark runs in various cluster scheduling environments, such as YARN, Kubernetes, or standalone. We enabled GPU support in all those schedulers so that Spark can discover, request, and assign/allocate accelerators using Spark APIs. In the future, Spark will also be able to dynamically request GPU resources at a more granular level, such as the Spark stage level. Second, Spark 3.0 enables columnar processing: it extends Catalyst, the Spark core query engine, to allow a SQL plugin to operate on a columnar format.
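As a sketch of what that scheduler support looks like in practice, a Spark 3.0 job can request GPUs through configuration alone (the amounts below are illustrative; `getGpusResources.sh` is the example discovery script shipped with Apache Spark):

```shell
# Ask each executor for 1 GPU, and let 4 tasks share it (0.25 GPU each).
# The discovery script tells Spark which GPU addresses a node has.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  ...
```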
This means Spark nodes can exchange RDDs of columnar batches instead of RDDs of Spark SQL rows. Through these Spark features, we brought GPU acceleration to Spark, leveraging the CUDA-X stack, which is the foundation of the NVIDIA GPU software ecosystem. Currently, RAPIDS for Spark accelerates Spark SQL and DataFrame APIs. We also accelerate machine learning libraries such as XGBoost, as well as deep learning libraries. The changes needed to enable these accelerations are minimal. We built all this acceleration on two pillars: the first is RAPIDS and the second is UCX. Looking at Spark, there are really two things we want to accelerate, and they are interwoven: Spark SQL and Spark shuffle. First of all, the RAPIDS Spark SQL plugin is what we use to accelerate Spark SQL. It is implemented with cuDF, an open source library that uses Arrow-based data in a GPU format.
Our Spark developers created a Java API on top of cuDF to be leveraged by Spark. With the RAPIDS Spark SQL plugin, we provide a transparent drop-in replacement; the user only makes changes to configurations. We know that in a Spark SQL workflow, Spark first converts the SQL query to a logical plan and applies optimizations such as filter push-down, then converts the logical plan to a physical plan, compares physical plans through a cost-based model, and finally goes through code generation to produce the DAG of RDDs. Our modifications happen before codegen: we provide drop-in replacements of CPU operators with GPU operators, so we keep all the logical-plan-level optimizations. For the newer Spark releases, we also work really well with newer features like adaptive query execution (AQE).
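A minimal sketch of what "changes only to configurations" means: the plugin is enabled entirely through `spark-submit` settings (the jar names below are illustrative of a RAPIDS Accelerator deployment):

```shell
# Add the RAPIDS Accelerator and cuDF jars, and enable the SQL plugin.
# No application code changes are needed.
spark-submit \
  --jars rapids-4-spark.jar,cudf.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  ...
```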
For Spark shuffle, we leverage UCX. It’s well known to Spark developers that shuffle is the biggest challenge when using Spark. With UCX, we can combine the shuffle data into larger chunks and enable GPU-centric data movement across the network, storage, and GPU memory.
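As a hedged sketch, enabling the UCX-based shuffle typically means swapping in the RAPIDS shuffle manager; the exact class name is Spark-version-specific, so the one below is an assumption:

```shell
# Replace Spark's default shuffle manager with the RAPIDS/UCX one
# (class name varies per Spark version; spark301 is an assumed example).
spark-submit \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager \
  --conf spark.rapids.shuffle.transport.enabled=true \
  ...
```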
So, how do you experiment with a RAPIDS Spark release? RAPIDS Spark is available in all clouds. First of all, AWS EMR ships with it in every release after 6.1. We also provide bootstrap scripts for Google Dataproc and Databricks. Beyond those environments, we also work with vendors like Cloudera to provide support in various enterprise environments. In any of these environments, it is very easy to allocate scale-out T4 GPUs. The price is really good, so it’s very economical, and you can use them to spin up a RAPIDS Spark cluster to give it a try yourself.
Now, with all this acceleration, Spark SQL queries that used to be compute-bound are now I/O-bound. We observe that storage and network often become a bottleneck, especially with modern cloud enterprise data architectures, which are often multi-cloud or hybrid-cloud. This led us to think that we could leverage a data caching and orchestration system to help performance, and that’s where Alluxio stepped in. We worked closely with the Alluxio team, and I just want to give a shout-out to Adit Madan and thank them for their help in making this possible.
Alluxio and RAPIDS Spark are really easy to use together. For existing Alluxio users, you can just install the RAPIDS Spark plugin and point Spark SQL at the existing Alluxio files. For existing Spark users, after deploying Spark RAPIDS, we also provide a configuration in Spark RAPIDS to map a storage path to an Alluxio namespace. Either way, the user can combine Alluxio with RAPIDS Spark easily.
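The path-mapping configuration referred to here is `spark.rapids.alluxio.pathsToReplace`. A sketch, with the bucket name and Alluxio master address as illustrative placeholders:

```shell
# Transparently rewrite s3:// paths to the equivalent Alluxio paths,
# so existing Spark jobs read through the Alluxio cache unchanged.
spark-submit \
  --conf spark.rapids.alluxio.pathsToReplace="s3://my-bucket->alluxio://alluxio-master:19998/my-bucket" \
  ...
```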
So, to show the value, we ran a set of benchmarks in a cloud environment. Here, we chose Google Dataproc because we have init scripts that make it really flexible to deploy and configure Alluxio with RAPIDS Spark. GPU nodes cost more than CPU nodes, but because the GPU software is so much faster, the query runtime cost is actually much lower.
Also, a typical analytical workflow, such as a machine learning or AI pipeline, is often a sequence of steps that happens iteratively on the same set of data. That’s why we think Alluxio can really help boost performance. Here, we used a set of standard SQL queries, which we call NVIDIA Decision Support queries, to run this benchmark.
Applying both Spark on GPUs and Alluxio, we ran about 90 SQL queries on three terabytes of CSV data spread across multiple tables, compressed down to one terabyte of [inaudible] files. These charts show that with the GPU cluster and the Alluxio deployment, we get nearly a 2x improvement in performance when comparing the total runtime and cost of the 90 queries, and a 78% better return on investment compared with a CPU cluster. We want to note that not every query is GPU-friendly. These are common queries that you might be using, and we see a great overall speedup, so we want to use that to demonstrate the benefits.
To make the best use of Alluxio and RAPIDS Spark, we recommend users look into a few configurations. First of all, co-locate Alluxio workers with Spark GPU workers. This enables the application to perform short-circuit reads and writes with local Alluxio workers, which is more efficient than going remote. It’s also very important to allocate enough space to the Alluxio cache; in fact, it’s recommended to have enough cache space for multiple copies of the data. Users can choose between memory and SSDs as the cache medium for the Alluxio workers, and unless the data is small enough to fit in system memory, it’s often beneficial to deploy more economical SSDs as the caching layer, which also improves the TCO.
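A sketch of configuring an SSD cache tier on co-located Alluxio workers; the mount paths and quotas below are illustrative assumptions:

```shell
# Configure a single SSD tier for Alluxio worker caching.
# Short-circuit reads are on by default when clients and workers co-locate.
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ssd1,/mnt/ssd2
alluxio.worker.tieredstore.level0.dirs.quota=1TB,1TB
EOF
```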
On the RAPIDS Spark side, first of all, spark.rapids.sql.enabled is the main switch to turn GPU acceleration on and off. When debugging queries, it’s also very useful to use spark.rapids.sql.explain; when we set it to NOT_ON_GPU, it prints out only the incompatible operations that cannot run on GPUs. A common bottleneck when using the RAPIDS plugin is that operators in Spark queries hop between the CPU and the GPU, causing a lot of data movement, which we want to avoid. There are also some common, basic configurations, such as concurrency. We know that concurrency in Spark is controlled by the number of tasks per executor. With RAPIDS for Spark, we introduced an additional parameter called spark.rapids.sql.concurrentGpuTasks. With this, we can further control the GPU task concurrency to avoid out-of-memory exceptions, given the limited GPU memory.
Some queries benefit significantly from setting this to a value between 2 and 4, and typically 2 provides the most benefit. It is also beneficial to set a higher number of tasks per executor, for example four tasks with two concurrent GPU tasks, so that we can overlap I/O with compute. For example, one task may be communicating with the distributed file system to fetch an input buffer, which is not compute-heavy, while another task decodes an input buffer on the GPU, which is compute-heavy. That way we always have some tasks doing I/O and some tasks doing compute, which is good for performance. However, configuring too many concurrent tasks might lead to excessive I/O and overload the host memory.
We typically found that two times the number of concurrent GPU tasks is a good value for tasks per executor. Setting spark.rapids.memory.pinnedPool.size also helps, because it improves the performance of data transfers between GPU and CPU memory, as the transfers can be performed asynchronously from the CPU. Ideally, the amount of pinned memory allocated will be sufficient to hold the input partitions for the number of concurrent tasks that Spark can schedule for the executor. Okay, I’m going to hand it back to Adit, and he will talk more about exciting new features in Alluxio.
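Pulling the tuning advice above together, a hedged spark-submit sketch (all values are illustrative, following the two-tasks-per-concurrent-GPU-task rule of thumb):

```shell
# 4 executor cores with 2 concurrent GPU tasks: 2 tasks overlap I/O
# while 2 tasks compute on the GPU. Pinned memory speeds CPU<->GPU copies.
spark-submit \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  --conf spark.rapids.sql.concurrentGpuTasks=2 \
  --conf spark.executor.cores=4 \
  --conf spark.rapids.memory.pinnedPool.size=2G \
  ...
```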
Adit Madan: Thanks, Dong, for the great section. Next, I’m going to talk about some core features in Alluxio which enable the solution we just talked about. Alluxio is, like I said, a data orchestration layer, which is accessed using different interfaces. A system like Spark typically accesses Alluxio using the HDFS-compatible API, which enables using Alluxio and RAPIDS with no application changes. Alluxio, as the layer in the middle, talks to different storage systems using different drivers, which you can see at the bottom of this picture. For example, we use the S3 driver to talk to Amazon S3, and in the case of GCS, we have a driver to talk to Google Cloud Storage. Machine learning applications, such as PyTorch applications, talk to Alluxio using a POSIX interface, and that’s also the interface that TensorFlow uses to talk to Alluxio.
Now, the beauty of having multiple APIs is that Alluxio enables a global data platform for the various applications which may be present in your data platform. One key thing to mention here is that multiple storage systems are presented to the application in the same filesystem namespace. The way this works with Alluxio is that different storage systems are mounted to a particular section of the Alluxio filesystem namespace. As you can see in this picture, the root of the Alluxio filesystem points to an HDFS location on the right-hand side, but at the same time, a directory called data is backed by two storage systems simultaneously. Data has sub-directories called reports and sales, and these two sub-directories are present in HDFS but also in S3 at the same time. This powerful capability enables features in Alluxio such as policy-based migration, in which data can be moved across different storage systems without any interruption to the applications.
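A sketch of how mounting works with the Alluxio CLI; the paths, bucket, and credential placeholders below are illustrative:

```shell
# Mount an HDFS directory and an S3 bucket into one Alluxio namespace;
# applications see both under a single filesystem tree.
alluxio fs mount /data/reports hdfs://namenode:9000/reports
alluxio fs mount \
  --option s3a.accessKeyId=<ACCESS_KEY> \
  --option s3a.secretKey=<SECRET_KEY> \
  /data/sales s3://my-bucket/sales
```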
From a performance standpoint, in the solution that Dong mentioned, the reason we see higher performance with Alluxio in the mix is that Alluxio manages tiered local storage on the nodes on which Spark is running. We would typically carve out, for example, SSD resources on the nodes where Spark is running for Alluxio to manage. Alluxio manages this local storage for caching data which is remote. So, in the case of Google Cloud Storage, any data which is accessed and is hot is cached by Alluxio transparently to the application.
This intelligent multi-tiering ensures that we see performance benefits for data when it’s cached, while data which is no longer being used is evicted. At the same time, Alluxio caches metadata with consistency. When you access a file in GCS using Alluxio, Alluxio caches the metadata in the Alluxio filesystem master. Alluxio has a master-worker architecture, in which the master is responsible for the metadata and the workers are responsible for caching the data itself. The cache in Alluxio is consistent, so if you make any changes to the file system, such as a mutation on file one, as you can see in this picture, those changes are reflected on the Alluxio filesystem master. Data operations can be synchronous or asynchronous. Synchronous data operations are when you write to Alluxio and you also want the writes to propagate to the underlying storage system.
In the case of reading, which is the case in the joint solution we presented today, caching actually happens asynchronously. When data is read for the first time, we return it to the client, so the reader gets it successfully, but we also cache the same data asynchronously in the background, reading from the underlying storage system, such as GCS, into Alluxio-managed storage. The same is possible for writes. You can also trigger loading of data from GCS into the Alluxio namespace by using a tool in Alluxio called distributed load. The purpose of a tool like distributed load is that you can preload data on specific triggers so that the first access of the data doesn’t take a performance hit.
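A sketch of triggering a preload with the Alluxio CLI; the path and replication count are illustrative:

```shell
# Warm the cache across workers before the first Spark job runs,
# so first-access reads don't pay the cold-read penalty.
alluxio fs distributedLoad --replication 1 /data/sales
```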
A few key takeaways from this talk, just to reiterate. One is that Alluxio unifies different data lakes without any application changes; in the solution we presented, Spark applications can be migrated to using GPUs and Alluxio without any application changes. Data locality, with Alluxio managing storage close to the compute on the same instances as the compute, ensures you get the benefits of data locality and the associated performance. The performance benefit in this case, as we saw, was a 2x acceleration with Alluxio in the mix, and with GPUs and Alluxio, we saw a higher ROI compared to CPUs. The solution with both RAPIDS and Alluxio is applicable to multiple environments, across different cloud providers or on-premises environments, either bare metal or Kubernetes.
Here are a few references in case you want to learn more about the solution. We have a joint blog with NVIDIA covering the presentation today. We also have getting-started documentation for RAPIDS and Alluxio together, as well as documentation for RAPIDS itself and for running different frameworks on top of Alluxio. Thank you a lot for joining the talk today!
Adit Madan is a product and product marketing manager at Alluxio. Adit is experienced in multiple roles and is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open ...