How R Developers Can Build and Share Data and AI Applications that Scale with Databricks and RStudio Connect


Historically it has been challenging for R developers to build and share data products that use Apache Spark. In this talk, learn how you can publish Shiny apps that leverage the scale and speed of Databricks, Spark, and Delta Lake, so your stakeholders can better use insights from your data in their decision making. The speakers will walk through how to decouple a Shiny app from a Spark cluster without losing the ability to query billions of rows with Delta Lake. Learn how to safely promote models from development to production with the MLflow Model Registry on Databricks. By tracking model experimentation with MLflow and managing the lifecycle with the Registry, organizations can improve reproducibility and governance when publishing artifacts to RStudio Connect for batch or online scoring with Shiny or Plumber APIs.

Sample of topics discussed:

  • The best way to leverage Spark for a Shiny app, and how to make that Shiny app reliably available to your decision makers
  • Benchmarking performance of connecting to Spark from Shiny natively or via JDBC/ODBC
  • Programmatically managing models trained on Databricks with the MLflow Model Registry
  • Exploring different serving patterns for MLflow models with R


Video Transcript

– Thank you, everyone, for joining us in our breakout session here. Today we're going to be talking about how R developers can build and share data and AI applications that scale with Databricks and RStudio Connect. So for today's agenda, I'll be presenting along with my peer from RStudio. My name is Rafi Kurlansik. I'm a senior solutions architect at Databricks. And just a little bit about me, I've been working in the data and AI space for about five years. I got my start by doing the Johns Hopkins data science specialization on Coursera, which is taught in R, so I've always been in the stream of the R ecosystem, learning as I go in that world. My topic is going to be around how to develop scalable R and Shiny applications with RStudio and Databricks. And James, why don't you introduce yourself? – Great, thanks Rafi. As mentioned, my name is James Blair. I'm a solutions engineer at RStudio. I've been with RStudio for a couple of years. And I focus mainly on expanding the R tool chain by enabling R users to connect to and leverage other technologies, so things like ODBC and APIs, using those from within R to expand the capabilities of R users and R developers. And Rafi is going to talk, as he mentioned, about the development process a little bit and setting that up within Databricks and using RStudio in that context. And then we'll extend that, and I'll discuss how we can take those ideas and move them towards deployment using things like Shiny, still leveraging the power of Databricks as the back-end engine. – Thanks a lot, James. Like I mentioned before, I'm going to focus on the development side of how to work with RStudio and Databricks and build these scalable applications. So let's get started.

So the question we kind of want to answer here, is how can we open up the data lake to R users? The value of the R ecosystem is apparent in the number of packages that are available for sort of the whole lifecycle of building data products, whether that’s from doing exploratory data analysis and data visualization all the way through to statistical modeling and so on.

And obviously developing Shiny applications that are interactive and can be shared with other users. So that's all great, but you can run into some problems when you want to combine that rich set of functionality and that power with larger datasets. Consider the typical way that you would develop R applications today: usually you're going to be working on your laptop, so it's a local environment that you're working with. Or maybe you have a virtual machine in the cloud with one of the public cloud providers, or an on-prem virtual machine that you have access to, which might have more memory and more compute resources than your laptop, so you could load larger datasets there. But there are still challenges associated with that. If we break down the challenges that face R users as they start to work with larger and larger datasets, we can look at them in three different dimensions. The first is that the memory of the machine you're working with can only process so much data before R crashes, so essentially there's a limit built into working on one machine. If you wanted to keep getting larger and larger single machines, eventually you wind up with a mainframe. So that's one challenge. The second challenge is performance. Even if you provision a very powerful instance, you will still eventually see performance issues when you're working just with R, without some other technology to help R scale. And the third is that for any application you're considering building that will use big data, the value of that app has to justify the extra energy and investment that go along with managing that infrastructure.
So if you're going to be building your own platform to do big data analysis with R, or using any approach where your data team has to manage the infrastructure themselves, that sets the threshold higher for what kind of apps you can build, because of the extra overhead involved. So wouldn't it be nice if there were a technology available, with a familiar API in R, that would let your R application scale gracefully? We're happy to tell you that there definitely is.

Scale R Apps with Databricks and RStudio

And the way we're going to talk about doing that today is by using Databricks and Spark as the scalable cluster compute, and the RStudio IDE and other RStudio products as the way to develop and share those apps. So what I'm going to go over is the two development patterns for working with RStudio and Databricks. The first is hosting RStudio Server, either RStudio Server Pro or RStudio Server open source, on a Databricks cluster. And the second is where, no matter where your RStudio instance is, you can remotely connect to Databricks using Databricks Connect. I'm going to go over both of those examples in more detail.

And how does this address the challenges we mentioned before? Databricks has autoscaling Spark clusters that will dynamically respond to and accommodate larger data that you're trying to access and process. So if you run a query with sparklyr that is larger than the last query, with autoscaling Databricks will provision more resources to accommodate that load and keep churning through the data. The other piece is that Databricks offers the ability to use scalable storage like Delta Lake, and there are also features in the Databricks Runtime that provide faster execution. So as your data grows, we shouldn't hit that wall of performance degradation; using Delta Lake and Spark together, we're able to scale to petabytes of volume. And the last one is that Databricks is offered as a managed service, so your data team can focus on just building data products, not maintaining infrastructure. That can help you move a lot faster, and it actually lowers the value threshold for the apps you would think of building, so it helps you build more at a lower cost and gives your team more throughput. With all that said, let's look at these two development patterns, the first being hosted RStudio Server Pro on Databricks. In this architecture, your users access the data lake through a Databricks cluster, where RStudio Server Pro is installed on the driver node of the Spark cluster. Then through the sparklyr API you're able to send commands to the Spark workers, read data in directly from your data lake, and do work on it. So this is a great way to build scalable R applications directly on your data lake, open that up to your users, and have it hosted on Databricks. And I'll be showing a demo of how you can do that.

So now I'm going to switch over to the demo, and I'll show you how you can easily set this up on Databricks. Let's take a look at how easy it is to get started with RStudio Server, whether that's RStudio Server Pro or RStudio Server open source, hosted on a Databricks cluster. Here what we're looking at is the cluster UI for this RStudio cluster that I just spun up. As long as I select Databricks Runtime 7.0, RStudio will be automatically installed on the cluster. Prior to Databricks Runtime 7.0, we would just attach an init script to the cluster under Advanced Options, and all that information is available in our public documentation; we'll have links to that at the end of the presentation. The other thing we have to do is disable auto termination. The reason we do this is so that users don't lose their work by accident if they walk away from their machines for a little bit. Once we have those two things set up, come over to the Apps tab, and you'll see a button here to set up RStudio. You're going to get a unique password for each user each time this cluster is turned on. And then we just sign in, essentially. So I'm going to put my username in here.

And then I'm going to paste the password that I just copied from the cluster UI. And here we are: this is RStudio Server open source. If I brought my own RStudio Server Pro license, this would be RStudio Server Pro. If we take a look at where this is, it's on the driver right now. This is installed on the driver of our Spark cluster, and this working directory is actually on the driver node of the cluster. So to make sure we can persist our work, just a quick tip, I want to recommend that you set your working directory to a path on DBFS. You can also work with version control and things like that. But this is important because DBFS actually points to cloud storage, even though it's a path that's mounted to every single node in the cluster, including the driver. Now that I've set my working directory to DBFS, anything that I save will be persisted from session to session. Next, let's talk quickly about how to actually access the data lake with Spark. The first thing to do is to create a connection from sparklyr to the Spark session, and then we can start using sparklyr to analyze the data. So let's get our Spark connection with sparklyr going, using spark_connect with method = "databricks".

Once I run this, you'll see that the connection to the Hive metastore is recognized, and all of these tables are tables that are actually in our data lake. This can be considered looking at the tables in our data lake; each one of these could be a massive, massive table. And if we want to work with one, we can directly access it with spark_read_table, passing our Spark connection and the name of the table we want to access.
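The steps just demonstrated can be sketched roughly as follows. This is a minimal sketch, not code from the demo: the project path and the `flights` table name are illustrative assumptions, and it assumes it runs inside RStudio Server on the Databricks driver node.

```r
# Sketch of the hosted workflow: RStudio Server running on the cluster driver.
library(sparklyr)

# DBFS is backed by cloud storage and mounted on every node, including the
# driver, so work saved under /dbfs persists across sessions.
setwd("/dbfs/home/my_user/project")   # hypothetical path

# On a Databricks cluster, connect to the existing Spark session directly.
sc <- spark_connect(method = "databricks")

# Reference a Hive metastore table lazily, without pulling rows into R.
flights_tbl <- spark_read_table(sc, "flights")
```

From here, `flights_tbl` behaves like a remote data frame that dplyr verbs can query against the data lake.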

And that's pretty much it. This is how we can have RStudio Server hosted on a Databricks cluster, with access to Spark and access to the data lake, and we can build R applications that use Spark here. We can also test out and work with Shiny apps that access data in the data lake here.

So now that we've seen how RStudio Server is hosted on Databricks, and how we can access Spark and the data lake through that architecture, let's take a look at the second development pattern. In this case, we have RStudio with Databricks Connect, so we're not actually hosting RStudio on a Databricks cluster. Instead, we're going to use an instance of RStudio Server Pro that is remote, outside of Databricks. We'll still be using sparklyr to run commands on Spark and access data in the data lake, but the way we're actually going to connect to that Spark cluster is through Databricks Connect. What Databricks Connect lets you do is this: by installing the client library on your laptop or on the remote server that has RStudio on it, you can authenticate with a Databricks cluster, and your local machine essentially becomes the driver for the Spark cluster. So this lets you work with your local IDE while still submitting commands that run in the cloud on Databricks, accessing data in your data lake.

So now I'll show you a demo of that. Let's take a look at how you would set up a remote connection to a Spark cluster on Databricks, so that you can develop using your laptop or another remote RStudio Server instance, but still have access to the data lake and be able to develop scalable R and Shiny apps. We're going to use Databricks Connect. From a cluster point of view, really the only change we have to make is in the Advanced Options, where we set one Spark configuration: spark.databricks.service.server.enabled to true. There's documentation on how to set this all up, but that's pretty much the only thing we need to do on the Databricks side. If we take a look locally now, I've installed the Databricks Connect client on my Mac. My Mac is now connected to the Databricks cluster, and it's functioning as the driver. I've already set up the connection with sparklyr, and I've already read in an airlines dataset, pointing to cloud storage and the name of that table. So I'm accessing my data lake and reading in Delta Lake tables, but working with them locally. The data is being processed in Spark, but I actually get the end-user experience of working on my local machine, with my settings and things like that. So let's take a look at some of the things we can do here. sparklyr has wonderful integrations with dplyr, so I'm actually just going to load dplyr and then run a count on the number of records in this dataset.
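A minimal sketch of the local setup described here, not taken verbatim from the demo: it assumes the cluster config above has been applied and that the databricks-connect client has already been installed and configured on the laptop (for example via `pip install databricks-connect` and `databricks-connect configure`). The `airlines` table name is illustrative.

```r
# Sketch of the Databricks Connect pattern: the laptop becomes the Spark driver.
library(sparklyr)

sc <- spark_connect(
  method     = "databricks",
  # Databricks Connect reports the Spark home it manages locally
  spark_home = system("databricks-connect get-spark-home", intern = TRUE)
)

# This lazy table reference points at the data lake; queries against it
# execute on the remote Databricks cluster, not on the laptop.
airlines <- spark_read_table(sc, "airlines")
```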

And it looks like we have 1.2 billion records in this dataset, and that ran pretty quickly. Let's do a little aggregation here: let's take a look at the number of flights by month for every carrier, for every airline.

And I should take a look at that. You'll notice when these Spark jobs are running, if they're longer-running Spark jobs, you'll see a little progress bar pop up here. And there we go. So we were able to quickly aggregate that data in our data lake, over 1.2 billion rows, in a matter of seconds. So how does this really work? How does this dplyr and sparklyr integration work? Just as an aside, if you pass this logic to dbplyr's sql_render, you'll see that it actually gets translated into SQL. Your dplyr code gets translated into SQL and then passed to Spark, where it's executed as Spark SQL, taking full advantage of the Catalyst optimizer. And this is one of the ways that you can get great performance. So we're bridging the gap from R to the power of Spark and all the optimizations that are going on there.
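The count, the aggregation, and the SQL-translation aside can be sketched like this. It's a hedged sketch, assuming a Databricks Connect session; the `airlines` table and its column names (`UniqueCarrier`, `Month`) are illustrative, not confirmed from the demo.

```r
# Sketch of the dplyr-on-Spark workflow just described.
library(sparklyr)
library(dplyr)

sc       <- spark_connect(method = "databricks")
airlines <- spark_read_table(sc, "airlines")

# Executed as Spark SQL on the cluster; only the count comes back to R.
airlines %>% count()

# Flights by carrier and month, still evaluated lazily in Spark.
monthly <- airlines %>%
  group_by(UniqueCarrier, Month) %>%
  summarise(n_flights = n())

# Inspect the SQL that dbplyr generates and ships to Spark's Catalyst optimizer.
dbplyr::sql_render(monthly)
```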

Now, let’s say we actually wanted to do some visualization on this data. A common pattern is to do the aggregations, the heavy data processing in Spark, but then collect the aggregate results back to R for plotting. And this works quite well.

So for developing R apps, or for building Shiny apps, this is a great pattern: you can query the data at scale, and then build the aggregates and visualizations that you want accordingly.
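The aggregate-in-Spark, plot-in-R pattern looks roughly like this. A minimal sketch, with the table and column names assumed for illustration: the heavy grouping runs on the cluster, and only the small summarised result is collected into R for plotting.

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

sc       <- spark_connect(method = "databricks")
airlines <- spark_read_table(sc, "airlines")

plot_data <- airlines %>%
  group_by(UniqueCarrier, Month) %>%
  summarise(n_flights = n()) %>%
  collect()                     # only the aggregated rows cross into R

ggplot(plot_data, aes(Month, n_flights, colour = UniqueCarrier)) +
  geom_line()
```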

There we go.

Okay, so just before I wrap up, let's go back to the cluster UI and actually take a look at the Spark UI here. You'll see that all these commands that were run were executed from my machine; this is my Mac. And you can see that we have this remote connection to Spark on a Databricks cluster, and we're accessing the data lake much the same way we did before, when RStudio Server was actually hosted on Databricks.

And that's pretty much it. So to quickly summarize, I've shown two different ways that you can develop scalable R and Shiny apps on Databricks: with a hosted-on-Databricks solution, as well as through Databricks Connect. And now, what do we do once we've actually developed some of these applications? How can we share them with others? For that story, I'll turn it over to James. – Hey, thanks, Rafi. As was mentioned, we're going to be talking now about how we transition from the development work that we've just done to sharing this work in a way that's easily accessible to other individuals within an organization. As an R user, a common tool for doing that is the Shiny framework, the ability to create and develop interactive web applications using R. I think it's useful to take a step back for a second and understand how we've gotten here. This diagram comes from Hadley Wickham's R for Data Science book, and it captures the data science process: we take some data and try to understand it, and that understanding is an iterative process of looking at the data, visualizing the data, cleaning the data, and repeating through this cycle until we develop a clear enough understanding that we have something we're prepared to communicate. Now, in most cases R users will be using the RStudio IDE for this process of understanding data, but as Rafi mentioned, as the size of data grows larger, relying on R alone as the computational engine starts to become restrictive at a certain point. And so in this case, what we're looking at is bringing in Databricks as a back-end engine to support this entire process.
And once we've arrived at the end, as I mentioned before, that's where Shiny comes in: it can be used as a tool to share the knowledge we've gained through the process of understanding the data with other business users, who maybe haven't been as directly involved with the data as we have, but have a similar interest in learning from what's available inside that data. Now, it's important to understand that as we've gone through this process, we've used, as Rafi demonstrated, a couple of different ways of combining RStudio and Databricks: we either have RStudio hosted inside of Databricks, or we use something like Databricks Connect to let us interact with Databricks from a location outside the Databricks environment. And this works really well for this type of interactive analysis. However, there's a little bit of a challenge, and you may be familiar with it if you're an R user who has paid attention to the storyline over recent years: Shiny as an interactive framework, when combined with Spark as a back-end engine, has traditionally been difficult to get right.

Shiny and Spark: A Cautionary Tale

It's certainly possible to set this up and make it work, but there are often better alternatives, or at least other alternatives that should be explored, because of the issues and potential challenges that can arise when using Spark and Shiny together. Now, as we think about the story of Databricks and RStudio, the question is: if I've gone through the effort of building out my analysis using something like sparklyr, with Databricks as my computational engine, what options do I have when I want to transition to something shareable, like a Shiny application? And we're happy to report that there's another alternative here that's really quite easy to implement, and that is using ODBC as an interface to Spark. Databricks provides an ODBC driver to customers, a Spark ODBC driver, and this driver can be used with the robust tools that already exist in R around ODBC connections. There's been a lot of work done over the past couple of years to make connecting from R to external data resources via ODBC very robust and very stable, and those tools can be used to create a connection to Spark using this ODBC driver. The other advantage here, and we'll show this in just a moment, is that this connection has proven to be just as performant as a native Spark connection that you would have through something like sparklyr, or from inside the Databricks environment itself. And as an added bonus, the migration process of transitioning from a code base that relies on a Spark connection to one that relies on an ODBC connection is fairly straightforward; there's not a lot that needs to go into making that change.

To give you an idea of what this looks like, as Rafi has shown a few similar-style diagrams: on the left-hand side here we have RStudio Connect, which is an application used to distribute and share things like Shiny applications and other resources within an organization.

ODBC with RStudio Connect

And here within RStudio Connect, we can use Databricks, either through these ODBC connections or other client libraries that may exist, to provide a connection into a Databricks environment.

To give you an idea of the performance comparison between ODBC and Spark, we ran a few different benchmarks, looking at general queries along with some different join types and things like that, comparing the performance of a connection made with sparklyr and a connection made with this ODBC driver. On the left-hand side, we have a diagram highlighting collecting data from Spark back into an R session, and on the right-hand side we see the performance of a similar style of operation, but involving joins between two different tables within Spark. The top row of data represents a native connection made through sparklyr, and the next two entries in both of these diagrams represent two different versions of the ODBC driver available from Databricks: the current version and an upcoming version that we've benchmarked as well. What's great about this is that, as you can see from the middle entry in both of these plots, the updated driver, which is what that entry refers to, is just as performant over ODBC at summarizing, aggregating, and performing data analytics tasks as a native Spark connection.

Sparklyr to ODBC

The other piece here is what it takes to migrate between a code base that's built on top of sparklyr and code that relies on an ODBC connection. Here I have, side by side, two different snippets of code. On the left-hand side, we're using sparklyr to connect to a Spark environment; on the right-hand side, we're using ODBC to connect to that same Spark environment. And what's great is that if we highlight the differences between these two pieces, they're very minimal. The code that I use to connect changes slightly between the two, and the code that I use to disconnect changes slightly between the two. Everything else remains the same: the way that I interact with my data, the functions, the techniques that I use to summarize and understand my data remain identical. The only thing that's really changing is how I choose to connect to and disconnect from, in this case, my Spark environment on Databricks. All right, so now that we've looked at what this pattern might look like, let's take a look at a demonstration of how these pieces combine to go from development work that takes place in RStudio using Databricks Connect to something like a Shiny application that can be widely distributed and uses ODBC to connect to and interface with Spark. Okay, here we are in RStudio Server Pro, and we've connected to Spark inside of Databricks using Databricks Connect. We can see that here: we're using spark_connect with the method defined as "databricks", and we've set spark_home to the value that Databricks Connect reports when we run it from the command line. Once we've done this, we can see that we're connected to Spark up here in the top right-hand corner. And here we can do our interactive analysis. We can point to a table in Spark, and we can figure out how many records are in that table if we want to. In this case, we're pointing to a collection of flights from 1987 to 2008.
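The connect/disconnect difference described here can be sketched side by side. This is an illustrative sketch, not the slide's exact code: the DSN name `"Databricks"` and the `flights` table are assumptions, and it assumes the Databricks (Simba) Spark ODBC driver is installed and configured.

```r
# sparklyr version: native Spark connection
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
tbl(sc, "flights") %>% count()        # dplyr analysis against Spark
spark_disconnect(sc)

# ODBC version: same analysis, only connect/disconnect change
library(DBI)

con <- DBI::dbConnect(odbc::odbc(), dsn = "Databricks")
tbl(con, "flights") %>% count()       # identical dplyr code
DBI::dbDisconnect(con)
```

Everything between the connect and disconnect lines stays the same, which is what makes the migration cheap.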
We have 167 million records in this collection. We can look at the first few rows here if we want to get an idea of what's contained. We've got the month and the day of the month; here at the end we've got the year, the airline, and a bunch of additional information around the flight: how long it was in the air, whether it was delayed, the origin, the destination, the distance that was traveled, and so on. And then we can do our regular data analysis tasks. We can check how many records, or how many flights, there are per year. One of the nice things is we can perform this summary, which is just going to return 22 rows, and then we can collect that data into the R session. That's what this collect function does, and it makes this year-counts object a local data frame containing those 22 rows of information. Once we've got that, we can plot this information using our standard R toolkit. Again, this is the notion of going through and developing our intuition around the data set we're working with, with visualization as a powerful tool for understanding data. But in this case, we've got way more data than we could reasonably bring into our session. We're not just going to pull everything from Spark into R; we want to keep as much data as possible in Spark, allow the execution to take place there, and then, when we're ready, bring the summarized pieces back into R for further visualization or analysis. And something I think is kind of interesting: there's a neat package called dbplot that will actually push the summary computation for various summary plots and visualizations to a back-end system such as Spark.
So in this case, if I want to view the distribution of departure delay times by airline, instead of bringing all the data in (again, I'm trying to avoid bringing hundreds of millions of records into R), I can push all the computation to Spark, allow Spark to do all the summarization it needs to create this box plot, and then the dbplot_boxplot function will create and render a ggplot object based on the summary that comes back. And so we can iterate through and start forming an analysis; we can start to build some sort of a hypothesis. Maybe we want to compare airlines and see some differences that exist. We could look at Southwest and Delta, and then calculate, for each of those airlines, how many flights they had for each month in each year in our data set. And then, if we wanted to, we could take that information and plot it, and we'd get a visualization that looks kind of like this. Just going through this was pretty interesting for me: at least in the data set that we have, we can see how Delta was more prominent in the 1980s and 90s, but toward the mid-1990s Southwest started to catch up in terms of total number of flights. Again, this is just the flights represented in our data set. And then here in the early 2000s, we see that Southwest starts to become the more prominent airline in the data set we're currently working with. So, kind of an interesting thing that we can see here. At this point, we might think: okay, this is maybe an interesting way of looking at this data; is there a way we could open this up so that others could take a look at it? And as was mentioned earlier, what we can do is transition from using the Spark connection to using ODBC.
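The dbplot step can be sketched as follows. A minimal sketch under assumptions: the table and column names (`airlines`, `UniqueCarrier`, `DepDelay`) are illustrative, and it assumes an active sparklyr session.

```r
library(sparklyr)
library(dplyr)
library(dbplot)

sc      <- spark_connect(method = "databricks")
flights <- spark_read_table(sc, "airlines")

# Spark computes the quartiles and whiskers; dbplot renders the returned
# summary as a ggplot object, so no raw rows are pulled into R.
flights %>%
  dbplot_boxplot(UniqueCarrier, DepDelay)
```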
And that would allow us to build a Shiny application on top of this data, so that other users and individuals within the organization could come in, view this data, and work with it. So that's what we've done here: we're using the odbc package, we have a DSN defined called Databricks that points to this Spark environment, and I'm using the pool package so that we can more intelligently manage connections within a Shiny application. After that, all of the code here is essentially the same. You'll see that I've got my plot data defined here, and this is the exact same thing we looked at before; instead of pointing at my Spark context, I'm pointing at this pool connection that I've created. Other than that, everything here is the same code as what we saw previously in our exploratory analysis. And if we run this application, we'll see a version of it pop up here in just a second. This application allows us to go through and investigate what we were already investigating in our markdown document. We can select a couple of different airlines. It takes a minute here to initialize the connection. But we can investigate a couple of different airlines, or we could come in and select additional ones if we wanted to. So if we want to throw in Alaska (I used to fly Alaska all the time), we can throw Alaska in here and hit go, and this will rerun the query. It's submitting a new query to Spark, Spark is executing that query and aggregating the data, and millions and millions of records are being evaluated. Once those records have been evaluated, the results are returned back into R and rendered here in our Shiny application. And so we're looking at, in this case, 35 million records, an average flight time of just over an hour and a half, and an average departure delay of eight minutes for these three combined airlines.
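A minimal sketch of the Shiny pattern described here, not the demo's actual app: a pooled ODBC connection to Spark, with the same dplyr code as the exploratory analysis. The DSN `"Databricks"`, the `flights` table, and the column names are assumptions for illustration.

```r
library(shiny)
library(dplyr)
library(pool)

# pool manages ODBC connections sensibly across Shiny sessions
db <- dbPool(odbc::odbc(), dsn = "Databricks")
onStop(function() poolClose(db))

ui <- fluidPage(
  selectInput("carrier", "Airline", choices = c("WN", "DL", "AS")),
  actionButton("go", "Go"),
  plotOutput("flights")
)

server <- function(input, output, session) {
  plot_data <- eventReactive(input$go, {
    tbl(db, "flights") %>%                      # same dplyr code as before,
      filter(UniqueCarrier == !!input$carrier) %>%  # now against the pool
      count(Year) %>%
      collect()                                 # Spark aggregates; only the
  })                                            # summary rows return to R
  output$flights <- renderPlot({
    plot(plot_data()$Year, plot_data()$n, type = "l",
         xlab = "Year", ylab = "Flights")
  })
}

shinyApp(ui, server)
```

The design choice is the same as in the exploratory analysis: keep execution in Spark and collect only the aggregated result each time the user hits go.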
And then we can see comparatively how their number of flights compares across time in this plot right here. Now that I've got this running locally, I could even go so far as to publish this somewhere too, like RStudio Connect. So I could come in here, click publish, specify that I want to publish this to RStudio Connect, and go ahead and publish it. I've already done this, so if we come in here and open up RStudio Connect, we have a published version of this document. In fact, if I just open this up, here's a URL that anybody could go to and view this particular application, and we have the same thing available here. The way that works is that RStudio Connect has picked up on all of the dependencies of my project and made those dependencies available, and I have defined an identical DSN on this particular server, so that when I'm querying this data, it knows where to look for it. Again, in this case that data is in Spark, hosted by Databricks. So I can come in here, select American Airlines, and hit go. Again, this is going to submit a new request back to Databricks, Spark is going to execute that request, and when the results are available they'll come back into my R session. Once my session receives those results, they'll be made available here in this dashboard. So there's a little bit of a lag, just because of how many records we're considering and what's going on. But all things considered, this pattern works really, really well if I have a lot of data living inside of Databricks and I want to be able to leverage it from within something interactive, like this Shiny application. And now that I have this built, I could easily distribute it within my organization so that others can come in and ask questions of their own; maybe I want to know how specific airlines compare, or for a specific timeframe, maybe I'm only looking from 2000 onward.
And we can come in and give them the ability to execute on this incredible amount of data without needing to necessarily worry about how that execution is taking place. And also, without me needing to worry about how I might get all this data from Databricks to some other environment. I can keep the data in Databricks, I can allow Spark to execute on that data, and as an R user, and as an R developer, I have all the tools available to me to be able to take what Spark has given me, the results that I’ve gotten back from Databricks, and render those here as part of my Shiny application, such that other users within my organization can come in, ask questions, and get to the answers that they’re particularly looking for.
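For reference, the sparklyr-based Spark connection used during the exploratory analysis (as opposed to the ODBC route just shown) looks roughly like this. The Databricks Connect variant is a sketch based on the documented legacy setup, and the model call is illustrative, not code from the talk:

```r
library(sparklyr)
library(dplyr)

# Inside a Databricks-hosted RStudio session, sparklyr can attach
# directly to the cluster's existing Spark context:
sc <- spark_connect(method = "databricks")

# From a local RStudio session with (legacy) Databricks Connect
# configured, point sparklyr at the Spark home it provides:
# sc <- spark_connect(
#   method = "databricks",
#   spark_home = system2("databricks-connect", "get-spark-home",
#                        stdout = TRUE)
# )

# Advanced operations that ODBC doesn't expose, such as fitting a
# Spark ML model, become available through the native API:
# fit <- ml_linear_regression(tbl(sc, "flights"), dep_delay ~ distance)
```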

All right, so now that we’ve taken a look at this, let’s conclude by taking a look at some best practices, because I think it’s important to understand when to use the tools that we’ve discussed here in our time together. On the left hand side, this is development work. I think of development work as interactive, iterative work that’s often being done inside of a development environment, for example something like RStudio, where I’m trying to understand my data. And in this case, using a Spark connection managed through sparklyr and Databricks Connect, or using a Spark connection while running within the Databricks environment, are great options. Along with the ODBC option that we’ve already talked about, all of these can be used to understand and work with data. As you start looking at more advanced techniques within Spark, like building out machine learning models and doing some more advanced analytics, a native Spark API, something like sparklyr, becomes a more attractive option, because those more advanced operations are often not exposed through the ODBC interface. And so as you work through development, using sparklyr is often a great way to start. And again, you can do that using Databricks Connect as we’ve outlined here, or natively from within the Databricks environment. On the other side of the story, when we go to deploy this, like we did with Shiny,

deploying something that now provides some sort of interactivity and a user interface layer on top of something like a Spark back end becomes a little bit more involved, because there are some additional considerations to be made. And for that reason we recommend that if you’re going to build out Shiny applications, and you want to leverage Spark as a back end computational engine for those applications, the ODBC route is a very stable, very robust route to follow. Now, if you’re looking at deploying other things beyond just interactive applications, for example if you’re building and training machine learning models and looking for ways to deploy those, we’re actually still in the process of figuring out what the path forward there looks like. There are a few different options: MLflow provides a great option for deploying and managing models, and there might be ways of using the Databricks API natively to manage and work with models that have been deployed. Rafi has an R package that he’s been working on called bricksteR that provides some R functionality around submitting jobs to Databricks and interacting with the Databricks API from an R user standpoint. And so there’s still some work being done to determine how we best deploy things that aren’t necessarily interactive like Shiny, but rather maybe a machine learning model or something like that we’ve trained inside of Spark. How do we now transition that and make it more widely available? That’s something that we’re continuing to look at and provide best practices around, so stay tuned for further updates there. And then if you’d like to learn more, there’s a collection of different resources you can visit.
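One hedged sketch of the model-serving pattern mentioned here: loading an MLflow model in R and exposing it through a Plumber API that could itself be published to RStudio Connect. The registry path models:/flight-delay/Production is hypothetical, and loading a registry URI assumes the MLFLOW_TRACKING_URI environment variable points at the Databricks workspace:

```r
# Save as plumber.R and run with: plumber::plumb("plumber.R")$run(port = 8000)
library(plumber)
library(mlflow)

# Hypothetical model previously registered in the MLflow Model Registry.
model <- mlflow_load_model("models:/flight-delay/Production")

#* Score incoming records with the loaded MLflow model
#* @post /predict
function(req) {
  # Expect a JSON body containing the feature columns the model was
  # trained on; mlflow_predict applies the model's pyfunc/crate flavor.
  newdata <- jsonlite::fromJSON(req$postBody)
  mlflow_predict(model, newdata)
}
```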
On the left hand side here, we’ve got links to different pieces of documentation about getting RStudio set up inside of Databricks, using Databricks Connect, different pieces around ODBC and getting that all configured, as well as some general guidelines around what RStudio Connect is, what sparklyr is, and things like that. On the right hand side, there’s a link to the content that we’ve used in this particular talk. That includes all the code that we’ve walked through, and then some generic outline instructions that provide additional guidelines around getting things set up. Rafi’s package is over there, as well as a couple of additional repositories that you can keep an eye on to see what changes are coming. But we appreciate everybody’s time in joining us today. As always, please remember to provide feedback for the session; that feedback helps us, Databricks, and the folks who organized this to know how to continue to provide the best possible experience. And we appreciate your time.

About James Blair


James is a Solutions Engineer at RStudio, where he focuses on helping RStudio commercial customers successfully manage RStudio products. He is passionate about connecting R to other toolchains through tools like ODBC and APIs. He has a background in statistics and data science and finds any excuse he can to write R code.

About Rafi Kurlansik


Rafi is a Sr. Solutions Architect at Databricks where he specializes in enabling customers to scale their R workloads with Spark. He is also the primary author of the R User Guide to Databricks and the bricksteR package. In his spare time he enjoys gardening with native plants, cooking up a storm, and long video game sessions with his three children.