Deep Learning at Scale with Apache Spark and Determined

Download Slides

Despite its enormous potential to enable new applications, deep learning remains prohibitively expensive, difficult, and time-consuming for the vast majority of companies. Training DL models at scale is particularly challenging: training a single model can take days or weeks, and DL engineers are often forced to spend much of their time doing DevOps or writing boilerplate code to handle routine tasks like data loading, distributed training, or fault tolerance.

In this talk, we introduce Determined, an open source platform that enables deep learning teams to train models more quickly, easily share GPU resources, and effectively collaborate. This talk will include an overview of the problems that Determined aims to solve, the high-level architecture of the system, and show how Determined and Spark can be used together effectively. We’ll also dive deep on some key technical features, such as:

  • Distributed training without changing your model code
  • Intelligent hyperparameter search
  • Flexible GPU scheduling, including automatic management of cloud GPU instances
  • Automatic fault tolerance and checkpoint management
  • Seamless integration into the Spark ecosystem, e.g., for performing ETL or model inference.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Great, well welcome to our talk, which is called Deep Learning at Scale with Spark and Determined. I am Neil Conway, and I’m joined by my colleague David Hershey. So Determined is a deep learning training platform. And the goal of the product is to enable teams of deep learning engineers to train better deep learning models in less time to collaborate more effectively, and to seamlessly share GPU resources. So in this talk, we cover two main topics. First, we talk about what Determined is and why you might need a training platform. And second, we explore how you would integrate Determine into the rest of your machine learning workflow in particular, and the Spark you use if for other parts of your machine learning workflow, how Determined can be integrated with those other components.

Typical Deep Learning Workflow

So let’s take a step back and present maybe a very typical slide that you might see on a lot of deep learning talks, which is what are the steps involved in doing Deep Learning at Scale? So you typically might start with your raw training data, labeled examples that are appropriate for the problem that you wanna solve. And then you have some kind of transformation to take that raw data and transform it into a more appropriate format for doing machine learning, you might store that transformed data in a data lake. Then comes the actual model training development process where as an ML engineer, you explore different model architectures, different data representations, different optimization techniques, to try to get a model that works well on the problem you’re trying to solve. And then once you have an effective model, you deploy it to where you wanna do inference. So you might be doing real-time serving, or batch inference, or serving on a mobile device.

So if we think about those as kind of the basic steps of doing deep learning today, how do those aspects of that workflow map to the modern software ecosystem?

The advanced slide, okay.

So once you have your raw data, you wanna transform it to a data lake Spark is a great tool for doing high performance ETL at scale. And then once you’ve transformed your data, and you need to know where to store it, something like Delta Lake is a very good choice for a mutable version storage of deep learning, training sets. If we switch over to the serving side, when we think about how we wanna do real time inference, there are several real-time model serving, there are several good machine learning serving packages, Seldon or TensorFlow serving are two examples. And then if we wanna be inference Spark again, is a great choice for doing batch inference at scale. And that really kind of returns us to the middle part of this diagram, which is what does our model training and development environment? What does that look like? What are the requirements for doing that well? So I’ll make the case in this talk that Determined is a good invitation at that middle part of the deep learning workflow.

The Determined DL Platform

So as I mentioned, in this talk, kind of two main parts will go into a bit more about what a deep learning training platform ought to do. And some of the features of the platform we built at Determined, and some of the benefits that deep learning engineers can see by using it. And then the second part, we’ll walk through a demo of how you integrate Determined into the Spark ecosystem. So in particular, how you use it with Delta Lake, and Spark on the input side, and how you use it to do batch inference with Spark on the output side.

So if we think about the problems that we’re trying to solve with a deep learning training platform, I think an actual question is, aren’t those challenges of how I trained deep learning models aren’t those really solved effectively by tools like TensorFlow and PyTorch? And I think the tools like TenserFlow and PyTorch are great, great tools, and they’re very effective at what they’re trying to do. But ultimately, those tools are designed to enable a single deep learning engineer, to train a single model, on a single GPU, or a handful of GPUs.

But as your team size, as your cluster size, or as your data set size, all increase, we find that teams that are in that situation frequently run into challenges that are really outside the scope of tools like TensorFlow or PyTorch. So for example, as your team grows, as your cluster size grows, you might move from having one ATP a machine that a single researcher uses to many GPU machines, like a whole cluster. How do you share that cluster effectively among a team? As you get larger models and more GPUs, using many GPUs training one model becomes an increasingly important challenge. How do you do distributed training? That’s something that’s not well solved within those existing packages. Similarly, how do you manage your models? How do you manage your metrics and training data? And especially as you go from just training a small number of models in a research setting, to training many different models and pulling those model’s production, metrics management, and model management becomes a much more important concern. Similarly, tools that tasks that you might do manually, like hyperparameter search at small scale, suddenly become increasingly painful if you’re training many different models and becomes increasingly important to automate those tasks with integrated support for hyperparameter search in an efficient way.

So in our view, there’s really cause to have a training platform as a kind of piece that sits between your application frameworks, tools like TensorFlow, PyTorch, and Kerris, and the hardware that you’re doing training on. So that might either be an on premise GPU cluster, or a cloud environment with cloud hosted GPUs on, cloud providers like AWS, GCP, or Azure. And again, the goal of that training platform is to work with your existing data. So Determined supports data in a wide variety of formats. And to enable models trained within the platform to be exported to the environment where you wanna use them to do inference. But really what it’s doing internally is allowing the deep learning engineers on that team to work together more effectively, to train models more quickly, to collaborate, and to spend their time really doing deep learning. And kind of training better models, and spend less time on DevOps, or on writing boilerplate code for a common tasks like fault tolerance, or distributed training.

So just a little bit more detail into the kind of functional areas within the architecture of this system. You can think of Determined as essentially a kind of integrated system that combines job scheduling, and GPU management, distributed training, hyperparameter tuning, and experiment tracking, along with visualization, all into kind of an integrated package. And the idea is to enable to design each of those tools in a way where they work together smoothly, so then as a deep learning engineer, you can get access to the best of breed functionality in these areas. And really get to spend more of your time on the problems that you care about deep learning and less time, kind of pulling different packages together to form your own ML platform.

So if that’s roughly what the architectural system looks like, what benefits do you see as a user? So wanna talk about kind of three key areas where we think that Determined is able to enable you deepen teams to be more productive than they otherwise would be. So when it comes to hyperparameter optimization, that’s something that is kind of built deeply into the platform. So when we initially built Determined, we started immediately by thinking about when you wanna do hyperparameter search, what’s the right system support for that? What’s the right GPU scheduler support for that? What should fault tolerance and metrics management look like? If hyper parameter search is gonna be given the ability one of my co-founders at Determined is Mutola Carr, who’s a professor at Carnegie Mellon and his research group have invented a number of methods for very efficient hyperparameter search algorithms. And hyper bandits, probably the best known of those, and those methods are directly integrated into the product. And what we found is that using really intelligent search methods, and then designing a system around them leads to very efficient hyperparameter search. So in our experience, we can find high quality models up to 100 times faster than standard methods like random search, and up to 10 times faster than research methods based on Bayesian optimization.

Second key capability is distributor training. So in Determined, you configure distributor training or configure the number of GPUs you wanna use to train a model, just by changing the configuration setting, you just tell the system I wanna train this model with 116 or 64 GPUs. And Determined takes care of scheduling that task on the cluster, orchestrating their supervision and operation, actually doing the gradient updates and so on ensuring that this whole process is fault tolerant.

As a user, you don’t think about any of those manual operations. So I think that has two benefits. First, a distributor training becomes a tool you can use much more easily rather than setting up a tool like Horovod by hand every time you wanna use it, or figuring out how to use it in a multi-user setting, this is kind of a built in capability you can use whenever it’s helpful. And second, we have a bunch of optimizations on top of stock horovod, that lead to significant performance improvements.

Determined: Key Capabilities and Benefits

Lastkey capability are basically, a collection of tools that we provide for deep learning engineers that allow them to focus on doing deep learning, rather than managing infrastructure or writing boilerplate code. So some of those things are every workload within our system is automatically fault tolerant. So as you’re running these workloads that might take days or weeks to run, we take care of checkpointing models and periodically, if any failures occur, we restore the state of the model and resume training for the most recent checkpoint. We’ve built the experiment tracking visualization, we have a job scheduler that fair shares the cluster so that multiple users can very easily share a GPU cluster and make progress at the same time. And we really simplify GPU management both on premise and in the cloud, where we can provision GPU instances for you automatically.

So that was a kind of a quick overview of what the Determined is, what a deep learning training platform is, and some of the key capabilities of the system. Now I’ll turn turn things over to my colleague, David, who will give you a demo of how to use Determined with the Spark ecosystem. – Thanks, Neil, I’m excited to go through this demo, it’s really showing how Determined could fit into an end to end ecosystem for machine learning. And what I mean by that is being able to build sort of reproducible pipelines, where everything from what data you used, what model you used, and how you trained that model, how you did inference that’s all easily reproducible, trackable, and really seamless. And so, what we’re gonna be doing today is some amount of ETL using Spark to land image data into Delta Lake. That’s going to be really useful for us to be able to version our data in Determined.

And so we’ll be able to see exactly what version of the dataset we use to train our model at Determined. We’re gonna focus a lot on Determined and how we can use it to scale our experiments beyond as Neil said, sort of the single GPU, single researcher scheme to working with a team on a whole GPU cluster, and the really cool things you can do when you have a cluster for deep learning. And finally, we’re going to show how Determined can sort of manage the artifacts of training to create what I’m gonna refer to as a versioned model. And you can then use that versioned model to export it and do inference, in this case in batch using the Spark ecosystem again.

The example we’re gonna be using for this is a object detection model. In particular, the dataset we’ll be using is the VOC 2012 data set if you’re familiar. And so object detection is basically trying to draw bounding boxes around objects in an image, this is very much a deep learning problem, at least right now. All of the best methods are deep learning. And so we’re really gonna run into a lot of the difficulties that Neil described when he was talking about teams doing deep learning. For this demo, I’ve landed all of these images into a Delta table. Basically, we had them as raw images in S3, and I used Spark to land them into a Delta table, where it’s images, file names, and annotations, which is basically all of the bounding boxes, and some other information about that image. And first, before we dive into Determined, I wanna talk about what this would look like without Determined. If you just wanted to build an object detection model with sort of the standard tools out there. And so this is my version of what that looks like. And it’s worth diving into, ’cause it’s still pretty cool. This is a deep learning model I’m training it using PyTorch I built a faster R-CNN model with a resonate backbone in PyTorch training it in this notebook.

The models here did some cool stuff to pre-process the data, load the data, wrote a fancy data connector that can connect to a Delta table, load that data, creates a PyTorch data set that we can use to do all of our pre-processing and that kind of thing. Training the model here, saving checkpoints, doing some cursory amount of logging to keep track of things like mean average precision, and to see our progressing through training. And what I wanna really get at is this works fine, especially for one person, I’m running on a fancy V100 GPU, so this is running as fast as you can on one GPU, saving the checkpoint. If I train this model, I can copy that file somewhere, paste it somewhere else, use it for Spark inference, say something like that. And it’s all possible.

But the core of what Determined does is makes it easy to take that and scale it to do larger scale experiments, where you can iterate more quickly, and also work way more effectively with a team on this kind of thing. And so for the sake of this demo, the situation I want you to imagine is that my co-worker actually already built an object detection model, on this data set. They built the model, they trained it, they used it even to do some sort of inference in Spark, so that we could score some data for whatever use case our company has, and that got done. And then a month or two or five went by, and my manager came to tell me that I needed to take that model, and try to find some way to make improvements. They said, if we can improve it by just a few percent, it’ll be worth a lot of money to the company. So what can you do? And so that’s the story we’re gonna tell and we’re gonna start by looking at how my co-worker trained the model, in the first place. And the first thing I wanna imagine, you’d imagine is if they trained the model this way without Determined, in the notebook, the extent of their logging is saved in this notebook, the checkpoint hopefully got saved somewhere, but maybe I don’t know, where.

I don’t really know what hyperparameters they used, hopefully it were these ones. But if they were trying out a bunch of stuff, I don’t have any way to track that, or associate that with the model they did. To understand their code, or to run their code, I need to figure out what it’s doing. I need to figure out what script they ran, I need to figure out what environment they constructed to run it. It’s really, really a nightmare. And this comes up a lot at a lot of companies to try to run this kind of code without the right tools. Now, luckily at my company, my co-workers that had trained this model used Determined to train the model. So what that means is they wrote all that same model code, but instead of, coding up some experiment they basically told describe to determine the experiment to run with this config file that I’m showing you right here. So it’s worth diving into this config file to see what they were doing. The first thing we’ll see is they were loading data and this is cool. They were using a connector to Delta Lake to load a version of data for both training and validation. So they loaded their version zero of the data. They’re doing a really cool hyperparameter search, the one that Neil described to you in our overview, that’s searching over simple in this case, but learning rate and momentum, to really popular hyperparameters. And so they’re gonna be doing a search over those to find a good configuration.

They’re doing this with the adaptive search algorithm that Determined has. This is based on the popular hyper band algorithm. It’s actually one of an algorithm that one of our co-founders invented at Determined and this hyper band algorithm is a state of the art hyperparameter tuning algorithm. It uses principled early stopping, which means it tries a whole bunch of hyperparameter configurations all at the same time. But the ones that aren’t performing well, it kills off very early, so that the more promising hyperparameter configurations, have more time to train. We’ll dive into this experiment to see how that actually looks. But leave it to be setting out that it’s a really sophisticated hyperparameter search.

So just because our co-worker sent us this file, they wouldn’t even need to, but just from this file, we can see exactly what they did. They trained the model on some version of the data, they did a sweet hyperparameter search. And let’s flip over to Determined to figure out exactly what that ended up looking like. So this is that experiment in Determined. And what you’ll see is, the first thing you’ll notice is the performance of the experiment. This is the evaluation made metric that tracked over time throughout that experiment. So mean average precision is this metric. It’s very commonly used to evaluate the accuracy of object detection models. And you’ll notice they got a MAP of about point five two, which to just take an aside, I wanna point out for science, I actually ran this with the recommended hyperparameters for faster our R-CNN the recommended values of learning rate and momentum that the authors of the paper suggested. And I actually only got point four five mean average position. And so just right out of the bat, by using Determined, our co-worker was able to build a model that was way better than they would have gotten if they just sort of used the stock option. So hyperparameter search already got us a long way to a really high performing model, when my co-worker trained this model, say a month ago. Now, coming back to myself, and now this is a really helpful view for me to understand exactly what happened. First off, I can look and see exactly what configuration they used. You’ll notice this one to one mirrors the one that we were looking at in this notebook. So no questions about how they train this model, we know exactly the configuration they used here associated with this experiment page. Further, we can actually download the code they used to train that model so I can get an exact copy when they ran this model of the code that was used. And so there’s no question marks of what version, what code was used, what configuration was used, I should be able to take those things, and recreate this experiment right away, if that’s what I want to do.

So I can also click into any one of these. So these are all the configurations that that hyperparameter search tried. You’ll notice that many of them only ran for like three training steps, the ones that weren’t doing very well, but the more successful ones ran for the full duration of training. If I click in, I can see exactly what hyperparameters were used. Like the learning rate and momentum that resulted in this much better configuration. I can click into these checkpoints and see exactly where those checkpoints are being stored in case I need to grab that to go retrain a model, or fine tune a model, or do inference on there’s no questions about where on this file system I saved this checkpoint.

If I go back here so now’s where I get to start sort of critically doing my job, I was asked to say, “How can I improve this model?” So my first instinct is to hope that they provided me some good logs, that help me really get some more insight, into how this model’s performing. And so what I’m gonna do is open up a tensor board, we in Determined directly support tensor boards. So if you do any extra logs or any add any logs in your model, we make it so it’s easy to open up a tensor board and check those out. And lucky me, it looks like my co-worker happens to have made there be

a per class validation metrics. So I can actually look at how accurate the model is on each of the classes, and hopefully use that to try to diagnose some of the issues with the model. And flicking through this, the one that really sticks out to me is this dog class, you’ll notice that the accuracy on the dog class is ranging somewhere. Depending on the model between like point three, five and point four, five, mean average precision. And that’s not great, especially for a class like dog where I’d expect there to be, there are millions of pictures of dogs on the internet. I’d expect there to be lots of labeled examples of dogs, I’d expect that we can build a model that did better than this level of accuracy on dog. And so here’s my aha, something seems wrong with our ability to detect dogs. Let’s try to figure out what’s going on. So flipping back to this notebook, I can go in, and visualize this data set, and try to get a good sense of why this is happening. And when you look at the number of examples of each class we had in that V zero of the dataset, you’ll notice we only had like 50 images of dogs.

And that’s compared to thousands of images of people and hundreds of images of cats.

There’s our probable reason for why we weren’t doing very well in the dog class. So, most of the time, the best answer to not having enough data, is to go get more data. So I’ll turn around and go to my data team ask for more labeled images of dogs hopefully, I have a good team that will help me out here. And then say, they come back and provide me with this new version of the data set. And look at that we’ve got 500 dogs now, this is where the Delta connection really becomes useful. We got those new images, I could run all of that, same ETL code to land them into that Delta table. But now I have a very clear second version of that data set. And hopefully, I can really easily use that in Determined to kick off a new experiment, but with all this sweet new data that I’ve got. So that’s what we’re gonna do. As a data scientist, I’m trying to improve this model

I want the minimum amount of work to plug in this new data, train it really really quickly so I can iterate quickly, and hopefully see improvement. And you’ll see very quickly how Determined makes that very easy. First off, to change the data version, all we’re gonna do is change the version in this config file, you’ll see it was version zero before up here.

And we’re just gonna bump it up to version one and just like that will be connected to a new data set that we can use to train with no questions about what data we’re using it’s very obvious, we’re using the new one.

We’re going to, I think that

the other really cool thing that Determined is gonna do for us is,

let us do distributed training really easy. And what I mean by that is, if we’re doing this up here method of running in a notebook, it’s really pretty slow. It’s gonna take hours, days maybe depending on the problem. We wanna iterate quickly turn this out so we can go work on the next model and be done with our co-workers model that we didn’t really wanna work on in the first place. And so, distributed training is one of the fastest ways to do that to accelerate your training. And in Determined, all we need to do is specify that we want 12 slots per trial. And what that means is we wanna distribute our experiment across 12 GPUs. We didn’t have to go in and re-outfit our co-workers code, with distributed training code. We didn’t need to figure out how to do distributed training on some arbitrary cluster, we didn’t need to find three machines with 12 GPUs. We simply told Determined, we wanna experiment with 12 GPUs. And in term is gonna handle all that grunt work of kicking off a job like distributed across 12 GPUs. So to summarize, we’re going to use new version of the data, we’re going to fine tune based on the output of the experiment that our co-worker did that hyperparameter tuning experiment, and we’re gonna do it using 12 GPUs to do it really quickly. So if we quick kick off that job, sending it to Determined, and we flip back to Determined. If we switch to the dashboard, what we’ll notice is that job got kicked off, and we’re off the races. Well, this initializes and starts training to speed things up for the sake of the demo, I ran this experiment ahead of time, just so we could check out what it looks like.

And so if we go here, we can see that we train this model for what amounts to something like five epochs over the new data.

And first things first, we immediately got like a 1% lift in mean average precision. So 30 minutes of training later, we’re already seeing a percentage point accuracy gain, which as our boss told us, that could mean you know, huge change it or, be very valuable for the company. In the configuration we can very clearly see that this model was trained with a new version of a data. So there’s no question marks if my boss comes back and says, “What happened? “How does the model get better?” It’s easy to go in and say, “Well, we trained it “on this new version of the data.”

More importantly than that 1% bump in mean average precision, if we go look at the dog class, we can see that our accuracy on the dog class has actually jumped to something like point five, nine, you remember before it was something like point four, five. So we got a huge, huge bump, like a 50% jump in performance on the dog class, which is exactly what we had hoped for collecting all that new data.

So now we did it. Like we accomplished our task we did it in like 30 minutes by simply changing this config file, specifying a big distributed training job with new data. And now we say we wanna use it to actually do some kind of scoring, do some kind of inference, the thing that’s actually going to generate business value for the company. Well, before I say our co-worker wrote this code to launch inference using Spark, and the cool thing about Determined is it makes it real really easy to just specify some experiment ID and get a checkpoint out of that experiment, instantiate a model, and use it wherever you need to use it. So they wrote this inference code that just takes a Determined experiment as an argument, takes the location of some new data that we wanna test on, and then launches a big Spark inference job to quickly do inference. And because Determined tracks all of this important information, what the artifacts, the checkpoints, all of that, by simply bumping this from one to 26. Or in this case, we had launched this new experiment on a 31. If we simply bumped this to experiment 31

we can launch Spark inference using that new checkpoint, the one trained on the new data seamlessly, and quickly use Spark to do inference over a huge data set.

This is so useful because not just you can change one number in a Jupiter notebook, but you can integrate it into a workflow tool like airflow, or some sort of CICD solution to make sure that you’re delivering the latest and best versions of your models to do inference, whenever you’re doing it.

To summarize this demo showing how you can go from versioned data in Delta Lake, scale your experiments, collaborate on models, and train new, better models more quickly than would ever be possible without Determined. And then use DEtermined’s ability to keep track of the artifacts of those experiments to do inference quickly, using whatever your infrastructure is, such as Spark. I wanna remind you that all of this code is available on GitHub, including the actual inference, and ETL code, that it does use Spark. So if you have any curiosity about how we use Spark to load a checkpoint from Determined and actually do inference, I highly recommend going to check that out. And otherwise I appreciate you checking out the demo and more important than anything, I hope you’re saying safe and sane. And I hope Determined can make your life a little bit easier if you’re doing deep learning.

– Great. Thank you, David. So that concludes most of what I wanna tell you in this talk. Last thing I wanna to share with you is that Determined is open source. So we’ve spent the last few years working with cutting edge deep learning teams and really evolving the product based on really great feedback we’ve gotten from teams doing deep learning at scale, in a variety of industries. In the last month or so, we are really excited to have released the product as open source under an Apache License. And we would love to have people try it out, give us feedback, and explore whether it’s a good fit for your teams and your deep learning use cases. So you can find us on GitHub, all the documentation is online, we have a community slack channel that we’d love to see folks join and interact with us. So thank you very much for your attention and we’d love to hear questions.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Neil Conway

Determined AI

Neil Conway is co-founder and CTO of Determined AI, a startup that builds software to dramatically accelerate deep learning model development. Neil was previously a technical lead at Mesosphere and a major developer of both Apache Mesos and PostgreSQL. Neil holds a PhD in Computer Science from UC Berkeley, where he did research on large-scale data management, distributed systems, and programming languages.

About David Hershey

Determined AI

David Hershey is a solutions engineer for Determined AI. David has a passion for machine learning infrastructure, in particular systems that enable data scientists to spend more time innovating and changing the world with ML. Previously, David worked at Ford Motor Company as an ML Engineer where he led the development of Ford's ML platform. He received his MS in Computer Science from Stanford University, where he focused on Artificial Intelligence and Machine Learning.