In the last few years, deep learning has achieved dramatic success in a wide range of domains, including computer vision, artificial intelligence, speech recognition, natural language processing and reinforcement learning. However, good performance comes at a significant computational cost. This makes scaling training expensive, but an even more pertinent issue is inference, in particular for real-time applications (where runtime latency is critical) and edge devices (where computational and storage resources may be limited). This talk will explore common techniques and emerging advances for dealing with these challenges, including best practices for batching; quantization and other methods for trading off computational cost at training vs inference performance; architecture optimization and graph manipulation approaches.

– Hi there, everyone. Welcome to this Spark + AI Summit 2020 online presentation, Scaling Up Deep Learning by Scaling Down.

I’m Nick Pentreath. I’m @MLnick on Twitter, GitHub and LinkedIn. I’m a principal engineer at IBM’s Center for Open-Source Data and AI Technologies, or CODAIT, where I focus on machine learning and AI open source applications. I’ve been involved with the Apache Spark project for a number of years, where I’m a committer and PMC member. I’m also the author of Machine Learning with Spark.

Before we begin, a little bit about CODAIT, the Center for Open-Source Data and AI Technologies. We’re a team of over 30 open source developers within IBM, and we work on open source projects in the data and AI space that are foundational to IBM’s product offerings. We focus on improving the enterprise AI lifecycle in open source. This includes the Python data science stack; open exchanges for data and deep learning models; Apache Spark, which is a big part of that; TensorFlow and PyTorch as deep learning frameworks; AI ethics; Kubeflow; and open standards for model serving and machine learning.

Today, we’ll start with a deep learning overview, and some computational challenges involved in training and deploying deep learning models. And we’ll look at three main ways of solving these challenges from improving model architectures through to model compression, and wrap up with model distillation. And we’ll start by taking a look at the basic machine learning workflow which starts with data.

We typically analyze that data, and then we want to train a machine learning model on it. But it does not arrive in a form that is amenable to training: it typically arrives in raw form, not as tensors or vectors. So we want to apply some pre-processing to that data. We transform it and extract features, and then it is in a form ready for training our machine learning model. Once we have trained the model, we deploy it to a live environment, where we’re predicting on new data as it comes in. Either the machine learning model is creating new data itself as part of the process, or new data is arriving externally. This closes the loop and turns the workflow into a machine learning loop. Within this workflow, the three most computationally intensive areas are data processing, model training, and deployment.

So deep learning is a form of machine learning. It’s been around for quite a while: the original theory dates back to the 1940s, and some of the first computer models to the 1960s. As you can see here on the right, this is an old perceptron neural network machine from the 1960s. It fell out of favor, however, in the 80s and 90s, in large part because it was not successful on real world applications and because of a lack of compute resources for actually training these models. But recently there’s been a large resurgence, due to bigger and better data sets, with the proliferation of mobile devices and internet scale data collection, as well as standardized data sets such as ImageNet for various competitions. That, combined with better hardware in the form of GPUs and now specific deep learning hardware such as TPUs (tensor processing units), and improvements to algorithms, architectures and optimization techniques, has led to brand new state of the art results in domains from computer vision through to speech recognition, natural language processing, language translation and many more.

So modern neural networks are typically called deep learning because they’re made up of multiple layers, and they become quite deep and complex. In computer vision, the core building block is the convolutional neural network, and these have been very successful in image classification, object detection and segmentation. For sequences and time series, we have recurrent neural networks, which have been successful in machine translation, text generation and other natural language processing tasks. And again in NLP, word embeddings and brand new models in the form of transformers with attention mechanisms have achieved state of the art results. The final piece of the puzzle is modern deep learning frameworks, which encapsulate a computation graph abstraction and provide automatic differentiation capabilities, high levels of flexibility and GPU support. These have allowed both researchers and practitioners to advance the state of the art in deep learning models quite rapidly.

Now until recently, the amount of compute required for these types of models roughly followed the Moore’s law two-year doubling timeframe. But in the modern era, more and more complex models have led to the computational requirements doubling every three to four months. Clearly, hardware alone is not enough to continue the advances, and we need to really think about efficiencies on the software side. So we’ll take the example of image classification and work with it as we explore new architectures for efficiency. Image classification starts with an input image. We pass that image through a complex deep learning model, and then we get a prediction, a set of predicted classes.

A popular model for image classification, and until fairly recently a state of the art one, is called Inception V3. We see a small part of it on the left.

And the core building block for these types of models is the convolutional block, which is typically a convolution operation, a normalization and some activation function, typically a ReLU, a rectified linear unit. We can effectively represent this computation as a matrix multiplication. And a complex modern network such as Inception is made up of many of these convolutional blocks.
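To make the matrix multiplication point concrete, here is a minimal sketch in plain Python. It is not from the talk: it assumes a single-channel image, a single filter, stride 1 and no padding, and the function names are illustrative. Each image patch is unrolled into one row of a matrix (a trick commonly called im2col), so the whole convolution becomes one matrix-vector product.

```python
def conv2d_direct(image, kernel):
    """Sliding-window convolution (cross-correlation), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def im2col(image, kh, kw):
    """Unroll every kh x kw patch of the image into one row of a matrix."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h) for j in range(out_w)
    ]


def conv2d_as_matmul(image, kernel):
    """Same result as conv2d_direct, computed as patch matrix times flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)
    flat_kernel = [v for row in kernel for v in row]
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Reshape the flat result back into a 2D feature map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

direct = conv2d_direct(image, kernel)
via_matmul = conv2d_as_matmul(image, kernel)
print(direct == via_matmul)
```

Deep learning libraries commonly apply the same idea across many channels and filters at once, which is one reason fast matrix multiplication matters so much for these blocks.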

And Inception V3 in particular has roughly 24 million parameters in total, and achieves almost 79% accuracy on ImageNet.

So we can see here a set of recent models and their scores.

And we can compare the accuracy versus the computational complexity. The older models were fairly accurate, but fairly complex: a large number of parameters, though not necessarily a large number of operations. Then we moved into the era of larger, more complex models in order to get better accuracy. And recently we’ve moved into another era, which is trying to become more and more efficient. So we’re pushing more and more towards the top left hand side of this graph, which is where we want to be: high accuracy for relatively low computational cost.

Another way of looking at computational efficiency is the information density of the model. This is a measure of how effectively the model’s parameters are being used. You can see here that the older models are certainly not efficient, and some of the newer ones, such as the ResNets, are becoming more and more efficient. But still, if you look at the percentages here, 10 to 12% per million parameters is really telling us that there’s a lot of inefficiency in the representation, and these models are highly over-parameterized.
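As a rough illustration of this metric, the sketch below assumes information density means top-1 accuracy divided by parameter count in millions, which is what the "% per million parameters" unit suggests. The figures are the approximate ones quoted in this talk (Inception V3 here, and the EfficientNet baseline mentioned later); the function name is made up for illustration.

```python
def information_density(top1_accuracy_pct, params_millions):
    """Accuracy percentage points per million parameters (assumed definition)."""
    return top1_accuracy_pct / params_millions


# (approximate top-1 accuracy in %, approximate parameters in millions), as quoted in the talk
models = {
    "Inception V3": (79.0, 24.0),
    "EfficientNet baseline": (77.0, 5.0),
}

for name, (accuracy, params) in models.items():
    density = information_density(accuracy, params)
    print(f"{name}: {density:.1f}% per million parameters")
```

On these rough numbers, the smaller model gets several times more accuracy out of each parameter, which is the sense in which the larger model is over-parameterized.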

Talking about deep learning deployment briefly: typically model training is done on premises or in a cloud environment, using at least one GPU, and typically multiple GPUs for large data sets and large models. For cloud based deployment scenarios, we can typically use that same infrastructure to deploy, and then we are simply trading off cost versus performance. If we want high performance, we can pay for that hardware and bear the cost. But the story is very different when we look at edge devices, which have much more limited resources. Memory is limited, and we can’t use all the memory on the device, because typically we’re competing with other applications. Compute is limited, and again we’re competing with other applications. And network bandwidth is limited. So we are in a restricted environment with respect to the size of the model that we can run on these devices, and the latency and compute resources we can utilize to execute those models. Shipping the models to the device also takes network bandwidth. This also applies to low latency applications, which may be outside of the edge device paradigm: for example, financial trading or programmatic advertising, where our latency requirements are very low, perhaps a few milliseconds or even microseconds. There we can’t afford to spend too much time computing, so we need a more efficient model.

So how do we improve this performance efficiency, so that we can deploy to environments such as edge devices as well as meet the requirements of low latency applications? Today, we’ll cover four ways. The first is architecture improvements over time, and new innovative ways of building architectures to achieve this goal. The second and third are model compression techniques, ways to make models smaller. And the last is model distillation.

So starting with architecture improvements.

One common trend recently has been building specialized architectures that specifically target these low resource environments. On the left, we see a representation of that Inception model. And on the right we see one very popular such model for these low resource edge devices, called MobileNet.

Now, as we saw, the Inception model is built of the standard convolutional building blocks, while the MobileNet model uses something slightly different, called the depthwise convolution building block. It effectively splits the standard convolution into two components and computes them more efficiently. This takes up to eight times less computation.
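The size of that saving can be checked with a little arithmetic. This sketch assumes the two components are a depthwise 3x3 convolution (one filter per input channel) followed by a 1x1 pointwise convolution that mixes channels, as in the MobileNet paper. The layer sizes are made up for illustration, and only multiplications are counted.

```python
def standard_conv_mults(kernel, in_ch, out_ch, feat):
    """Multiplications for a standard convolution on a feat x feat feature map."""
    return kernel * kernel * in_ch * out_ch * feat * feat


def depthwise_separable_mults(kernel, in_ch, out_ch, feat):
    """Multiplications for a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * in_ch * feat * feat  # one filter per input channel
    pointwise = in_ch * out_ch * feat * feat           # 1x1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input and output channels, 56x56 feature map.
std = standard_conv_mults(3, 128, 128, 56)
sep = depthwise_separable_mults(3, 128, 128, 56)
print(f"standard: {std:,} multiplications")
print(f"depthwise separable: {sep:,} multiplications")
print(f"reduction: {std / sep:.1f}x")
```

The ratio of the two costs works out to 1/out_ch + 1/kernel², so with 3x3 kernels and a reasonable number of output channels the reduction lands between eight and nine times.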

Now in return, we typically need to give up something, and that something is typically the accuracy of the model. So recall that Inception, at 24 million parameters, achieves almost 79% on ImageNet. MobileNet V1 has more than 80% fewer parameters, and we give up about 8% in terms of accuracy.

Another innovation in these mobile architectures and similar architectures is that they are created out of a backbone architecture that can then be scaled up and down. There are two scale multipliers: one for making the model effectively thinner or wider, so adjusting the number of parameters in the model that way, and one for scaling the resolution of the input image. In this way, we can actually trade off how much accuracy we want versus how much computation resource and budget we actually have. If we can afford to spend more on compute, or have a more powerful device, we can move up this green curve and be closer to the higher levels of accuracy. But for smaller devices or older devices, we might want to move down this curve. So one class of model allows us to trade off along this curve.
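Here is a small sketch of how those two multipliers affect the cost of a single layer. It assumes the depthwise plus pointwise layer structure from the MobileNet paper, a width multiplier `alpha` that scales the channel counts, and a resolution multiplier `rho` that scales the feature map side. The layer sizes and the function name are hypothetical.

```python
def separable_layer_mults(kernel, in_ch, out_ch, feat, alpha=1.0, rho=1.0):
    """Multiplications for one depthwise separable layer under both multipliers."""
    m = int(alpha * in_ch)   # thinner or wider: fewer or more input channels
    n = int(alpha * out_ch)  # and fewer or more output channels
    f = int(rho * feat)      # smaller input resolution gives smaller feature maps
    depthwise = kernel * kernel * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 channels in and out, 56x56 feature map.
base = separable_layer_mults(3, 128, 128, 56)
for alpha, rho in [(1.0, 1.0), (0.75, 1.0), (0.5, 1.0), (0.5, 0.5)]:
    cost = separable_layer_mults(3, 128, 128, 56, alpha, rho)
    print(f"alpha={alpha}, rho={rho}: {cost / base:.2f} of baseline cost")
```

Because the pointwise term dominates, cost falls roughly with the square of each multiplier, which is why a handful of settings can cover a wide range of compute budgets.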

And the original MobileNets evolved into MobileNet V2. It’s a very similar idea, using that same depthwise convolutional backbone but adding some further algorithmic and network tricks, effectively making it more efficient. You can see that MobileNet V2 just moves us slightly up that curve and becomes more efficient, while still allowing us that same trade off. So we get around the same number of parameters, slightly fewer, and a couple of extra percentage points of accuracy.

So that is the next phase of this model architecture evolution, where we have MobileNet models that are trying to be very, very efficient, with very small parameter and operation counts, but still getting relatively high performance.

We can see here that indeed these models, and other similar efficient architectures like ShuffleNet and SqueezeNet, have a much higher information density, a much higher percentage per parameter metric.

The next evolution, recently, was EfficientNet. The idea here is very similar: find some sort of backbone architecture that can be scaled up and down. But a technique called neural architecture search was used to actually find this backbone. So instead of manually handcrafting it, one crunches through a huge number of potential architectures and finds the best one, optimizing for both accuracy and efficiency in the form of the number of floating point operations. This results in a class of models that can be scaled up and down, depending again on compute budget, trading off computation against accuracy.

And we end up with a slightly higher number of parameters than MobileNet, for example, but a few extra percentage points of accuracy. Now one of the downsides of neural architecture search is that it takes a huge amount of resources to actually search through all of these architectures. So something like building your own EfficientNet may be out of reach for most practitioners, but obviously large players like Google and so on have the resources to create these models.

So we can see that we can scale this model up from B0, which is the baseline at 5 million parameters and 77% accuracy, all the way up to the largest model, which increases the parameters by 12 times and gives us an extra 8% or so of accuracy.

So the same concept was applied to the MobileNet architectures, where a similar hardware-aware neural architecture search technique was used to optimize both for accuracy and performance budget, and again resulted in a family of efficient models: not bumping up the parameter count too much, but effectively increasing accuracy significantly.

So we’ve seen that this idea of building an efficient backbone model and then scaling it up and down has been very successful recently, whether the backbone is designed manually or with neural architecture search. But designing these backbones and finding the right architectures is hugely costly, not just in terms of computation, but certainly in terms of energy and even environmental impact.

So one interesting recent idea, from a paper by IBM and MIT researchers, is to effectively train one large network, and after that network is trained, a number of sub-networks can be picked out of that main network to target particular environments. So you can still trade off accuracy versus computation budget, but you don’t have to retrain a different architecture or find a different architecture for each target. You train once, and then you’re able to pull out the architecture that you want.

We’ve seen that simply doing more research, effectively throwing a lot of compute budget and human power at the problem, can help solve some of these challenges and give us more efficient architectures. But we may not be able to do that ourselves. We may not have the compute resources to run neural architecture search, and then we have to rely on large organizations releasing these models. So what are some of the other approaches that we can use on our own models? Well, the first of these is model pruning. The idea here is simply to reduce the number of model parameters. Now if we just throw away model parameters, we don’t really know what’s going to happen to the model. So we want to do it in a precise and principled manner. This is very similar to regularization with the L1 norm in standard linear machine learning models. We want to look at all the weights and remove the ones that have a very small impact on the prediction. If those weights are close to zero anyway, taking them away is not going to have much impact on the overall performance and accuracy of the model, so we can effectively set those weights to zero. And once they’re zeros, we can ignore them. We can ignore them when we’re saving and transporting the model around, so in other words we’re compressing the model. And we can also ignore them when we’re computing, so we get lower latency.
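A minimal sketch of this idea, assuming simple one-shot magnitude pruning of a flat list of weights. Library implementations typically prune gradually during training and work on whole tensors; the function names and weight values here are illustrative only.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def to_sparse(weights):
    """Keep only (index, value) pairs for non-zero weights: the compressed form."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def sparse_dot(sparse_weights, inputs):
    """Dot product that skips the pruned weights entirely."""
    return sum(w * inputs[i] for i, w in sparse_weights)


weights = [0.9, -0.02, 0.01, -0.7, 0.03, 0.5, -0.04, 0.005]
pruned = magnitude_prune(weights, sparsity=0.5)
compact = to_sparse(pruned)
print(pruned)
print(f"stored weights: {len(compact)} of {len(weights)}")
```

The sparse form is what gives the storage saving, and `sparse_dot` shows where a latency saving can come from, provided the runtime can actually exploit the zeros.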

So we can see here that model pruning can in fact be very effective. This is an example from TensorFlow’s model optimization library. You can see that for a large complex model like Inception, we can achieve 50% sparsity, so we can effectively throw away half the parameters of the model, with a very small impact on accuracy. Beyond that, our accuracy starts to degrade, so we start giving up more and more accuracy to get more and more sparsity. Similarly, for a small model like MobileNet, we can achieve 50% sparsity by giving up a relatively small amount of accuracy.

And it doesn’t just work for image classification. Here’s an example from language translation. Here we see that for a language translation model we can go up to 80% sparsity, so throwing away up to 80% of our model parameters, and we actually get a slight increase in performance. This is really telling us that these large complex models especially are in fact highly over-parameterized already: they’re too big, and they have too many weights for what they need to do. We saw that in the information density stats. Many of these weights are not actually used, so you can just throw them away without giving up much performance, as long as you do it in a principled manner. And again, depending on what environment we’re targeting, if we’re willing to give up more in terms of accuracy, we can make an even smaller model.

Another form of model compression is quantization.

Most deep learning computation uses 32 bit, or even in some cases 64 bit, floating point numbers. The idea behind quantization is to reduce the numerical precision of both the weights and the operations in the network, and we do this by binning values. If we start with a typical 32 bit number, we want to bring it down to a representation with fewer bits, effectively fewer bins. This means that we can reduce the size needed to store the weights. If we go from 32 bit down to 16, for example, we are effectively halving the size of the weights that we need to store. And if we go down to eight, we’re getting a four times reduction.

But we’re not just saving space. It turns out that computation on these low precision representations can be faster, and on some very resource constrained, very small edge device CPUs, floating point operations may not even be supported.

So the idea here is to take a look at the weights.

And if we look at a histogram of some of the weight values, for example, we can recreate that with a low precision set of weights. We do that effectively by approximation: we’re binning each value into the closest bin. In return for giving up a bit of accuracy, we get a much smaller model size and potentially improved latency. Popular targets for this are, as I mentioned, 16 bit floating point, but also 8 bit integer encoding, where we get a large size benefit as well as a performance increase in operations.
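The binning step can be sketched as follows. This assumes a simple scheme with evenly spaced bins between the smallest and largest weight, storing a bin index per weight plus a scale and an offset. Real libraries use more refined variants, and the function names and weights here are illustrative.

```python
def quantize(values, num_bits=8):
    """Map each float to the index of its closest bin between min and max."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    bins = [round((v - lo) / scale) for v in values]
    return bins, scale, lo


def dequantize(bins, scale, lo):
    """Recover approximate float values from the bin indices."""
    return [b * scale + lo for b in bins]


weights = [-1.0, -0.31, 0.0, 0.27, 0.64, 1.0]
bins, scale, lo = quantize(weights, num_bits=8)
restored = dequantize(bins, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(bins)
print(f"largest rounding error: {max_error:.5f} (bin width {scale:.5f})")
print(f"size reduction from 32 bit to 8 bit storage: {1 - 8 / 32:.0%}")
```

Each restored weight is off by at most half a bin width, which is the bit of accuracy being given up in exchange for storing one small integer per weight instead of a 32 bit float.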

There are two ways in which we can do quantization. The first, and probably most common, is to do it after training. This is really useful if, for example, you can’t retrain a model or don’t want to: either it’s too expensive to retrain the entire thing on a huge data set, or you could retrain but don’t want to spend that time and effort. Here we will typically give up some accuracy. Most of the time, you can target 16 bit floating point, dynamic range, or 8 bit integer encoding. We can see, for example, that for both Inception and MobileNet in this case, we get a decrease in accuracy from post training quantization.

And for a smaller model, it’s effectively a bigger decrease in accuracy.

The other way that we can do this is by actually training with quantized weights and quantized inputs. This is much more complex to implement, but fortunately some of the deep learning libraries out there will allow us to do just that. Effectively, what we need to do here is take into account the fact that values are going to overflow or underflow the low precision representation, in particular in things like gradient computations. Gradient accumulations and updates can become very, very small, and that’s where we risk going below what the low precision representation can express. A way around that is typically to keep higher precision representations around for certain things like gradients, gradient updates and weights, and then at the end either throw them away if we don’t need them or do some quantization afterwards. The downside here is that it’s a lot more complex, and of course it requires the original training data, and we need to retrain the entire model. But we can get a very large efficiency gain with minimal accuracy loss. You can see again, for an Inception model, a small decrease in accuracy, and similarly for MobileNet, a much smaller decrease in accuracy than before.
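The reason for keeping a higher precision copy of the weights can be shown with a toy example. This is only a sketch under strong assumptions: a single weight, a very coarse quantization grid with a step of 0.25, and plain gradient descent on made-up data. It is not how any particular library implements quantization aware training.

```python
def fake_quantize(w, step=0.25):
    """Snap a weight to the nearest multiple of `step` (a coarse low precision grid)."""
    return round(w / step) * step


def train(keep_master_copy, steps=200, lr=0.01, step=0.25):
    # Toy data generated by y = 0.75 * x, so the ideal weight is 0.75.
    data = [(x, 0.75 * x) for x in (1.0, 2.0, 3.0)]
    w = 0.0
    for _ in range(steps):
        # The forward pass always uses the quantized weight.
        w_q = fake_quantize(w, step)
        # Gradient of the mean squared error, evaluated at the quantized weight.
        grad = sum(2 * (w_q * x - y) * x for x, y in data) / len(data)
        w = w - lr * grad
        if not keep_master_copy:
            # Storing only the low precision weight rounds small updates away.
            w = fake_quantize(w, step)
    return fake_quantize(w, step)


print("with full precision master copy:", train(keep_master_copy=True))
print("low precision weight only:", train(keep_master_copy=False))
```

With the master copy, many small updates accumulate until they are large enough to move the quantized weight to the next bin. Without it, every update is smaller than half a bin and is rounded away, so the weight never leaves its starting value.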

So again, this indicates that a large model like Inception is probably over-parameterized anyway, so lowering the precision and reducing the model size does not hit our accuracy too much. Whereas MobileNet is already a lot more efficient, so by reducing the amount of information, through the approximation effectively happening in the network, we have a larger impact on accuracy.

So if we take a look at the latency impact and the size impact: for post training quantization, for a large model we still get a benefit, a 25% decrease in latency, whereas in this example for the smaller model we actually get an increase in latency. But for quantization aware training, we get an improvement for both. And in terms of model size, we get a 75% reduction, which is pretty significant.

This used to be quite a challenging thing to do: you’d have to roll your own quantization, and there was a lot of manual effort involved. But fortunately, with the deep learning libraries that we now have out there, there are a lot of options. TensorFlow, with TensorFlow Model Optimization, and PyTorch have quantization built in, and there are third party libraries that allow you to do this in various ways. So it’s become much easier to simply run the optimization on your model directly from the library, in a pretty straightforward way.

So the final technique we’ll discuss today is called model distillation. As we’ve seen in the previous sections, many of these large models, and even the smaller ones, but certainly the large complex ones, are definitely over-parameterized. There’s a lot of inefficiency in the way that they represent the weights. The idea behind model distillation is to take one of those large, complex, over-parameterized models and effectively use it to teach a much smaller, simpler model. So you have a student model, which is small and simple, and a teacher model that’s bigger and more complex. Effectively, we want to distill the core knowledge of the large model down into a smaller model that is going to be more efficient in terms of memory utilization and compute resource utilization.

So you can see here the core idea: the teacher model will have a certain number of layers, the student model will typically have a much smaller number of layers, and we are effectively using the predictions from the teacher model to teach the student.
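One common way to use the teacher’s predictions is a loss that mixes two terms: how well the student matches the teacher’s softened output distribution, and how well it predicts the true label. The sketch below follows the general recipe from the original distillation paper (a softmax with a temperature, and the soft term scaled by the temperature squared). The logits, the `temperature` and `alpha` values, and the function names are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Cross-entropy between the teacher's soft targets and the student's soft predictions.
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_teacher, soft_student))
    # Ordinary cross-entropy against the true label.
    hard_loss = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss


teacher = [5.0, 2.0, 0.5]
agreeing_student = [4.0, 1.5, 0.2]
disagreeing_student = [0.2, 1.5, 4.0]

print(distillation_loss(agreeing_student, teacher, true_label=0))
print(distillation_loss(disagreeing_student, teacher, true_label=0))
```

A student whose outputs follow the teacher’s gets a lower loss, and the soft targets also tell the student how the teacher ranks the wrong classes, which a one-hot label cannot.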

And somewhat surprisingly, if we look at this example from the original model distillation paper by Geoff Hinton, we have a baseline model, and if we use an ensemble of 10 of these baseline models as the teacher, we can actually get a single distilled student model that does better than the baseline and only slightly worse than the ensemble. So again, this is really indicating that a lot of these models are effectively over-parameterized, and that we can extract the core learning, effectively the core knowledge, of those models and use it in a simpler model. It probably also indicates that there’s a lot more work to be done in improving the efficiency of the core and baseline architectures.

So model distillation has certainly been successful in the image classification and computer vision space, though it is probably not used as much these days relative to techniques such as quantization especially, and to some extent pruning. But in more recent advances in natural language processing, transformer based models in particular, distillation has been very successful. A couple of them are mentioned here, DistilBERT and TinyBERT, and there are a number of others. In many cases, these distilled models manage to perform as well as, or in some cases outperform, much, much larger BERT language models. This is again an indication that these huge models, especially these massive language models trained on the entirety of Wikipedia, for example, are learning a lot but also have a huge amount of redundancy and over-parameterization. So again this indicates, number one, that through this distillation process we can create a much more efficient model for deployment into resource constrained environments that performs as well, but also that we have a long way to go in improving the efficiency of the core architectures out there.

So in conclusion, we’ve seen that these new model architectures for making networks more efficient, in particular those targeting edge devices and low resource environments, are evolving really rapidly. For certain use cases, in particular computer vision as we’ve seen, and in some cases natural language processing, you may be able to find something that fits your needs, and in that case you can just use EfficientNet, MobileNet or, for instance, one of the smaller versions of the BERT models.

But in many cases, there may not be a pre-trained efficient architecture or edge device targeted architecture for what you need. In those cases, compression techniques such as pruning or quantization can yield really large efficiency gains, and they’re pretty easy to do in the main deep learning frameworks and supporting libraries. You can certainly look at combining both techniques, pruning and quantization, but doing so may be a little bit trickier: do you quantize first and then prune, or go the other way around? That is certainly an open area of research.

There’s been good success in using them together, but it would take some experimentation for any particular use case. You can also take one of the more efficient model architectures that have been trained by one of the larger organizations and still apply these compression techniques. And finally, model distillation is a bit less popular overall and arguably a lot more complex, but potentially compelling in some use cases, in particular with the new advances in NLP. This is an area of rapid research evolution. There’s a lot of research coming out of both academic institutions and corporate labs.

And there’s a lot of new work, from the end of last year and even the beginning of this year, that is making rapid advances in the space.

So thank you very much for joining today. You can find me on Twitter and GitHub as MLnick. I encourage you to go and check out codait.org, where we list all the open source projects in the data and AI space that we work on. One of those projects is the Model Asset Exchange, which is a free and open repository for state of the art deep learning models across a wide variety of domains, including computer vision and natural language processing. You can find a few of the models that I mentioned there, for object detection, image classification and segmentation. You can also find a couple of options for deploying Docker containers, with the ability to switch between smaller models for targeting edge device environments, for example, and larger models for targeting cloud environments. So that’s one example of how you can dynamically use different models for different deployment environments.

Finally I’ve provided a few references for some of the topics that we’ve discussed today. Thank you very much. And I encourage you to reach out to me.

IBM

Nick Pentreath is a principal engineer in IBM's Center for Open-source Data & AI Technology (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of "Machine Learning with Spark". Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Nick has presented at over 30 conferences, webinars, meetups and other events around the world including many previous Spark Summits.