Object Detection with Transformers

May 26, 2021 04:25 PM (PT)


Object Detection with Transformers: From Training to Deployment with Determined AI and MLflow

Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance.  Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.

What you will learn:

  • Basics of object detection including main concepts and techniques
  • Main ideas from the DETR and Deformable DETR approaches to object detection
  • Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
  • How to serve models trained in Determined using MLflow
In this session watch:
Liam Li, Developer, Determined AI



Liam Li: Hi, everyone. Welcome to my session on object detection with transformers. Really excited to be speaking at the Data + AI Summit this year. My name is Liam Li, and I'm a machine learning engineer at Determined AI, where we're building the leading deep learning platform for state-of-the-art training and hyperparameter search.
So today, I'll be talking to you about how to train these new object detection models with Determined and then deploy them with MLflow. This talk is split into three sections. First, we'll go over object detection, just so we all understand what the problem is. Then I'll go over the DETR (DEtection TRansformer) architecture, and also some follow-up work to that called Deformable DETR. Both of these models are recent approaches that have been introduced in the research community that I find very exciting. And then finally, we'll talk about how to train and deploy these models with Determined and MLflow.
So what is object detection? I think we've all probably seen something like the video I'm showing here, where for each frame of the video, we have a model that's generating predictions about the location and class of the different objects in that frame. So that is the goal of object detection: to identify the locations and classes of objects in an image. And object detection is a really core task in computer vision because it's a building block for many downstream tasks like pose estimation, event detection, video understanding, and so on.
So today, we're going to be focusing on object detection, and one dataset that's used a lot in the research community is the COCO dataset. Here we're looking at one of the images from the COCO dataset, and each image comes with a set of associated annotations. Each annotation consists of a class label and a segmentation mask, which is shown here on the right image, where we have a list of X-Y coordinates creating a polygon mask of the object. And from the segmentation mask, we can derive bounding boxes like the green box shown here by just taking the minimum and maximum of the X and Y values over all the coordinates.
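The min/max computation described above can be sketched in a few lines. This is a minimal illustration assuming COCO's flat `[x1, y1, x2, y2, ...]` polygon format; `polygon_to_bbox` is a hypothetical helper name, not part of any COCO library.

```python
# Deriving a bounding box from a COCO-style segmentation polygon.
# The polygon is a flat list of coordinates: [x1, y1, x2, y2, ...].

def polygon_to_bbox(polygon):
    """Return (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))

# A small hypothetical triangle mask:
print(polygon_to_bbox([10.0, 20.0, 50.0, 35.0, 30.0, 60.0]))
# -> (10.0, 20.0, 50.0, 60.0)
```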
So for object detection, we’re not going to be using the segmentation mask. We’re only going to be focusing on predicting the bounding box coordinates. The segmentation mask is used for other tasks like semantic segmentation that are kind of related to object detection, but we won’t be talking about that here.
So given the data, the prediction problem then is basically to take an image and then output a bunch of bounding boxes as well as associated classes for that image. So most of the approaches nowadays use deep learning models to be able to generate the predictions. And you can see here, we have a combination of large and small boxes that somewhat correspond to the actual objects in the image.
So to evaluate the performance of a model, we have to take the predictions that the model generates and match them to the ground-truth labels, and then evaluate both the quality of the bounding boxes and the correctness of the classes.
So to do that, there are two concepts we need to be aware of. The first is intersection over union (IoU). You can see in the first image we're showing the intersection between the green ground-truth bounding box and the predicted black box, and then on the right is the union of those two boxes. As you can imagine, the higher the ratio of intersection over union, the better our bounding box.
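The intersection-over-union ratio can be sketched directly from its definition. This is a minimal, self-contained version assuming `(x_min, y_min, x_max, y_max)` box coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle: the overlap of the two boxes (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    # Union area = sum of areas minus the double-counted intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two partially overlapping 2x2 boxes share a 1x1 intersection, so IoU = 1/7:
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```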
And with this notion of IoU, we can start setting thresholds to determine which predicted boxes we should actually consider as containing an object. So you can imagine if we set a higher IoU threshold, we'll get fewer predicted bounding boxes, whereas if we set a lower IoU threshold, we'll get more predicted bounding boxes, but some of them might be noisy or just bad predictions.
So the next step is calculating the actual metric that we're interested in, and for object detection, the metric that people usually use is called mean average precision (mAP). To understand this metric, we first need to understand a couple of sub-metrics. The first is precision, which tells us how good our model is when it makes positive predictions. What portion of the positive predictions are actually correct? In other words, how precise is the model when it's making positive predictions?
The other metric is recall. Recall captures what portion of the true positives we are actually detecting with the model. So the higher the recall, the more of the actual true positives we are able to detect. As you can imagine, if we set a higher IoU threshold, that will correspond to higher precision because we're requiring that the bounding boxes be higher quality, but that also usually means lower recall. So there's a trade-off between precision and recall, and to capture the performance of our model across multiple trade-offs between precision and recall, we simply average the precision over different IoU thresholds.
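The two sub-metrics come straight from the counts of true positives, false positives, and false negatives. A minimal sketch (the function name and example counts are illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of predicted positives that are correct.
    Recall: fraction of actual positives that were detected."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Say 8 predicted boxes matched a ground-truth object (TP), 2 did not (FP),
# and 4 ground-truth objects went undetected (FN):
print(precision_recall(8, 2, 4))  # -> (0.8, 0.6666666666666666)
```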
So that’s how we can calculate mean average precision. And this is the main metric through which models are evaluated.
Okay. So let's jump into the DETR and Deformable DETR models. The first question that you might have is: why should we care about DETR? The main innovation of DETR is that it replaces a lot of the custom components in existing object detection networks with transformers and simplifies these pipelines significantly. We've seen transformers revolutionize NLP, but not so much computer vision, so it's a natural question to ask whether transformers are also applicable to computer vision.
And as I said, existing methods, like the one shown below here, called Faster R-CNN, which is a very popular approach to object detection, have a lot of hand-designed components that are difficult to implement and introduce a lot of complexity. If we just look at a few of these: we are training a separate network called the region proposal network that's used to generate areas of the image that we should focus on, and then there are a bunch of components after that to filter and de-duplicate those predictions to get to a reasonable set at the end.
With DETR, all of these expert-designed components are replaced with a standard transformer encoder-decoder architecture, so you don't have to worry about these more complicated components on top. We can just use the standard implementation of transformers that you can get access to in, say, PyTorch, and use that as a layer in our object detection models. You can see the DETR architecture is much simpler.
So what does it actually look like? Let's dive in a little bit more. We'll walk through the architecture step-by-step to get a more detailed understanding of what's happening with DETR.
So first, there's a backbone, and this part is shared across almost all object detection approaches. Some commonly used backbones are the ResNet architectures pretrained on ImageNet. So we use a backbone architecture that extracts image features, and we unroll them into a sequence of inputs, because that's the kind of data that's expected by the transformer. The transformer operates on sequences, not images.
So we flatten the output from the backbone into a sequence of vectors, and then we add positional encodings to add positional information to that sequence. Without the positional encodings, we don't know how the unrolled pixels relate to each other. So we need to pass some information to the transformer that says, okay, this pixel is here relative to that pixel. That's why the positional encoding is used.
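The flatten-and-encode step can be sketched with plain Python lists. This is only illustrative: DETR actually uses a 2D variant of the positional encoding over learned CNN features, whereas here we unroll a tiny feature map and add the standard 1-D sinusoidal encoding.

```python
import math

def flatten_features(feature_map):
    """Unroll an H x W x C feature map (nested lists) into a sequence of H*W vectors."""
    return [pixel for row in feature_map for pixel in row]

def sinusoidal_encoding(position, dim):
    """Standard 1-D sinusoidal positional encoding for one sequence position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

# A tiny 2x3 feature map with 4 channels per position (all zeros for simplicity):
H, W, C = 2, 3, 4
fmap = [[[0.0] * C for _ in range(W)] for _ in range(H)]
seq = flatten_features(fmap)

# Add the positional encoding to each flattened position:
seq = [[f + p for f, p in zip(feat, sinusoidal_encoding(pos, C))]
       for pos, feat in enumerate(seq)]
print(len(seq), len(seq[0]))  # -> 6 4  (a sequence of H*W vectors of size C)
```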
That joint vector is then passed to the transformer encoder. If you're not familiar with transformers, you can think of the encoder portion as learning how to attend across the sequence for each position of the sequence. So if we look at the outputs here, we have an image of cows, and for each of the highlighted pixels, we're showing what the attention mask looks like. If you look at the red point on the brown cow, you can see that the attention mask corresponding to that pixel is learning to focus on the body of the cow, and we have similar masks for other pixels in this image. So the encoder is just learning which pixels are important for a particular pixel when making the downstream prediction.
The output of the encoder is then passed to the decoder. If you remember from the Faster R-CNN architecture, there were roughly 200,000 regions generated for consideration as potential objects. The DETR architecture doesn't do that. What it does instead is fix the number of object queries that it considers for the image; for most cases, 100 is more than sufficient, which is what they use in the paper. So they fix the number of potential bounding boxes, and then the decoder takes the output of the encoder, and for each of those possible objects, it tries to learn how the encoded output should influence the class prediction of that object.
So if you look at the decoder output for each of the bounding boxes, each of which you can think of as a single object query, we're learning to focus on certain regions of the image. If you see the highlighted parts of the elephant, those are the outputs from the encoder that the attention weights have decided are important for the prediction of the elephant. So the decoder is basically learning how to attend to the outputs of the encoder for its class predictions.
And then finally, the decoded outputs get passed to prediction heads that generate the bounding boxes as well as the class predictions, and a class prediction can be "no object" if there doesn't seem to be anything meaningful in a particular object query.
So how do we train the network? If you recall, we get predicted bounding boxes and classes from the network, so to train it, we have to match those predictions to the ground truth. In DETR, that is done with a bipartite matching algorithm, the Hungarian algorithm. What we're trying to do with this algorithm is match the predictions to the ground truth, and it's a one-to-one matching; that's why it's called bipartite matching. So for every single ground-truth label, we're going to match a single predicted bounding box and class to it.
Finding this matching is done by minimizing the loss that you see here, and the loss has two components. The first component is trying to maximize the class probability, so we want to match predictions that assign very high probability to the true class label. The second component captures how good the bounding box is, so we want boxes that match up very well with the true bounding box.
So if we minimize the loss, we get a matching, and that is the matching we use to form the training loss here, which has the same two components and is differentiable. So we can just use backprop to update the weights and train our network.
So how well does DETR perform? Here I'm comparing the performance of DETR to Faster R-CNN, which is a pretty standard object detection network. You can see that DETR is able to match the performance of Faster R-CNN with respect to mean average precision, but there are still certainly some drawbacks to DETR.
So in particular, you can see DETR converges slowly; it requires 500 epochs to reach the same performance as Faster R-CNN. Additionally, it has poor performance on small objects, and this is attributable to using a single layer from the CNN backbone. If you take a single layer, you just get one resolution of features. We'll talk about how to improve upon that with follow-up work next.
So follow-up work that addresses both of those issues was done by Zhu et al. in their paper that introduced Deformable DETR. The main contributions in this paper are the deformable attention module, which uses sparse attention instead of dense full attention, and an extension of DETR to use multi-scale features. We'll talk about each of those contributions in detail soon. In combination, both of those features help Deformable DETR converge much faster and do so with fewer samples.
So let's see what the improvements are. If you look at why DETR converges so slowly, the problem is attributable to the fact that we're unrolling CNN features into a very long sequence, and transformers are known to work poorly on long sequences of text; a similar thing holds here for images.
So what happens is that when you have a long sequence, the attention mask is spread thinly across all the items in that sequence, so it takes a long time for us to learn something meaningful to attend to. The long sequence also means attention is computationally intensive, due to its quadratic dependency on sequence length.
So why is it quadratic? In standard attention, for a fixed point in our sequence, say a query vector in an image that we want to attend with, we need to look at all the other pixels in the image for that particular query. And when you do this for all the queries in an image, you wind up with a total of (H × W) × (H × W) dot products. That's why it's a quadratic dependence, and for long sequences, this is very expensive.
The solution that the authors of the Deformable DETR paper came up with is called deformable attention. Their insight is that instead of attending to every single pixel in the image, we're going to learn just a few values that we should attend to for a particular query. So instead of looking at H × W pixels, we're going to look at K, with K being much smaller than H × W. The locations of these K values are learned, and we wind up with a much smaller number of total dot products for attention. So we're going to learn where to focus instead of trying to attend to everything.
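The cost argument above is just counting dot products, which a tiny sketch makes concrete (the feature-map size and K are illustrative; the real module also involves learned sampling offsets and interpolation, which are omitted here):

```python
def full_attention_dot_products(h, w):
    """Dense self-attention: every one of the h*w queries attends to every key."""
    return (h * w) * (h * w)

def deformable_attention_dot_products(h, w, k):
    """Deformable attention: every query attends to only k learned sampling points."""
    return (h * w) * k

# A hypothetical 32x32 feature map with k = 4 sampling points per query:
h, w, k = 32, 32, 4
print(full_attention_dot_products(h, w))           # -> 1048576
print(deformable_attention_dot_products(h, w, k))  # -> 4096
```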
So that helps with the convergence rate. The other question is how we can improve performance on smaller objects. If we look at other works in object detection, there are a lot of methods that make use of what are called multi-scale features. So instead of just taking the last layer of the CNN backbone network, we're going to take multiple layers, and each layer has a different resolution: the deepest layer gives low-resolution features, while earlier layers give higher-resolution features that you can use to make your predictions. Multi-scale features are an important component of many object detection approaches; you can see some of them below.
What the authors of the Deformable DETR paper do is take advantage of multi-scale features by generalizing the deformable attention mechanism that we looked at before to apply to multiple resolutions. So now, instead of applying attention to a single layer of the backbone, we're going to do so for multiple layers. That allows us to look at features that may be relevant at different resolutions, and then use that to hopefully boost performance on smaller objects.
So how well does Deformable DETR perform? If you look at the convergence curves, or the learning curves, for Deformable DETR versus the original DETR, we can see that Deformable DETR does converge much faster at every single point of the learning curve.
And if we go back to the table from before, we see that Deformable DETR is able to exceed the performance of Faster R-CNN and DETR with just 50 epochs of training. So we do converge a lot faster, like I said, because the attention weight is no longer spread out across 800 things; it's spread across just, say, 20 different points.
And if you look at the performance on small objects, Deformable DETR does do much better on smaller objects; it matches the performance of Faster R-CNN. So the multi-scale features are definitely also helping here.
And if we look at the state of the art in object detection, EfficientDet is definitely one of the leading methods out there. Deformable DETR with a better backbone achieves very similar performance to EfficientDet. So we are able to scale up Deformable DETR to larger models and match state-of-the-art performance with a much simpler pipeline for object detection.
Okay. So let's finally start talking about how you can train these really cool models coming out of the research world. We're going to be looking at how to do training efficiently with Determined. Determined is a training platform for deep learning, and we focus exclusively on training to make the experience great for the user. You can think of Determined as fitting in the middle here, where you can interface with other solutions before and after model training: solutions that help you with data preparation and pre-processing at the beginning, and then, after you train your model with Determined, the deployment solutions shown on the right.
The core features of Determined that make it super useful for model developers are shown in the middle here. We offer state-of-the-art distributed training. You can also use state-of-the-art hyperparameter search algorithms, with a leading early-stopping-based approach that I developed as a PhD student. And you can use the platform for experiment tracking and cluster management. So it's really easy for you to focus on model training and not have to think much about the DevOps aspects of training models at all.
So we'll go a little bit into how to do training with Determined in the demo, and then we'll talk about how you can deploy your models using MLflow serving.
So let's dive right into the demo. There's a link here so you can follow the notebook as I go along. The notebook won't be connected to a Determined cluster, so you won't be able to submit experiments, but I think it'll still give you a good idea of how you can use Determined for training. And if you're interested in training these models yourself, we have implementations of DETR and Deformable DETR available for you.
So let's see how we can train DETR models efficiently in Determined and deploy them with MLflow. Here, we're looking at a notebook that's running on a Determined cluster with [inaudible]. First, I'm just going to import all the necessary modules, along with some helper functions that we can use to visualize the COCO dataset that we described in the first part of the presentation.
This dataset here, we're reading from a Google Cloud Storage bucket. Let's just take a look at one of the images so that we can see whether things look reasonable. Here, we have a man holding a tray of hot dogs, with the ground-truth annotations shown on top of the image. So this is what we're trying to predict with our object detection models.
So let's see how you can train the DETR model in Determined. The first thing you have to do is define a DETR trial that implements the PyTorchTrial interface for the DETR model. If you're familiar with PyTorch Lightning, our PyTorchTrial is basically analogous to a LightningModule: it requires certain functions to be defined for your model so that we can go and train your model for you. These functions basically tell us how to create the data loaders as well as how to perform the training and evaluation steps for your model.
So in the initialization function, we're building the DETR model using the same code that's provided in the open-source GitHub repository from Facebook AI Research for DETR; that's the build model function we're calling here. Then we're creating the optimizer and LR scheduler as well, and making sure Determined is aware of the model, the optimizer, and the LR scheduler by passing those objects to the Determined context.
Other functions that Determined requires are the build training data loader and build validation data loader functions. These are pretty standard functions for building data loaders, but Determined does go and shard the data loaders for you automatically if you're performing distributed training.
And last are the train batch and evaluate batch functions. These are very similar to the functions you would write outside of Determined that specify how to compute the loss and then take a backward pass on it before making a step with your optimizer.
Once the trial is defined, the next step is specifying an experiment config to go along with that trial definition. The experiment config specifies things like hyperparameters, how long you want to train the trial for, and how often you want to evaluate. You can also specify the GPU resources that you want to use for the trial, and enabling distributed training and scaling up to more GPUs is as simple as changing the slots_per_trial field of the experiment config. You don't have to modify your model code whatsoever in order to scale up to distributed training; that's one of the big benefits of doing things in Determined.
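As a rough sketch of the kind of experiment config described here (Determined configs are YAML; the entrypoint name, metric name, and all values below are illustrative, not the exact config from the demo):

```yaml
name: detr_coco
entrypoint: model_def:DETRTrial   # hypothetical module:class for the trial
hyperparameters:
  global_batch_size: 16
  lr: 1.0e-4
searcher:
  name: single                    # train one trial with fixed hyperparameters
  metric: mAP
  smaller_is_better: false
  max_length:
    epochs: 50
resources:
  slots_per_trial: 2              # increase this to scale to distributed training
```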
Once the experiment config is defined, you can then submit it to the Determined cluster by running a CLI command in a terminal. So let's go ahead and do that so we can see the experiment in the Determined web UI.
We'll first go to the experiment directory that has the model and experiment config defined, and then we can simply execute this command to send the experiment, number 434, to the cluster. This is the Determined web UI, which is very useful for monitoring the state of your experiments, as well as the state of your cluster.
So we go back and find the experiment that I just created, which is 434. We can see that it's active right now, and the master is provisioning resources to satisfy the GPU requirements for the trial we just submitted.
We'll come back to this experiment a little later. In the meantime, let's talk about how you can also very easily run hyperparameter search in Determined. It's basically as simple as modifying the experiment config to define a search space via ranges for different hyperparameters instead of constant values. Once you define the search space, you can configure the hyperparameter search algorithm you want to use. We support a state-of-the-art hyperparameter search method called Adaptive ASHA that uses early stopping to significantly speed up hyperparameter search. And, similar to the experiment before, you can just submit it to the cluster by running a CLI command.
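A hedged sketch of what those config changes might look like, replacing the constant learning rate with a log-uniform range and switching the searcher to Adaptive ASHA (metric name and all values are illustrative):

```yaml
hyperparameters:
  global_batch_size: 16
  lr:
    type: log          # sample lr log-uniformly between 10^-5 and 10^-3
    minval: -5
    maxval: -3
    base: 10
searcher:
  name: adaptive_asha  # early-stopping-based hyperparameter search
  metric: mAP
  smaller_is_better: false
  max_trials: 16       # number of hyperparameter configurations to try
  max_length:
    epochs: 50
```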
So let's go back to the experiment that we submitted before and see how things are looking. Here, in experiment 434, we now see that there's a trial that's active, and we can see the validation metrics that have been computed during training for this trial. We can zoom in on the trial and see other metrics being generated by this experiment. So we have a bunch of different validation metrics, as well as training metrics. If you want, you can view these metrics at greater granularity: in [inaudible], we store those metrics every batch, instead of at the larger intervals shown here.
We're also saving checkpoints automatically during training for you. If a validation metric exceeds the previous best, we'll take a checkpoint, so you have access to that checkpoint later. These checkpoints also support automatic fault tolerance: if your instance fails while an experiment is running, Determined will just go back and relaunch that experiment for you, if you specify that for your experiment.
So once your model is trained, Determined has taken checkpoints for you automatically in the backend for automatic fault tolerance, and you can register those checkpoints with our model registry for easy access by running the commands here. I've done that already for one of the DETR checkpoints that I trained previously, so we can easily load it for use with MLflow. Now the checkpoint is loaded, and I'm going to make sure MLflow is installed.
Then, to deploy a model, all we have to do is massage the Determined checkpoint into the format that's expected by MLflow. That's what's done in this wrapped DETR module: it gets the structure to what is expected by MLflow. We can save this model with MLflow, along with the environment dependencies as well as the files required to execute the model, by running this command. And once the model is saved, we can serve it by executing a simple CLI command for MLflow.
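The wrapper in the demo isn't shown in full, but the shape MLflow's pyfunc flavor expects is a class with a `predict(context, model_input)` method. The sketch below stubs out the mlflow dependency entirely so the structure is visible on its own; in real code you would subclass `mlflow.pyfunc.PythonModel` and save the wrapper with `mlflow.pyfunc.save_model`. The class name and dummy model are hypothetical.

```python
class WrappedDETR:
    """Sketch of a pyfunc-style wrapper around a trained detection model."""

    def __init__(self, model):
        self.model = model  # e.g. the checkpoint loaded from Determined's registry

    def predict(self, context, model_input):
        # Run the underlying model on each input and return plain Python data
        # that serializes easily over a REST serving endpoint.
        return [self.model(x) for x in model_input]

# Usage with a stand-in "model" that returns empty detections, just to show the flow:
wrapped = WrappedDETR(lambda x: {"boxes": [], "labels": []})
print(wrapped.predict(None, [object()]))  # -> [{'boxes': [], 'labels': []}]
```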
Now, let's go back to the working directory where the MLflow model is stored. We can run this command to bring up a REST API endpoint where the model is waiting for predictions. Once the serving endpoint is up, we can just submit images to the endpoint for predictions with this function.
And let's see what the output looks like for the first image in the validation dataset. It's taking a while because the GPU is not quite warmed up yet, but here we can see the predictions from DETR. For the most part, they look pretty good. The only one that looks a little bit off is the refrigerator, but it is doing something pretty reasonable for the other objects in the scene.
So this kind of concludes the demo. And if you’re interested in running any of this yourself, there’s a link in the presentation to where you can find the code. So if you’re interested, definitely go and check it out.
Yeah. So thank you for your attention. And if you want to learn more about how to use Determined AI or ML Flow, you can go and visit the associated websites. And again, your feedback is really important to the organizers. So please don’t forget to rate and review the sessions. Thank you.

Liam Li

I recently completed my PhD in Machine Learning from Carnegie Mellon University, where I was advised by Ameet Talwalkar. My thesis was on efficient methods for automating machine learning model to mak...