Live demo and lessons learned building and publishing an advanced video analytics solution in the Azure Marketplace. This is a deep technical dive into the engineering and data science employed throughout, with all challenges encountered by combining Deep Learning and Computer Vision for object detection and tracking, the operational management and tool building efforts for scaling the video processing and insights extraction to large GPU/CPU Databricks clusters and the machine learning required to detect behavioral patterns, anomalies and scene similarities across processed video tracks.
The entire solution was built using open source Scala, Python, Spark 3.0, MXNet, PyTorch, scikit-learn as well as Databricks Connect.
Claudiu Barbura…: Hi, my name is Claudiu Barbura and I’m Director of Engineering at Blueprint Technologies. Welcome to my presentation on video analytics at Scale. All these acronyms, deep learning, computer vision, machine learning on the Databricks platform. So, let’s get right into it. The agenda for today, I’ll just start with the live demo of our video analytics solution because to me, that’s the best way to provide a context for the rest of the presentation.
Then we’ll dive into the solution architecture, and then I’ll go with the lessons learned throughout the entire solution building. Everything from infrastructure to the deep learning aspect, the machine learning aspect. The data science, all the engineering data science efforts that we put into this, into building this solution, and we’ll leave some room for Q and A at the end. So, here’s our video analytics solution that we built in a very short amount of time. It was two, almost three months of very intense work.
The purpose of the solution was to build a very narrow use case into data analytics, and it’s about monitoring traffic from traffic cameras. The goal was to showcase our expertise, our capabilities, and the intent is to provide it as a free trial in the Azure Marketplace. One of the prerequisites was to use Azure cloud services for this. So the entire solution is built using open source tech, but it had to be deployed in Azure, which we did, and I’ll do a quick demo of it. The solution will allow you to… And I’ll show you what it needs.
Basically the solution after you deploy it from the Azure marketplace, it needs a storage account where it’s going to pick up the videos from and it’s going to output some data. It needs a Databricks cluster to do all the processing, can be CPU or GPU. In order to have Azure active directory single sign-on, you just have to provide the app registration. So, Databricks configured, I’ll share what that means. Once you configure and then attach these assets to the application, the solution deploys on one VM. It’s a Linux VM in the cloud and then you attach these dependencies.
It’s important for us to know what kind of cluster, ADB cluster, you provide for all the data processing because that will determine which image to load into the Databricks cluster. Which Docker image to load, because it will have all the dependencies required for processing. All the OpenCV, all the object detection and tracking, all the libraries, and so on. Once you go to Workspace, you can create a new workspace, and I’ll do that right now. You provide it a name, AI Summit Demo, and then you go in and now you connect your blob storage account.
Then you pick a video that you want to process. I’ll pick a short video just to start it off. Intersection four, you get a preview of it. You can see this is an animated video that we created, and you see this individual here in the middle of the intersection doing a donut, and that’s going to be relevant for the demo moving forward. Once I’m happy with my video selection, I’ll say, fine, let me just go ahead and process it. I could look at the advanced settings and change the defaults, but basically the application will extract traffic patterns.
We’ll find peaks and valleys, you see in the picture. Surges and dives, and oscillations in traffic patterns. We’ll also find anomalies, and then you can configure the anomalies, and then you have some configuration for the processing of the videos. So some of the parameters here are important for smoothing, as we’ll see in the video. I’ll talk about that in detail in the slides, how to smooth out these curves because there’s a lot of noise coming from an imperfect object detection process. It’s never perfect, there’s errors so we have to do all this extra processing at the end.
Once I’m happy with my default settings, I’m going to go ahead and say now run the analysis. So what happens here is this video is now being sent to the Databricks cluster for processing and then any events, any signals coming from the processing are being sent back to the analyst. So, they’ll have the progress bar here, they’ll have the monitoring pipeline. They’ll be able to see what exactly happens with the processing. What happens behind the scenes is object detection, it already kicked off. Then it would be tracking and then video stitching with the bounding boxes, and then the insights extraction. At the end behaviors, anomalies, and then building similarities between videos, right? So all of it is done obviously in parallel in the cluster. The progress and the responses are being sent to the analyst, so they can start analyzing the output immediately.
It takes a while to just warm up the processing. In the meantime, I will switch to an existing pipeline, because it was already done before, to show what kind of insights you can find from the application. I’ll pick one of the videos, this one. So, you’ll see object detection drawing bounding boxes around the objects that are found in traffic. For that, we’re building an object detection graph. You can… I’m going to move my camera from here… You can see that the detection graph is in sync with the actual video. You can scroll to and zoom in to a particular moment in time in the video. This is obviously processing videos from blob storage, so it’s not yet connected. This solution is not yet connected to live traffic cameras, because that’s something we will do as a professional services exercise when we build a custom solution for a customer. But for this free trial in the Marketplace, just to showcase our tech, you connect to your existing videos in blob storage, and then we do the analysis.
So you see the zoomed in area, the patterns that we found in this video, everything from oscillation, peaks, dives, and surges. You can zoom into them and you can see that there is a peak which consists of two events. A surge event followed by a dive event. That’s showcased here in the… It’s highlighted in the object graph. So if you can look at that particular behavior of interest, you can spot it on the graph. Also, besides detecting these patterns, we also do anomaly detection. That comes in this cameras here. We’re looking at what are the most common paths? All the paths of vehicle trajectories in the video. We’re finding the average paths, and obviously at the end, the anomalies. What are the most rare paths in the video? In order to showcase this better, I’ll switch to… Before I do that, I’ll show you what also the app provides. It provides you with the configuration settings. This is what you can come in and change, maybe the rate of change for the value.
I want vehicles per second. I want steeper curves. That would be basically extracted here, because I care only about sudden changes in traffic behaviors. You can configure all these parameters and rerun the analysis. Basically after you configure that, the rate of change you can make it… Now I want steeper curves for those decreases for the valleys. A decrease of vehicle count followed by increase of vehicle count, you can change that rate and just say, now reanalyze. Basically that’s happening after the processing that happens in the cluster, and doesn’t use the cluster anymore. What’s happening in the spark cluster? We have already, this existing video has been processed. Say you have a bunch of segments, data segments, and then we have batches of segments.
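The surge, dive, peak and valley patterns from the demo can be sketched roughly as below. This is only an illustration under stated assumptions, not the product’s code: the function names, the `min_rate` parameter (vehicles per second) and the `max_gap` parameter are hypothetical, and the input is assumed to be an already smoothed series of vehicle counts sampled every `dt` seconds.

```python
def label_events(counts, dt, min_rate):
    """Find runs where the vehicle count changes faster than min_rate.

    counts: vehicle counts sampled every dt seconds (assumed already smoothed).
    Returns a list of (kind, start_index, end_index) with kind "surge" or "dive".
    """
    events = []
    for i in range(len(counts) - 1):
        rate = (counts[i + 1] - counts[i]) / dt  # vehicles per second
        if rate >= min_rate:
            kind = "surge"
        elif rate <= -min_rate:
            kind = "dive"
        else:
            continue
        if events and events[-1][0] == kind and events[-1][2] == i:
            # Same kind of event continues: extend the current run.
            events[-1] = (kind, events[-1][1], i + 1)
        else:
            events.append((kind, i, i + 1))
    return events


def combine_events(events, max_gap):
    """Pair neighboring events: surge then dive is a peak, dive then surge is a valley."""
    patterns = []
    for (k1, s1, e1), (k2, s2, e2) in zip(events, events[1:]):
        if s2 - e1 > max_gap:
            continue  # too far apart to count as one pattern
        if k1 == "surge" and k2 == "dive":
            patterns.append(("peak", s1, e2))
        elif k1 == "dive" and k2 == "surge":
            patterns.append(("valley", s1, e2))
    return patterns


counts = [2, 2, 5, 8, 8, 4, 1, 1, 4]
events = label_events(counts, dt=1.0, min_rate=2.0)
patterns = combine_events(events, max_gap=1)
```

Raising `min_rate` keeps only the steeper curves, which matches the idea of changing the rate and reanalyzing without going back to the cluster.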
So the reason we do this is because once we send a video to the cluster, we fragment it into multiple parts and they’re being processed in parallel. So, not all tasks are equal. Not all video segments are equal because of the richness of the information in those videos. They’re being processed at different times, so maybe some nodes in the cluster are slower than the others. Anyways, the Nash service acts as an aggregator, and it’s waiting for all these segments to be processed. At some point in time, it has enough segments to process. With a list of consecutive video parts, it’s going to trigger the rest of the analysis, like behavior extraction, the anomalies, and then building the similarities between videos. After tracking and data stitching, obviously.
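As a rough sketch of that aggregation step: segments finish out of order, and the rest of the analysis is triggered only once a run of consecutive parts is available. The class name, the `batch_size` parameter and the return shape are assumptions made for illustration; the talk does not describe the actual implementation.

```python
class SegmentAggregator:
    """Collects out-of-order segment results and releases consecutive batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # hypothetical: segments per analysis batch
        self.pending = {}             # segment index -> result, not yet released
        self.next_index = 0           # first segment index not yet released

    def add(self, index, result):
        """Register one finished segment; return any batches that became ready."""
        self.pending[index] = result
        ready = []
        # Release a batch only when every consecutive part of it has arrived.
        while all(self.next_index + k in self.pending
                  for k in range(self.batch_size)):
            batch = [self.pending.pop(self.next_index + k)
                     for k in range(self.batch_size)]
            ready.append(batch)
            self.next_index += self.batch_size
        return ready


agg = SegmentAggregator(batch_size=2)
first = agg.add(1, "seg-1")   # segment 0 is still missing
second = agg.add(3, "seg-3")
third = agg.add(0, "seg-0")   # segments 0 and 1 are now consecutive
fourth = agg.add(2, "seg-2")  # segments 2 and 3 are now consecutive
```

Each released batch would then go on to behavior extraction and anomaly detection, so the analyst sees partial results before the whole video is done.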
Everything is being monitored and shown here because it helps the ops team figure out if there’s something wrong with the cluster, or maybe something wrong with the VM. As you see in the slide… Because of the lack of time, we could not push everything to the cluster. That was one of the issues: when we’re running multiple videos in parallel, we’re now running in a VM, which is not a controlled environment. It doesn’t have a resource manager, so we would rather push all this processing into the cluster. There we have a resource manager, we can control the throughput. In the end, it gives you the total duration. At the end of it, when the video is being stitched together and tracking is being staged, we’re running that post-processing yet again, so we can find the final anomalies.
The reason we do this partial processing is because we want to give the analysts the ability to analyze videos before it all ends. So mid video, they can still look at the detection graph, the behaviors, or the anomalies without waiting until the end. Something about the cluster before we go back to the slides: we can run on both GPU and CPU clusters. There’s more work happening right now on optimizing for GPU clusters, and we’re moving to Spark 3.0, so we can have GPU-aware scheduling. This particular deployment is connected to this 10-plus-one cluster. These are the types of instances. If you notice, what’s key here is this Docker image, right? This is from us, from our Blueprint container registry. It has an image that will include all the dependencies required to do object detection, the computer vision processing, and the tracking in that cluster.
The solution is available in the Azure Marketplace for free. So go ahead and give it a try. Point to your videos in blob storage, and off you go. All the insights are extracted and dropped into Elasticsearch. There’s a Power BI camera that shows connecting directly to that data, and you see my particular intersection. The low traffic… low traffic three was my video. That’s the actual raw data behind it. So, it’s imperfect, and you’ll see how the object detection graph involves a smoothing algorithm to remove all this noise. I’ll talk about that in a slide. This is the architecture of the solution, as I mentioned. At the top, the requirement was to push it into the Azure Marketplace. Once you download it from there into your own subscription, you get basically the one Linux VM in Azure. The solution obviously includes deployment scripts, all based on Ansible and Docker Compose.
Then we have the Azure Active Directory component there. Remember, you have to register the app and that’s a manual process behind the scenes. You register the app with Azure Active Directory. Then you come back and configure it to use that for authentication. Then, what’s in the Linux VM? That’s this big box in the middle here. We have the Management Console, which you’ve seen… Basically all the rendering, all the analysis, the video rendering, the provisioning, the configuration, everything happens in that web console for the analyst, and for the administrator of the solution. Then you have two services, jNash and pyNash. The JVM service has all the APIs, the video manager. The pipeline monitor does the encoding and the web socket push, basically sending all those insights down to the web application, so we can inform the analysts that there’s more. There’s more segments, there’s more behaviors, and so on.
So, that’s the JVM. The python service does all the heavy lifting. It’s actually providing… It’s the pySpark driver. It does all the object tracking correction. The Features Space Generation for the behaviors anomalies, event aggregation, and similarity, I didn’t show the similarity. Basically, you click on that similarity camera. You can find similar videos to the one that is being processed, because that was the whole point of the analyst… Doesn’t have the time to watch multiple videos at the same time. So, the solution will build a similarity score. The similarity is based on the behaviors detected in the individuals, not on the actual environment. That would be something for our roadmap to look at the similar environments, similar footprint, and build a similarity score on that.
And so that’s the Python service, and then everything is being stored into Elasticsearch, and that’s part of the VM. These are the external dependencies: the Azure Databricks cluster that does the actual heavy lifting. The detection, object detection and classification, also overlaying the bounding boxes over the video segments. It’s doing the object tracking on those segments, and then blob storage, obviously, where we pick up the videos from. Some output from the processing is also stored there. What’s not in Elasticsearch is in blob storage, so that’s what the Spark jobs work with.
So lessons on the infrastructure side of things, I’ll quickly go over that. What’s the solution, why a Linux VM plus the dependencies? What’s key for us is to shift all compute from the VM to the Databricks cluster, because there we have a controlled environment, the resource manager. So, then we can avoid having contention, especially when you’re running multiple videos in parallel. The most expensive part is the video generation, the FFmpeg part, and that, with multiple videos, can cause the VM to basically freeze. We’re in the process right now of moving all that video generation, the [inaudible] Generation into the cluster. We’re pushing our own Docker images with OpenCV, PyTorch, scikit-learn, everything, to the Databricks cluster. We’re using Databricks Container Services, which has to be enabled for the cluster, so we can push those images.
Some optimizations that we learned as we built this and released newer versions… For the CPU architecture, having PyTorch bound to one thread basically gave us a two to five X performance improvement. That was huge. I guess the lesson learned here is to let Spark do the work, basically. If you have tasks running over segments, let them just run with one thread at a time. Basically what was happening before, each task in a VM in the cluster would create a bunch of contention with the other ones. So, binding it to one helped a lot. On the GPU architecture side of things, we had to move from MXNet to PyTorch because it just wasn’t working. MXNet just wasn’t working for us on the GPU side of things.
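One way to bind PyTorch to a single thread inside each Spark task is sketched below. The talk does not say how the binding was done, so this is an assumption: the function name is hypothetical, `torch.set_num_threads` is PyTorch’s call for limiting intra-op CPU threads, and the environment variables are the usual OpenMP and MKL knobs, which only take effect if they are set before the numeric libraries start their thread pools.

```python
import os


def bind_to_single_thread():
    """Limit CPU math libraries to one thread per process (one per Spark task).

    Returns True if PyTorch was found and limited, False if it is not installed.
    """
    # Set these before the numeric libraries create their thread pools.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return False
    torch.set_num_threads(1)  # one intra-op thread for CPU inference
    return True


torch_limited = bind_to_single_thread()
```

With one thread per task, parallelism comes from Spark running many tasks side by side rather than from each task spawning its own thread pool and contending with its neighbors.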
These configuration options for running, basically the Tesla on the CUDA device, that gave us some lift. The biggest lift we had from… Basically all these tasks are loading the model for detection, and they were loading it from disk. That was a lot of time wasted with that. As soon as we introduced that model as a Spark broadcast variable, it would be cached, basically, and then shared across tasks. We have this two to four X performance improvement on the GPU now, because… the GPU memory is… Because scheduling was done before on CPU. You have however many CPU cores, and all the task scheduling was done based on that. That would create too many tasks running in parallel on one node, and that would basically cause the GPU out of memory condition.
Now with Spark 3.0, we’re in the process of moving to GPU-aware scheduling, where we can know how many actual GPU processes we have, and then bind to that, basically do the partitioning based on that number instead of the CPU cores. So, that’s how we can avoid out of memory conditions, especially when you’re running multiple videos in parallel. I’ll show the performance between the two, the GPU and the CPU, per cluster at price parity. Pretty much, it’s faster and it’s cheaper to run on GPUs.
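A minimal sketch of what GPU-aware scheduling settings can look like in Spark 3.0. The two property names are Spark 3.0 resource scheduling properties; the helper functions, the one-task-per-GPU choice and the cluster sizes are assumptions for illustration, and a managed platform may already set some of this for GPU clusters.

```python
def gpu_scheduling_conf(gpus_per_executor=1):
    """Spark properties that make the scheduler count GPUs, not just CPU cores."""
    return {
        # How many GPUs each executor owns.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Each task claims a whole GPU, so at most one task runs per GPU.
        "spark.task.resource.gpu.amount": "1",
    }


def partition_count(num_executors, gpus_per_executor=1):
    """Partition the video segments by available GPUs instead of CPU cores."""
    return num_executors * gpus_per_executor


conf = gpu_scheduling_conf()
partitions = partition_count(num_executors=10)
```

Capping concurrent tasks at the number of GPUs is what keeps several tasks from loading frames into the same GPU memory at once.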
The one lesson learned is… The type of GPU instances we had in Azure are these Tesla V100s, which were just too expensive. They give us six cores and 112 gigabytes of RAM. We call it CPU RAM, but it’s really not useful to us. We would rather use something like Tesla T4s, basically cheaper GPUs, with less RAM and fewer cores, because we don’t use them for our particular workloads, for object detection. We don’t use that much. So, we would have even better numbers if we had different GPU instances. In the future, we’ll optimize towards that type of instance. This is the graph for GPU versus CPU for a few videos that we ran. You see it’s just that much faster on GPUs. If you do a price comparison, it’s just cheaper to run on GPUs.
This is the data behind those graphs. Lessons learned in object detection and tracking. Initially we used one set of models, that’s fasterrcnn resnet. Basically the ResNet class of detectors, and their performance is measured in mean average precision. It was 37, at 21 frames per second. That’s the speed of the model. We moved to a faster, more accurate architecture, EfficientDet, that has different models from D0 to D7. They have obviously different colors: the higher the number, the higher the precision, but also the higher the cost. So, the slower it is to process. That has a mean average precision of 46 and is slightly faster, 22.7 frames per second. That’s efficientdet-d3, which we picked by default, trained on the Microsoft COCO dataset, and the classes used are bus, car, and truck, because that’s what the narrow use case in the application is doing.
With the detection confidence threshold at 40%… All this is configuration based. What we learned is, not all videos are equal, algorithms are not perfect. What’s key to us is to expose these configuration options in the application. Ideally, even in the UI, so we can quickly fine tune it if new data shows up. As mentioned earlier, we had to move from MXNet to PyTorch due to the GPU architecture requirements. What also helped is frame batching. Initially we didn’t do that, and then we started batching: basically take a batch of 5 or 10 frames, load it into the GPU, do the object detection. Before you switch context, you do the tracking, because the object tracking is done on the CPU. So this context switching between GPU and CPU can be expensive, and batching solved a lot of that problem.
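The frame batching idea can be sketched as below: frames go to the detector a batch at a time, and tracking then runs over that batch on the CPU before the next batch is loaded. The `detect_batch` and `track` callables are hypothetical stand-ins for the real detector and tracker, which the talk does not show.

```python
def batches(frames, size):
    """Yield consecutive lists of up to `size` frames."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly shorter, batch


def process(frames, detect_batch, track, size=10):
    """Run detection once per batch, then tracking per frame of that batch."""
    results = []
    for batch in batches(frames, size):
        detections = detect_batch(batch)          # one detector call per batch
        for frame, dets in zip(batch, detections):
            results.append(track(frame, dets))    # CPU-side tracking
    return results


calls = []

def fake_detect(batch):
    """Stand-in detector: records the batch size and returns one list per frame."""
    calls.append(len(batch))
    return [[f"box-{frame}"] for frame in batch]

def fake_track(frame, dets):
    """Stand-in tracker: pairs each frame with its detection count."""
    return (frame, len(dets))

tracked = process(range(7), fake_detect, fake_track, size=3)
```

With seven frames and a batch size of three, the stand-in detector is called three times instead of seven, which is the saving in GPU and CPU context switches the talk refers to.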
For the tracking, we started with our own strategy, but then we moved to FairMOT and it works. Like it says here, it’s the default tracker right now. It’s 10 times faster when it works. There’s some videos where it doesn’t do a good job. In that case, we fall back to our own custom tracker, and that’s using JDE, Joint Detection Embedding, Kalman filter and template matching. It’s a combination of techniques that will, in those cases, give us better tracking, but it’s much more expensive to run. So, there’s a trade off there, but we can configure that. This is the object detection model. You see the curve for the EfficientDet architecture. You have the score on the left, on the y-axis, the mean average precision. The higher the D, the slower, but the higher the precision.
So, that’s why we settled for D3 here, it’s a good compromise, a good average precision for a good cost. On the horizontal axis, you have the FLOPs, basically that’s the cost. How many floating-point operations it needs to run. The ResNet family of detectors is here, so less precise and slower. So, that’s why we changed that in the architecture. Lessons learned with the behavioral patterns and the anomalies. The anomaly for that intersection four video is the donut performed by the driver. So, that’s detected as an anomaly, because it’s obviously not something you would want to see in traffic. We’re basically using clustering here, and the cluster representative is our pseudo-centroid in a non-uniform vector length feature space. I know that’s a lot of words there, but basically what we’re doing is we’re building all the vehicle trajectories, and we’re clustering them.
We’re using DBSCAN for short videos and agglomerative clustering for long videos. Those are the clustering approaches that work better, depending on video length. Rarity is computed as distance from the cluster centroid, so that’s how we determine what the outliers are here, and what the rare vehicle paths are. That’s how the donut showed up as an anomaly. I showed in [inaudible] Cameras the actual raw data and that’s the raw data, 30 frames per second. A 30th of a second, that’s the frequency, and that’s what the data looks like. After that, we’re applying our timescale based smoothing algorithms, and that parameter is the key. Do I want to see oscillations smoothed at one second or seven seconds? If I’m looking at a four hour video, I don’t want to see behaviors at that granularity in seconds, maybe I want to see it in minutes.
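The rarity idea can be illustrated with a small sketch. It assumes the trajectories of one cluster are already given, and it handles the non-uniform lengths by resampling every path to a fixed number of points, which is a simplification chosen for this example and not necessarily what the solution does. The pseudo-centroid is taken here as the member path with the smallest total distance to the others (a medoid); all function names are hypothetical.

```python
import math


def resample(path, n=8):
    """Linearly resample a polyline of (x, y) points to exactly n points."""
    if len(path) == 1:
        return [path[0]] * n
    out = []
    for i in range(n):
        t = i * (len(path) - 1) / (n - 1)   # fractional index into the path
        lo = int(math.floor(t))
        hi = min(lo + 1, len(path) - 1)
        f = t - lo
        out.append((path[lo][0] + f * (path[hi][0] - path[lo][0]),
                    path[lo][1] + f * (path[hi][1] - path[lo][1])))
    return out


def path_distance(a, b):
    """Mean point-to-point distance between two equally sampled paths."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def pseudo_centroid(paths):
    """Member path with the smallest total distance to all the others."""
    return min(paths, key=lambda c: sum(path_distance(c, p) for p in paths))


def rarity_scores(raw_paths, n=8):
    """Distance of every path from the pseudo-centroid; larger means rarer."""
    paths = [resample(p, n) for p in raw_paths]
    center = pseudo_centroid(paths)
    return [path_distance(center, p) for p in paths]


straight_a = [(0, 0), (5, 0), (10, 0)]
straight_b = [(0, 1), (2.5, 1), (5, 1), (7.5, 1), (10, 1)]
straight_c = [(0, -1), (10, -1)]
donut = [(0, 0), (5, 5), (0, 10), (-5, 5), (0, 0)]
scores = rarity_scores([straight_a, straight_b, straight_c, donut])
```

In this toy data the three straight paths sit close to the pseudo-centroid while the looping path scores far higher, which is the sense in which a donut shows up as a rare path.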
I want to see patterns at one minute intervals or five minute intervals, but if it’s a short video, a 30 second video, I want to see it at a lower rate. So, that parameter influences the smoothing of the curve. If at one second you see the orange curve, at seven seconds, you see how it’s acclimated at the higher rate. We’re using low-pass filters and Fourier decompositions for eliminating basically the noise from the raw data, from the raw curves. That’s how we build that object detection graph, and that’s how it’s so smooth. That was it. Please provide some feedback, and we have time to take some questions from the audience. Thank you so much for attending my presentation.
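To show how a timescale parameter drives the smoothing, here is a minimal sketch that uses a centered moving average, which is one simple kind of low-pass filter. It stands in for the low-pass filters and Fourier decompositions mentioned in the talk and is not the solution’s own algorithm; the function name and parameters are assumptions.

```python
def smooth(counts, fps, timescale_s):
    """Centered moving average whose window spans roughly timescale_s seconds.

    counts: per-frame vehicle counts; fps: frames per second of the raw data.
    """
    window = max(1, int(round(fps * timescale_s)))
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)                 # the window shrinks at the edges
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


# Noisy raw data at 30 frames per second: the detector flickers between 4 and 6.
raw = [4 if i % 2 == 0 else 6 for i in range(90)]
one_second = smooth(raw, fps=30, timescale_s=1)
seven_seconds = smooth(raw, fps=30, timescale_s=7)
```

A larger timescale averages over more frames, so short oscillations disappear and only the slower trends remain, which is the behavior wanted for long videos.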
Claudiu is Director of Engineering at Blueprint Technologies, he oversees Product Engineering where he builds large scale advanced analytics pipelines, IoT and Data Science applications for customers ...