At Stora Enso, we keep track of our inventories very precisely and with full responsibility, but because it is quite a time-demanding task, we could not ignore the possibility of automation, even if only a partial one. In our current situation, calculating the amount of wood in a mill's inventory means flying over the entire area with a camera-equipped drone, capturing an RGB image and a ground-level amplitude image of the area, and storing them on a server. Later, an expert from the mill opens those images in our internally developed software and marks all wood piles; based on those marks and the ground-level amplitude images, the total amount of wood is estimated. This is very precise and indeed a much faster mechanism than it used to be, when the expert needed to walk through all the piles personally, which took several hours of work.
To take this approach further, we developed an image processing pipeline that is able to automatically identify all wood piles with very high precision. The main technologies we are using are PyTorch, Azure Databricks and OpenCV. The multi-step pipeline incorporates two well-known deep learning models: ResNet for image classification, producing a heat map of wood-appearance probability, and U-Net for image segmentation, to fine-tune the borders of each wood pile. The final product of the mechanism is a binary array reflecting whether a given pixel represents a wood pile or not. The models were trained on GPUs provided by the Azure Databricks environment, which significantly increased the speed of the training and testing phases. High precision (over 97% in both nets) and the stability of the model (wood of all shapes and colors is identified) are leading us towards an extension with a multi-task machine learning approach.
– Good afternoon or good evening.
Welcome to this presentation, entitled Wood Log Inventory Estimation by Deep Learning. My name is Tomas Vantuch, and I will present a few slides about it.
Let’s start with the agenda. I will briefly introduce the company I work for, and also myself a bit, then I will describe the problem itself. That will be followed by a description of some data pipelines we use here. All of them are based on Azure resources, so basically there will be a lot of applications of Databricks. Then I will focus on the model itself, how it’s trained, how it works, and then its results and some conclusions at the end.
So Stora Enso is a leading global provider of renewable solutions in packaging, biomaterials, wooden construction and paper. We basically focus on wood processing and, you know, pushing towards more efficient and more renewable materials.
The title of a renewable materials company has its reason, and the reason is that everything we produce is based on wood. It can range from wood-constructed buildings to simple packaging solutions for food, or whatever you can imagine.
As I said previously, my name is Tomas Vantuch. I have worked here for almost two years as a data scientist. I’m responsible for image processing, especially in this case, but I’m also working with some time series and with some development around chatbots and web services. Before that, I used to be a senior researcher at the Technical University of Ostrava in the Czech Republic, and my specialization was time series analysis.
So, let’s talk about the project.
As you can imagine, because we work with wood, we need to process a lot of harvested wood. In several places in the world, we have these mill yards, which are completely covered by woodpiles, and we need to keep track of the amounts in those woodpiles, so we basically do an inventory of that material.
It can be done in various ways, but our aim is to do it efficiently and as quickly as possible. For that purpose, a drone-based application was developed. It basically means that a drone flies over the mill and takes pictures of the entire surface and all the objects there. The pictures from the drone are then put together into an aerial map, and this aerial map is delivered to a human expert from the mill. His task is to identify and label all the wood inside. And because we also have a digital elevation model, meaning we have the ground amplitude of each pixel, we can then calculate the final amount of wood in the picture. But a very important aspect is that the wood needs to be identified very precisely by the human expert. As you can imagine, this is quite a time-consuming task; the human expert needs to place more than 200 different points and several coefficients there, and right after that, the amount can be calculated. So this was taken as an opportunity, and a machine learning model was created. Not to substitute the human effort, but to save some time and, you know, to leverage this information towards a quicker solution. As you can imagine, those aerial maps are quite big images; they are in TIFF format. On average, they are around 500 megabytes in size, because they contain the three RGB channels and also the digital elevation model.
At the first stage, they are put together by the software on the map processing server on the left side; that’s the place where we start, and in this raw form, they’re transferred into Azure Data Lake. At this point they are marked as raw data, it’s their raw stage. Here we do just some trivial sorting, and we also collect some metadata about those images and put it into a database. After everything is ready, their stage is changed to production and they’re basically ready to be used for prediction.
Some of those images are labeled for training purposes, as you can see in this branch in the middle.
This diagram basically means that some of those images are used for model training. It’s pretty simple, nothing advanced: the model is trained in the Databricks environment, then it’s stored in a storage account with some additional metadata about the model. Basically, what kind of model it is, what task it’s solving, et cetera, because that metadata is used later for model selection.
We use several different models, and when a request for a prediction task comes from outside, we have to decide what kind of model we are going to load for that prediction. The model is loaded, the specific data for the prediction is loaded as well, and this prediction pipeline is then executed. All the predictions are stored in the production stage in the data lake together with the source images, so further investigation by a human expert is possible. We also collect some metadata about the data processing and about the prediction, so we can also make some improvements over time.
So, the task itself. Now you can see some images of what we are dealing with and what our mills look like from above, and the complexity of the task rests on several pillars. Basically, we deal with different shapes of woodpiles; not all of them are so straight, some of them are curved a bit. The color of the wood also varies, and most of those variations are based on the wood type because, you know, different species can have differently colored wood. And also the amount of water inside the wood is a significant player in these color changes, because wet wood, as you can see, is much darker than dry wood, which is much lighter.
Also, a lot of differences in lighting and shadows play an important role in this image processing task. And, as you can imagine, a low amount of labeled data is a terrible issue in every machine learning task, and we are facing that as well.
Naturally, it looks like an image segmentation task, or more specifically a semantic segmentation task. But because we need to make it even more precise, and a lot of those woodpiles are too close to each other, we need to provide instance segmentation as well. What is the difference? Basically, semantic segmentation identifies where the wood is and where it is not, like which pixels are related to woodpiles and which are the rest, basically the roads, objects, buildings, et cetera. So by semantic segmentation, we can identify all the objects that represent woodpiles. And by instance segmentation, we are able to separate them from each other, and that’s something we really need, because we need to keep track of the inventory of each woodpile separately. If some of those woodpiles are too close to each other, then semantic segmentation basically outputs them as one object.
So, as you can see, for this purpose the model utilizes two deep learning networks. The first one predicts the objects, the second one predicts the borders, and by their combination, the final outcome is delivered. Both of them use the same inputs, but I will come to that later. Now, about the U-Net model: it is pretty popular, widely used in a lot of applications, widely used in Kaggle competitions, et cetera.
In our case, it was implemented in the PyTorch framework with the use of the segmentation models library. I really recommend it; it’s well written and easy to use. What is special about the U-Net model? Well, it’s pretty similar to any other deep learning autoencoder, with one modification.
Every layer of the encoding path is connected to the decoding path by an additional concatenation. So the results from those layers, those encoding levels, are transferred to the decoding path, and after the concatenation, the internal information is increased and the outcome is more precise; that’s basically how it works. And for the encoder here, we were able to use the ResNet-18 architecture.
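The skip-connection idea can be sketched in a few lines of PyTorch. This is a toy model for illustration only, with hypothetical names and sizes; the real network uses a full ResNet-18 encoder through the segmentation models library. It shows one encoder level whose feature map is concatenated into the decoder path.

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Toy U-Net: one encoder level, one decoder level, one skip connection."""
    def __init__(self, in_channels=4, out_channels=1):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # decoder sees 16 (upsampled) + 16 (skip) channels after concatenation
        self.dec = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, out_channels, 1)  # logits, no sigmoid here

    def forward(self, x):
        e = self.enc(x)                      # encoder features, kept for the skip
        b = self.bottleneck(self.down(e))
        d = self.up(b)
        d = torch.cat([d, e], dim=1)         # the U-Net skip connection
        return self.head(self.dec(d))

model = MiniUNet()
logits = model(torch.randn(1, 4, 256, 256))  # 4 channels: RGB + elevation
```

The concatenation is what distinguishes U-Net from a plain autoencoder: fine spatial detail from the encoder reaches the decoder directly, which is exactly why the borders come out more precise.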
It means that we just increased the number of convolutional filters. As I said previously, for input channels we use four of them, the RGB and the digital elevation model, and 256 by 256 is the resolution of the input images.
They were taken as small cutoffs from the entire aerial maps, because an entire aerial map has a resolution of around 5,000 to 6,000 pixels by maybe 10,000 or 12,000 pixels, so they were pretty big and could not be predicted at once by this model. So smaller cutoffs were taken, and it was predicted piece by piece.
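The cutoff step can be illustrated with a small NumPy sketch. The function below is a hypothetical reconstruction (the exact tiling code is not shown in the talk); it slides a 256-pixel window with the 50-pixel overlap mentioned later for the training cutoffs, clamping edge tiles to the image border so every pixel is covered.

```python
import numpy as np

def tile_image(img, tile=256, overlap=50):
    """Cut a (C, H, W) aerial map into fixed-size tiles with a pixel overlap.

    Hypothetical sketch: edge tiles are aligned to the image border so the
    whole map is covered and every tile has the full (tile, tile) size.
    """
    _, h, w = img.shape
    step = tile - overlap
    tiles, coords = [], []
    for y in range(0, h, step):
        for x in range(0, w, step):
            y0 = min(y, h - tile)   # clamp so the tile stays inside the map
            x0 = min(x, w - tile)
            tiles.append(img[:, y0:y0 + tile, x0:x0 + tile])
            coords.append((y0, x0))
    return tiles, coords

img = np.zeros((4, 600, 900), dtype=np.float32)  # RGB + elevation channels
tiles, coords = tile_image(img)
```

Keeping the coordinates alongside the tiles is what allows the per-tile predictions to be stitched back into one full-resolution mask afterwards.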
And as I said, we utilized two networks, one for objects and one for borders.
Image augmentation was taken as a pretty necessary step here. It is, again, a pretty common way to increase the generalization and robustness of the model itself. The aim is basically that you are altering some features of the image that you are using for training, but you are not destroying the information inside; you are not modifying the context itself. So, as we can see, some basic image processing operations are applied to this image of a cat, but we can still recognize the cat.
The same was applied in our case, so some rotation, flipping, blurring and some linear contrast were used for the RGB channels, and some dropouts and hue manipulation were used for the digital elevation map.
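A sketch of what such channel-specific augmentation might look like in NumPy. This is illustrative only: the production pipeline likely uses a dedicated augmentation library, and the function name and all parameter values here are my assumptions. The key point it demonstrates is that geometric ops hit all four channels together (so labels stay aligned), while photometric ops differ per channel group.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Augment a (4, H, W) sample: channels 0-2 are RGB, channel 3 is elevation."""
    out = img.copy()
    # joint geometric augmentation: random horizontal flip and 90-degree rotation
    if rng.random() < 0.5:
        out = out[:, :, ::-1]
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(1, 2))
    # RGB-only linear contrast: scale the channels around their mean
    alpha = rng.uniform(0.8, 1.2)
    rgb = out[:3]
    out[:3] = np.clip(rgb.mean() + alpha * (rgb - rgb.mean()), 0.0, 1.0)
    # elevation-only dropout: zero out a small random fraction of pixels
    mask = rng.random(out[3].shape) < 0.02
    out[3] = np.where(mask, 0.0, out[3])
    return out

sample = rng.random((4, 64, 64)).astype(np.float32)
aug = augment(sample)
```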
Using different augmentations for different channels worked pretty well for us, and it just took some fine-tuning to find the best combination, so as not to make the model too sensitive or the data too complex for the training. The motivation is to still keep the model converging, you know. To describe the overall model’s workflow: well, it consists of four major procedures. The first one is related to data preprocessing. Basically, we need to normalize the RGB channels of the image as well as the digital elevation model. And from the labels we also need to produce two masks: one mask of wood objects and one mask of pile borders.
Then the first U-Net model is trained on the object masks and then produces object predictions. The second U-Net model is trained on the border masks and then produces predictions of the borders. Some additional post-processing then joins those predictions and, with some simple filtering and processing, combines that information to produce the final outcome.
Training and testing of the model: to mention a few important things, we utilized 14 different aerial maps for training. Each of them contained more than 100 different woodpiles, so they were pretty big. And by making use of those smaller cutoffs with 50-pixel overlaps, we were able to produce more than 700,000 images.
Not all of them were used for training; because of the cross-validation technique, we always mix them randomly and use part of them for training and part of them for testing.
As I said previously, both models have the same architecture, so they were U-Net models with ResNet-18 encoders.
As for the optimization function, I started with Adam, and Adam was used for the majority of epochs, but at the end, the last 10 or 15% of epochs were optimized by stochastic gradient descent with a very, very low learning rate. The motivation for that was basically that when the model was not converging anymore, when it had hit some decent level, I just wanted to squeeze it a little bit more, if possible, by this stochasticity. Sometimes it worked, sometimes it didn’t, but I didn’t mind. For the loss function I used something pretty normal that is used in similar tasks: binary cross entropy with logits loss.
For the batch size, I used 35 images. This number was adjusted to be as large as possible, because in these image processing cases, you know, a higher batch size actually improves the convergence, but if you make it too big, you can basically overflow the memory. So 35 was about right. The decreasing learning rate is totally standard.
I decreased the learning rate in an almost linear way; it always depended on how the model was converging. And something about the hardware setup, you know: 16 CPU cores from Databricks were used for data loading, because every time a smaller image was created or taken, the normalization and everything was done there, so it was done in a parallel way, and then those batches were trained on the GPU.
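Put together, the training setup described here, Adam for most epochs, a late switch to low-learning-rate SGD, batch size 35, and binary cross entropy with logits, might be sketched like this. The stand-in model, the epoch count, and the learning rates are made up for the illustration; only the optimizer switch, the loss, and the batch size come from the talk.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 1, 3, padding=1)   # toy stand-in for the U-Net
loss_fn = nn.BCEWithLogitsLoss()        # binary cross entropy with logits

epochs = 20                             # illustrative; real counts differ
switch_at = int(epochs * 0.85)          # last ~15% of epochs use SGD

for epoch in range(epochs):
    if epoch == 0:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    elif epoch == switch_at:
        # squeeze a converged model with low-learning-rate SGD
        opt = torch.optim.SGD(model.parameters(), lr=1e-5)
    x = torch.randn(35, 4, 64, 64)      # batch size 35, as in the talk
    y = (torch.rand(35, 1, 64, 64) > 0.5).float()
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```

In a real loop, each "epoch" would iterate over many batches served by the parallel data loaders; one random batch per epoch here keeps the sketch short.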
As a validation metric, the dice coefficient was taken.
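The dice coefficient itself is easy to state: twice the overlap of prediction and label, divided by their total size. A small NumPy version:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks: 2|A intersect B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(a, b)  # intersection 2, sizes 3 + 3 -> 4/6
```

Unlike plain pixel accuracy, this metric is not inflated by the large background areas of the aerial maps, which is why it is the usual choice for segmentation tasks.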
So, quite an important step, I think, was the post-processing, ’cause now we have predictions of objects and predictions of borders, and we have to join them together and somehow take the best out of both.
So basically, the first step of post-processing is to make those predictions themselves. So we obtain the matrix of objects and the matrix of borders. Then we apply some thresholds to them.
The threshold is a kind of confidence level, and it can be derived from how the network is doing on testing data. Because the network was much better on object detection, and our motivation is basically to remove or mitigate false hits, the confidence level for object detection was set much higher, like 0.5 or 0.55. So the object matrix was thresholded by this level, and a binary matrix was the product of that.
From this binary matrix, as a third step, we were able to identify the contours of the objects there, and those objects that were too small in size were omitted, because they were considered false hits. You know, a woodpile has to have some minimal size; I think the threshold was something like 500 pixels to remove those too-small objects.
Then we did similar stuff with the matrix of borders, the mask of borders, but with one difference, and that is the confidence level, which was adjusted significantly lower. The reason for that is simple: the accuracy of the border predictions was a little bit lower, and we could also afford some false hits there. So basically, whenever some significant confidence appeared, and it was like 0.25 or 0.3, it depended.
We took it as a border pixel. So now we have one binary matrix of objects and one binary matrix of borders, and we basically subtracted the borders from the objects. So those piles that appeared to be too close together, and were joined in the object prediction, were separated by this subtraction. Then we were able to identify the contours again, and those contours were much more precise, so we didn’t have so many piles joined together.
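The whole thresholding-and-subtraction procedure can be sketched as follows. The confidence levels and the minimal-area filter come from the talk; the function name, and the use of scipy's connected-component labeling as a stand-in for contour finding, are my assumptions.

```python
import numpy as np
from scipy import ndimage

def split_piles(obj_prob, border_prob,
                obj_thr=0.55, border_thr=0.3, min_area=500):
    """Combine object and border probability maps into separated pile instances."""
    objects = obj_prob > obj_thr          # high-confidence binary object matrix
    borders = border_prob > border_thr    # low-confidence binary border matrix
    separated = objects & ~borders        # subtract borders to split touching piles
    labels, n = ndimage.label(separated)  # one integer label per candidate pile
    for i in range(1, n + 1):
        if (labels == i).sum() < min_area:  # false hits are too small to be piles
            labels[labels == i] = 0
    return labels

# toy demo: one blob covering two touching piles, split by a predicted border
obj_prob = np.zeros((40, 40))
obj_prob[10:30, 5:35] = 0.9
border_prob = np.zeros((40, 40))
border_prob[10:30, 19:21] = 0.8
labels = split_piles(obj_prob, border_prob, min_area=20)  # 500 px in production
```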
And for each of them, we basically drew them, applied some dilation, you know, to fix those edges and smooth them a bit, and exported those contours into a JSON format. At the end, we also used some polygon simplification, because some of those edges were not so smooth, and they also produced too many points for us, so this simplification was worth doing. We lost minimal information in that operation.
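Polygon simplification of this kind is typically done with the Ramer-Douglas-Peucker algorithm; the talk does not name the exact method, so the following NumPy implementation is only a plausible stand-in. It drops points that deviate less than a tolerance from the line between the kept endpoints.

```python
import numpy as np

def simplify_polygon(points, tol=2.0):
    """Ramer-Douglas-Peucker simplification of a polyline (N, 2 array)."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    line = end - start
    norm = np.hypot(line[0], line[1])
    if norm == 0.0:
        dists = np.hypot(*(points - start).T)
    else:
        # perpendicular distance of every point to the start-end line
        dists = np.abs(line[0] * (points[:, 1] - start[1])
                       - line[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > tol:
        # keep the farthest point and recurse on both halves
        left = simplify_polygon(points[:idx + 1], tol)
        right = simplify_polygon(points[idx:], tol)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

# a jagged, nearly straight edge collapses to its two endpoints
edge = [(0, 0), (1, 0.5), (2, -0.4), (3, 0.3), (4, 0)]
simple = simplify_polygon(edge, tol=1.0)
```

With a small tolerance, the simplified contour stays within a couple of pixels of the original while holding far fewer points, which matches the "minimal information loss" observed here.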
So, to give you some numbers: on the test data, or test samples, we were able to reach around 97 to 98% dice-coefficient-based accuracy on object detection, which was pretty good, I think. For border detection, we only had around 90 to 94% dice coefficient. It varied based on, you know, how the data were mixed in cross-validation, et cetera. But what is most important for us is, when we put those two predictions together, how it actually predicts the final amount. The final amount was calculated separately on the digital elevation model and just compared with our previous method, you know, the labeling by a human expert. We obtained results so similar that the deviation from the previous solution was less than 5%. Actually, in most cases, it was around 2 to 3%, so this is definitely the way to do it, and a good enough result for us to take it even further. Visually, to show you some results, here they are. These images were not used for the training; they were used only for testing. As you can see, in most cases, those piles are identified pretty correctly. We can see one false hit on the left image, as well as on the right image.
We can see that probably two piles did not come out well.
But how to fix this issue? There are several ways, but right now we are implementing an interface for the human expert, so he can join two piles and fix those few mistakes. So it will definitely not be a fully automated solution yet, but we have definitely already saved the human expert some time.
So that’s it, that’s all from me. Thank you for your attention.
Tomas Vantuch received his Ph.D. in Computer Science and Computation Technology in 2018 from VSB - Technical University of Ostrava, Czech Republic. His research interests focus on bio-inspired and soft-computing methods, deep learning, and their use in complex system analysis and prediction. Currently he is working as a data scientist for the Stora Enso company and as a senior researcher at VSB - Technical University. He is the author of more than 30 academic publications (65+ citations) on various innovative approaches in computer science, and has given several talks at international conferences.