Wehkamp is an online department store with more than 500,000 daily visitors. The wide variety of products presented on the Wehkamp website aims to meet the needs of its many customers.
An important aspect of any customer visit to the website is a high-quality, accurate visual experience of the products. To achieve this, thousands of product photos, especially of fashion garments, are processed in the local photo studio. Since the backgrounds of these images vary widely, background removal is one of the steps in the processing pipeline.
Done manually, this is very tedious and time-consuming work; for millions of images, the time and resources needed to remove backgrounds by hand are too high to sustain the dynamic flow of newly arrived products.
In our presentation, we describe our automated end-to-end pipeline, which uses machine learning models to remove the background from images.
Data preparation: At the start, after cleaning the dataset, each image was resized to 320×320 pixels. We then used the k-means algorithm to split the data into six clusters and applied various augmentation techniques to classes with few images.
Background removal model: Our model is built on an architecture inspired by the paper “U²-Net: Going Deeper with Nested U-Structure for Salient Object Detection”.
Training process: We worked in a Databricks environment and used workers with graphics processing units. Horovod and PyTorch allowed us to distribute the training process. To avoid out-of-memory errors, images were loaded batch by batch during each epoch. The trained model is stored in an S3 bucket.
In this talk, we share how to create an efficient pipeline for deep learning image processing within the Databricks environment.
Simona Stolnicu: Hello, we are so happy to be here. We want to share with you how we built a step-by-step pipeline for automatically removing the background from e-commerce fashion images. My name is Simona, and I’m a data scientist at Levi9, Romania.
Oleksander Miroshnychenko: And I’m Oleksander Miroshnychenko, a machine learning engineer at GlobalLogic, Ukraine.
Simona Stolnicu: For the last year, we’ve been building computer vision solutions for one of the biggest e-commerce companies in the Netherlands, Wehkamp. In our presentation, we’ll start with a short description of Wehkamp, and we’ll also discuss the problem we are approaching. We’ll then take a deep dive into what our data looks like and how we prepared it, and into the machine learning models and our results. Feel free to type any questions you have in the chat, and we’ll make sure to answer them after the presentation. As I said, Wehkamp is one of the biggest e-commerce companies in the Netherlands, with more than 500,000 daily visitors on its website. Here on the slide, you can see some of the most important figures describing the company’s activity, from which I’d emphasize that 65% of customers shop on mobile, and 72% of customers are female.
There’s a wide variety of products on Wehkamp’s website, aiming to meet any customer’s needs. That’s why Wehkamp puts considerable effort into providing customers with a high-quality, accurate experience in visualizing the products, and why all the fashion images are processed in our local photo studio: the backgrounds of these images can vary greatly. One of the most important steps in the image processing pipeline is removing the background. In the past, this was done manually. It was very tedious and time-consuming work which, when you think of millions of images, can really slow down the dynamic flow of newly arrived products. That’s why our solution was to build a machine learning model that can dramatically decrease the time images spend in the local photo studio and also increase their quality. We have very large datasets containing pairs of images: as you see on the slide, on the left side is an example of an original image, and on the right side the corresponding manually labeled mask.
So our goal is to train a machine learning model that can distinguish between the foreground and the background of an image and, when given a new image at inference time, return a result we can use to remove the background. Before letting my colleague continue, I’d like to give you an overview of the pipeline we use for our image processing. We start with the raw data you saw on the previous slide and apply several dataset steps: cleaning, analyzing, and clustering. We also resize and transform the data to feed our network architecture, using the libraries listed above. For the next step, modeling and evaluation, we use PyTorch and employ Horovod, which lets us distribute our training. We track our experiments with MLflow, and the final best models are stored in S3 buckets. For all of this, we use Databricks clusters and notebooks. I hope this pipeline gives you a clear overview of what we are trying to achieve.
Oleksander Miroshnychenko: Okay, let’s start with the dataset. As source data, we had raw images. Well, not exactly raw: they had four channels, RGB plus a transparency layer. So we decided to split them and store the layers separately on DBFS. As input for our model, we use an image and a mask; you can see examples on the slide, with the mask on the right side and the raw image on the left. We then split the dataset into six categories to balance it: for five categories we used PCA plus the k-means algorithm, and one category was labeled manually. The training set has 25,000 images, and the validation set has 6,000. And what is our goal? We receive a raw image, we generate a label, a mask, for this image using the deep learning model, we apply this mask, and we want the resulting image to be of the highest quality, ready to go straight to the site.
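The channel split described here can be sketched with NumPy. This is a minimal illustration under the assumption that each image is an H×W×4 array, not the production code:

```python
import numpy as np

def split_rgba(rgba: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split an H x W x 4 RGBA array into the RGB image and the alpha mask."""
    assert rgba.ndim == 3 and rgba.shape[-1] == 4, "expected an H x W x 4 array"
    rgb = rgba[..., :3]   # color channels
    mask = rgba[..., 3]   # transparency layer doubles as the foreground mask
    return rgb, mask

# Toy example: a 2x2 image where one pixel is fully transparent (background)
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., :3] = 200       # light gray color everywhere
rgba[0, 1, 3] = 255       # opaque pixels belong to the product
rgba[1, :, 3] = 255
rgb, mask = split_rgba(rgba)
```

The two arrays can then be written to DBFS as separate image files, which is the storage layout the talk describes.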
In this example on the slide, the left side is the raw image and the right side is the image with the handmade mask applied. Okay, let’s go on to our data pipeline. The first step was an analysis of the data: we spent a couple of days on it, trying to understand what we had and what to do. After that, we did cleaning, deleting corrupted images with bad colors or bad masks. Then we clustered them into six categories: long pants; shorts; short-sleeved tops; dresses; long-sleeved tops; the rest went into a mixed category of beachwear, sportswear, accessories, and so on. The last one is a category of white products on white backgrounds, which unfortunately had to be labeled manually. We then decided to store resized images on DBFS at 320 by 320 pixels, which hugely decreased training time, roughly three to four times.
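The PCA-plus-k-means clustering step can be sketched with scikit-learn. The feature choice (flattened resized pixels) and the toy sizes below are assumptions for illustration; only the cluster-count idea comes from the talk:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_images(images: np.ndarray, n_clusters: int = 6,
                   n_components: int = 50, seed: int = 0) -> np.ndarray:
    """Flatten images, reduce dimensionality with PCA, then assign k-means labels."""
    flat = images.reshape(len(images), -1).astype(np.float32)
    n_components = min(n_components, *flat.shape)  # PCA can't exceed the data dims
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(flat)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(reduced)

# Toy example: 12 tiny random "images" grouped into 3 clusters
rng = np.random.default_rng(0)
toy = rng.random((12, 8, 8, 3))
labels = cluster_images(toy, n_clusters=3, n_components=5)
```

As the speakers note later, the assignments sometimes needed manual correction, so a step like this is a first pass rather than a final labeling.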
Some of the clusters were not large enough for training, so we decided to use augmentation. In our dataset we use original images, vertically flipped images, cropped images, and combinations of images (we discovered there are images with multiple items on them). Okay, a little bit about the model: we were inspired by the paper “U²-Net: Going Deeper with Nested U-Structure for Salient Object Detection”, written by researchers from the University of Alberta, Canada. On this slide you can see the architecture: it is a U-Net inside a U-Net. We trained the network using Horovod, which enables distributed training on GPUs with PyTorch. During each epoch, distinct batches are trained on multiple workers, and the results are merged by averaging the parameters between workers.
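The merge step just described, averaging parameters across workers, is what Horovod implements via an allreduce. A plain NumPy illustration of the averaging itself (the worker values below are made up):

```python
import numpy as np

def average_parameters(worker_params: list[dict]) -> dict:
    """Average each named parameter tensor across workers (an allreduce-by-mean)."""
    names = worker_params[0].keys()
    return {name: np.mean([w[name] for w in worker_params], axis=0)
            for name in names}

# Two workers holding different values of the same 2-element weight vector
w0 = {"conv.weight": np.array([1.0, 3.0])}
w1 = {"conv.weight": np.array([3.0, 5.0])}
merged = average_parameters([w0, w1])
# merged["conv.weight"] is [2.0, 4.0], the element-wise mean
```

In real Horovod training this happens inside `hvd.DistributedOptimizer` rather than as an explicit post-hoc step.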
About the model’s parameter settings: we used the Adam optimizer and a dynamic learning rate; the batch size is 10 multiplied by the number of GPUs, and the number of epochs is 30. We also used a batch-loading technique to avoid out-of-memory errors: after augmentation we have approximately 100,000 images, and it is impossible to load everything into main memory, so we load images batch by batch within each epoch. With this technique, training takes only nine hours with 20 workers using Horovod, each worker with its own GPU. As the evaluation metric we use Intersection over Union, and we also log losses, learning rates, and Intersection over Union scores on the validation dataset for each epoch, using TensorBoard.
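The batch-by-batch loading technique can be sketched as a generator that materializes only one batch at a time, so the roughly 100,000 images never sit in memory at once. The `load` callable and file names here are hypothetical stand-ins for the real image loader:

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(paths: Sequence[str], batch_size: int,
                 load: Callable[[str], object]) -> Iterator[List[object]]:
    """Yield loaded images batch by batch; only one batch is ever in RAM."""
    for start in range(0, len(paths), batch_size):
        yield [load(p) for p in paths[start:start + batch_size]]

# Toy usage: "loading" is just echoing the path
paths = [f"img_{i}.png" for i in range(7)]
batches = list(iter_batches(paths, batch_size=3, load=lambda p: p))
# 7 paths with batch_size=3 give batches of sizes 3, 3, 1
```

PyTorch’s `DataLoader` over a `Dataset` that opens files in `__getitem__` achieves the same effect; the generator above just makes the memory behavior explicit.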
I want to spend a little more time on the Intersection over Union metric. Consider this particular example of a skirt: we have the ground-truth label and the predicted label, and we want to compute the quality of the prediction. How can we do this? Let’s use Intersection over Union. The two masks consist of zeros and ones; we intersect the two images and then divide by their union. On this slide you can see examples with rectangles showing a good and an excellent Intersection over Union. In this case with the skirt, we get 99% quality, and as you can see, when we apply the mask to the raw image, the result is good.
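The metric just described follows directly from its definition; a minimal NumPy sketch over binary masks:

```python
import numpy as np

def iou(truth: np.ndarray, pred: np.ndarray) -> float:
    """Intersection over Union of two binary {0, 1} masks."""
    truth, pred = truth.astype(bool), pred.astype(bool)
    union = np.logical_or(truth, pred).sum()
    if union == 0:        # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(truth, pred).sum() / union)

# Tiny example: truth marks two pixels, prediction catches one of them
truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [0, 0]])
# iou(truth, pred) is 0.5: intersection of 1 pixel over a union of 2
```

The empty-mask convention is an assumption; how ties like that are handled in the real evaluation is not stated in the talk.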
Simona Stolnicu: So when averaging all these scores over the validation dataset, our model reaches a performance of around 99.4%, which is a really great score. We can look at these results from two perspectives. The first is the amount of data achieving a certain score. For example, as listed on the left side, a performance of 95% is reached on around 99.3% of the images, and 97% performance on 98.7% of the images. When you look at a higher performance, like 99%, the amount of images achieving that score is around 93%. It is important to look at the results from this perspective because it shows us how well our model does across the data.
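This first perspective, how much of the data clears a given score, amounts to a simple coverage computation. A minimal sketch (the scores below are made up, not the real validation numbers):

```python
from typing import Sequence

def coverage_at_threshold(scores: Sequence[float], threshold: float) -> float:
    """Fraction of images whose IoU score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical per-image IoU scores
scores = [0.99, 0.96, 0.94, 0.97]
share = coverage_at_threshold(scores, 0.95)
# 3 of the 4 scores clear 0.95, so share is 0.75
```

Sweeping the threshold over values like 0.95, 0.97, and 0.99 reproduces the kind of table shown on the slide.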
The second perspective is to look at the results per cluster. The first four clusters, long pants, shorts, short- or long-sleeved tops, and dresses, achieve the best scores, ranging between 99.4 and 99.6. A bit lower is the score for the beachwear, sportswear, and white-clothes clusters, around 98.8. That’s still a great score; it’s a bit lower because in these clusters the products present more difficult features. But these are just numbers, so let’s take a look at some examples from the validation datasets.
Here in the first column you can see the real image; in the second column, the resulting image after removing the background using the manually labeled mask. We should consider this the reference point, the ideal of how the resulting image should look. In the third column we have the resulting image after removing the background using the machine-learning-predicted mask; this is what we achieved. Looking at the score, we have 99.2 for this example. From our perspective this is a hard one because of the contrast in the image: we have white pants, and the background is really similar to the product, which makes it very difficult for the model. But we are still very happy with the result for this image.
Going on to the next product, we have a dark-colored product, but we can notice that the contrast of the colors in this image is very high, and that helps the model a lot; that’s why the score is really high, 99.5. One thing to notice here is that the edges of this product are not that sharp, not that regular, better said. But even in this case, the model does a great job.
For the next example, we have some blue pants. Again, the contrast is good, and the result is around 99.4. But maybe you’re asking why the score is not almost 100%, or even exactly that. That is because the pixels on the edges of the product do not always match the pixels of the manually labeled mask to which the prediction is compared, and that accounts for the score.
Let’s take a look at the next example. The first thing we see on this slide is the very low score, 80.1. But when we look at the machine-learning-predicted mask, the result looks practically perfect, right? So let’s take a look at the handmade mask: we can see there are parts of the image where the background was not correctly removed. That’s why, compared to the machine-learning-predicted mask, they are so different, and that’s why the score is so low. This is a very interesting case because it made us aware that we have some bad images in our validation dataset which should not be taken into consideration when computing the performance of the model. Another thing to notice about the real image is that it’s a bit strange: the back of the product probably shouldn’t be visible on the mannequin, right? I think this suggests that our model is also able to work very well on not-so-natural images.
All right, let’s see the next example. Here we have very high contrast between the background and the product, which gives us a very good score, 99.6. Maybe the hardest part when predicting the mask for this image is the inner spaces between the product and the background, but in this case too the model performs really well. I’ll let my colleague continue with some other examples which were maybe not that good.
Oleksander Miroshnychenko: Yep. Let’s start with this example: a white background and a white product. I think it’s hard to distinguish the foreground from the background even for a human. As we can see, the machine learning model does not perform perfectly on this image; you can see some problems at the waist and with the hand, and that’s why we have a relatively low score, 97.8%. So, not the best case.
Let’s continue; this example is much better. Let me note that this image was pre-processed: little beads are highlighted at the bottom, which you can see between the shorts and the sleeve. The handmade mask is good, and the machine learning mask is also good, so the quality is 99.8%. A good case.
This case is not good: we have a bad handmade mask. We have a white background, and despite this, the model works well; the output is mostly perfect apart from the bow at the bottom. Yet the quality score is only 81%, so it’s probably better to delete this example from the dataset.
Next, one more example. Again we have a problem with images of a white product on a white background, because of the low contrast. You can see that the model’s output is not good for us; note these sharp edges on the mask. So the quality is only 98.1%.
And one more good example, also slightly pre-processed, with some beads on the sleeves. The quality is high because of the contrast: 99.7%. Okay, let’s consider the next example. It’s an interesting one because we have a pattern and a complicated shape at the bottom. Despite this, we have 99.5% quality, and I’m not sure a human could easily find the differences between these two masks.
And the last example for today: maternity jeans. The black part of the garment is essential, and we have a bad handmade mask. Despite this, the model gave us very good quality. Of course, the score isn’t huge, but the handmade mask wasn’t done well. Let me also say that we specially chose the bad examples on these slides; on average, the model’s results are very good, mostly near perfect, but on some images it couldn’t perform well.
Let me talk about the challenges. During this project, we had some problems. The first was dataset cleaning and clustering: we spent a lot of time on both. Clustering was partially done manually, because we needed to find white clothes on white backgrounds and also to move some images from one cluster to another when the PCA-plus-k-means algorithm didn’t perform well. Then we spent some time finding a suitable network architecture. Our previous trials were Mask R-CNN, with an average quality of 93.7%, and BASNet, with an average quality of 94.4%. U²-Net reached 95.5%, so we decided to choose it and improve on it. We also spent a lot of time on parallelization with Horovod on GPUs: the main problems were out-of-memory errors and, when computing additional metrics, averaging and combining the results from each cluster and each worker.
After that, we had problems finding appropriate augmentation to address the model’s weaknesses. For example, for white clothes we tried histogram equalization, changing the contrast of the images, or swapping the RGB channels among each other, to increase the difference between foreground and background. Unfortunately, these techniques couldn’t help us. The next, huge, problem is predicting a mask when the garment has the same shade as the background. For now, this problem is only partially solved, and we are thinking about training one more model for this type of data and including it in our pipeline. The last thing to talk about is mask verification in production. For now, we have a choice: create a model, a neural network, to verify masks in production, or insert a human in the loop. Of course, we want to automate this process. That’s it; thank you for your attention, and let’s go on to the Q&A section.
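The histogram equalization we tried for low-contrast images can be sketched as follows. This is a grayscale NumPy version for illustration only; the actual experiments were on color fashion images and, as noted above, did not end up helping:

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Spread a uint8 grayscale image's intensities via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each original level through the normalized CDF to stretch contrast
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255).astype(np.uint8)
    return lut[gray]

# A low-contrast image (values 100..103) gets stretched toward the full 0..255 range
low_contrast = np.array([[100, 101], [102, 103]], dtype=np.uint8)
stretched = equalize_histogram(low_contrast)
```

For a white product on a white background, the stretched values still separate product from background only weakly, which is consistent with the technique not helping in this case.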