Extracting text of various sizes, shapes and orientations from images containing multiple objects is an important problem in many contexts, especially in connection with e-commerce, augmented-reality assistance systems in natural scenes, content moderation on social media platforms, etc. At the scale at which Walmart operates, the text in a product image can be a richer and more accurate source of data than human inputs, and it can be used in several applications such as Attribute Extraction, Offensive Text Classification, compliance use cases, etc. Accurately extracting text from product images is a challenge because product images come with a lot of variation: small, highly oriented, arbitrarily shaped text with fancy fonts, and so on. Typical word-level text detectors fail to detect these variations, and even when the regions are detected, text recognition models without any transformation layers fail to recognize and accurately extract highly oriented or arbitrarily shaped text.
In this talk, I will cover a text detection technique that detects character locations in an image and then combines characters that are close to each other into words, based on an affinity score which is also predicted by the network. Since the model works at the character level, it can detect text in any orientation. After this, I will talk about the need for Spatial Transformer Networks, which normalize the input cropped region containing text so that it can be fed to a CRNN-CTC network or a CNN-LSTM-Attention architecture to accurately extract highly oriented or arbitrarily shaped text.
Speaker: Rajesh Shreedhar Bhat
– Hi everyone, I'm Rajesh Shreedhar Bhat, and I'm working as a Senior Data Scientist at Walmart Global Tech India. Today I'll be covering Detecting and Recognizing Arbitrary Shaped Text from Product Images. This is how the agenda looks: initially, I'll give an overview of text extraction. Later, I'll talk about text detection techniques, and then go to the training data preparation for the text recognition task. After that, I'll talk about different models for text recognition: to start with, the CRNN-CTC model, then an attention-based model, and then Spatial Transformer Networks, which are basically for improving text recognition accuracy when there is arbitrarily shaped or curved text. Then I'll talk about model accuracies in different settings on different data sets, and finally we'll talk about training and deploying the models. Yeah, so in order to extract the text, we basically need to know where exactly the text is present on the product image. That is nothing but the text detection task. After the text detection task, we have the bounding boxes coming from the text detection model, and once we know where exactly the text is present in an image, we can crop those regions and send the cropped regions to the text recognition task. After the text recognition task, we have the raw text that was present on the product image. You can see the example below as well: given a product image, text detection is done, and we know where exactly the text is present on the image; you can see the bounding boxes there. Once we have the bounding boxes, we crop those regions and send them to the text recognition model, and finally we have the raw text coming out of the text recognition model. Now I'll briefly talk about the use cases we're trying to solve at Walmart. One of the use cases is checking whether there is any offensive content in the product image or not.
We definitely don't want to show those kinds of images on the .com site, right? Another use case: sometimes it so happens that the product catalog is not clean and a few of the attributes go missing. In that situation, we extract the text from product images and then do attribute extraction on top of it. An attribute could be something like brand or ingredients. So those are a couple of use cases that are specific to retail and e-commerce. Apart from that, there are multiple use cases where one can use text extraction. One example is social media content monitoring. If you consider Facebook or any of the social media platforms, there's so much content being uploaded on a day-to-day basis, and manually monitoring all of it is a very difficult task. So we can have text extraction in place: whatever image is uploaded, we try to see what content is present in it. Is it offensive? Is it against something? We can extract that information automatically and try to automate this task. So those are a couple of use cases for text extraction. Now, I'll briefly talk about the text detection task. For text detection, I'm referring to a paper published at CVPR 2019 from Clova AI: Youngmin Baek et al. came up with a paper called Character Region Awareness for Text Detection. Here, text detection is achieved at a character level instead of treating the task at a word level, unlike many of the text detection models that are readily available. This one is slightly different since it achieves the text detection task at a character level and then combines that information to form bounding boxes at the word level. So it's a segmentation task, and for segmentation the well-known architecture is U-Net.
That's pretty famous in medical image segmentation tasks. Here also, for text detection, the architecture looks pretty similar. We have VGG16, the batch-normalized version of VGG16, as the backbone in the U-Net architecture, with a few upsampling blocks and, of course, skip connections. As you can see, given an input image, the outputs from the model are the region score and the affinity score. The region score mainly tells us where exactly a character is present in the image, and the affinity score is mainly for grouping characters: given two characters, do they belong to the same word or not? Basically, these are nothing but the masks coming out of the segmentation model. In any typical segmentation task, we have an input image and we try to come up with the appropriate mask for it, i.e., we get the mask of the object in focus. Here the object in focus is nothing but the text. Since it's a segmentation task, we need masks in the training phase as well: we need the image and the corresponding mask for it. And since the text detection model outputs a region score and an affinity score, we need masks for both of these. As training data, we have character boxes, as you can see in this image: we have the character boxes for the word PEACE, and for each of the characters, we have the detection boxes. That is the ground truth available to us. This is from an open-source data set called SynthText. Once we have these character boxes, we generate affinity boxes and finally arrive at the Region Score Ground Truth and Affinity Score Ground Truth using the affinity boxes and character boxes.
Now, we will see how we get the affinity boxes from the character boxes. Take the boxes of two consecutive characters, say P and E, and join the diagonals of each box. After joining the diagonals for both of these character boxes, we get upper and lower triangles in each box. Then find the centroids of these triangles and connect the centroids: that is nothing but the affinity box. Affinity boxes tell us whether two characters are part of the same word or not. If you look at this image with the word PEACE, there's also the word WHY written as W-H-Y, and there is no affinity box between the last letter E and the initial letter W, because they are part of two different words. Once we have the affinity boxes and character boxes, we take a 2D isotropic Gaussian, and for each character box or affinity box we warp this 2D isotropic Gaussian into that particular box. That is, we apply a perspective transform and arrive at a transformed 2D version. This transformed 2D Gaussian is nothing but the mask for each of the characters, and also for the affinity between the characters. So this is the input image; ignore the boxes for now. PEACE is nothing but the input image for us, and the ground truths for this image are the Region Score Ground Truth and the Affinity Score Ground Truth. This is the input, and these two are the outputs; these are basically the masks. Typically in segmentation tasks we have a binary mask, but in this case we have a continuous 2D version as the mask. And since the mask is continuous, the loss function, as mentioned in the paper, is mean squared error. You can refer to the paper, I've given the reference here, and get more details offline.
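The affinity-box construction described above can be sketched in a few lines of NumPy. This is a simplified illustration of the CRAFT scheme, not the authors' code: each character box is split by its diagonals into upper and lower triangles, and the affinity box connects the triangle centroids of two adjacent character boxes. The 2D isotropic Gaussian that gets warped into each box is also shown (the perspective warp itself, done with OpenCV in practice, is omitted; `sigma` is an illustrative choice):

```python
import numpy as np

def triangle_centroids(box):
    """box: (4, 2) array of corners [tl, tr, br, bl].
    Join the diagonals, then return the centroids of the
    upper (tl, tr, center) and lower (bl, br, center) triangles."""
    tl, tr, br, bl = box
    center = box.mean(axis=0)
    upper = (tl + tr + center) / 3.0
    lower = (bl + br + center) / 3.0
    return upper, lower

def affinity_box(box_a, box_b):
    """Affinity box between two consecutive character boxes:
    the quad through the upper/lower triangle centroids of each."""
    up_a, lo_a = triangle_centroids(box_a)
    up_b, lo_b = triangle_centroids(box_b)
    return np.stack([up_a, up_b, lo_b, lo_a])

def isotropic_gaussian(size=64, sigma=0.35):
    """2D isotropic Gaussian heatmap on [-1, 1]^2; this is what gets
    perspective-warped into each character/affinity box to build the
    continuous region- and affinity-score ground-truth masks."""
    xs = np.linspace(-1, 1, size)
    xx, yy = np.meshgrid(xs, xs)
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
```

For two adjacent axis-aligned squares the resulting affinity quad sits between them, overlapping both characters, which is exactly what lets adjacent-character regions merge into one word blob later.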
But I hope you got an overview of how text detection is achieved. This is a sample output taken from the paper itself. Given the input image at the top, you can see the region score and affinity score corresponding to it. If you take the example of SEAFOOD and SHACK, there is no affinity score between the last letter D and the first letter S, because these two letters are part of different words; you can see a gap there, with no mask. Once we have the region score and affinity score, we can combine the two scores and then use the connected components and minimum area rectangle functionalities from OpenCV to finally arrive at a bounding box. As you can see at the top, we have a bounding box for each of the words. That was published in the paper itself; I have taken the same. Now we'll see how the results look on product images. I have taken two sample products here. As you can see, the character scores are present, and using the affinity scores as well, we finally arrive at a bounding box for each of the words in the product image. The one interesting thing I would point out is that there is "Kraft" written on the lid of this product. Words like "sandwich" and "spread" are in a regular fashion and easy to read for human eyes, but the word on the lid is written in a slightly slanted fashion, in a different perspective, and we are still able to detect it. Now the challenge is: okay, we are detecting it, but how do we recognize it?
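The post-processing step, combining the two score maps and extracting word boxes, can be sketched as follows. The talk uses OpenCV's `connectedComponents` and `minAreaRect`; this is a dependency-free stand-in that labels components with a BFS and returns axis-aligned boxes instead of rotated rectangles (the threshold value is an illustrative choice):

```python
import numpy as np
from collections import deque

def word_boxes(region_score, affinity_score, thresh=0.5):
    """Combine region and affinity score maps, binarize, label
    connected components (4-connectivity BFS), and return one
    axis-aligned box (x_min, y_min, x_max, y_max) per component."""
    combined = np.maximum(region_score, affinity_score)
    mask = combined >= thresh
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            # BFS over one connected blob of text pixels
            q = deque([(sy, sx)])
            seen[sy, sx] = True
            ys, xs = [], []
            while q:
                y, x = q.popleft()
                ys.append(y)
                xs.append(x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Because the affinity score fills the gaps between characters of the same word, each connected component of the combined map corresponds to one word, which is why this grouping works at all.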
We'll see through the slides how this can be achieved using Spatial Transformer Networks. So the next task, once we have the bounding boxes, is text recognition, and we'll see in detail how the text recognition task is achieved. Before diving deep into the techniques, I would like to focus on the data generation part, the training data generation for the text recognition task. Basically, we are using the product titles and descriptions and synthetically creating images from that text. We are using a library called SynthText: given the plain text, a description or a product title, we sample unigrams and bigrams, i.e., we take one word at a time or two words at a time, and synthetically create an image out of it. We created around 15 million images using the product titles and descriptions, and a lot of variation was included in the data set during the text image generation. The variations include changing font styles, changing font sizes, different colors, different backgrounds; a lot of variation went into the training set, because in the product images also the text is not straightforward, we have a lot of variation in the product images themselves. In order to mimic that, we included those variations in the data generation process. Finally, we had 92 characters in the vocabulary, which includes capital and small letters, numbers and special symbols. We included special symbols because a lot of special symbols show up in the product image kind of setting. The kind of data that is generated depends entirely on the task at hand. In this case, it was extracting text from product images, so we took titles and descriptions and synthetically created the data.
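The sampling step for this synthetic training data can be sketched like this. The actual image rendering is done with the SynthText library; this hypothetical helper only shows the unigram/bigram sampling from a product title that feeds the renderer:

```python
import random

def sample_ngrams(title, n_samples, seed=None):
    """Sample unigrams (one word) and bigrams (two adjacent words)
    from a product title; each sampled string would then be rendered
    into a synthetic image with varied fonts, sizes, colors and
    backgrounds to mimic real product-image text."""
    rng = random.Random(seed)
    words = title.split()
    unigrams = words
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    pool = unigrams + bigrams
    return [rng.choice(pool) for _ in range(n_samples)]
```

Sampling short n-grams rather than whole titles keeps each rendered image close to what a single detected bounding box will contain at inference time.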
Let's say instead we were trying to solve captcha recognition; then the type of text that goes into creating the training set would be totally different. It could be random letters followed by a few numbers, with lines struck through them or something. So the kind of data set we create totally depends on the task at hand. Now I'll talk about the different techniques. To start with, we developed the CRNN-CTC model for text recognition. I will not talk in detail about this, because it was already covered in my previous talk, which happened a few months back at the Spark AI Summit. It's already available on YouTube; if you search YouTube for "Text Extraction Spark AI Summit", you should be able to see the detailed explanation of CRNN-CTC networks. Also, a blog is published on Weights and Biases; there is a link to the blog, and you can also just scan this QR code to go to the blog and read it offline. So we started with the CRNN-CTC model, but then we saw that it was not performing well if the image was a little blurry: we were not getting good accuracies on blurred images and so on. That's why we went ahead with an attention mechanism in place. Basically, the image is passed to a CNN model, and from the CNN model we get visual features. These visual features are fed to the LSTM encoder, and from the LSTM encoder block we get hidden states at every time step, which are used in the attention mechanism: attention weights are assigned to these LSTM hidden states from the decoder side, and finally the attention scores are utilized to focus more on a certain part of the image while the decoding is done in the LSTM decoder block. The intuition here is: let's say we are trying to extract the letter F, and we have the word "Forum" over here.
So the image contains the word "Forum", and let's say we are extracting the letter F. The focus should be mainly on the letter F when extracting the letter F; we need not focus on the entire image. That is the intuition, and the attention mechanism takes care of it: while extracting F, the weightage would be higher near the region where F is actually present, and similarly for the other characters. This is done in a generative fashion, compared to the CRNN-CTC model, where it was done in a discriminative fashion. This is an encoder-decoder framework, and the predictions are made in a generative fashion. One can also use beam search here to make the predictions better. Basically, in the decoder block, for every time step, what we have is softmax probabilities over the vocabulary. As I said earlier, we had 92 characters in the vocabulary, so for every time step we get the softmax predictions over the vocabulary, and using that we can conclude what letter we are getting at that particular time step, and then we can get the full prediction out of it. Since this is done in a generative fashion, we can go ahead and use cross-entropy as the loss function. Later, we saw product images with curved text: we were not able to do a pretty decent job of extraction if the text had arbitrary shapes or was curved. Here the use case was offensive text classification on T-shirt images, to detect offensive text, and sometimes it happens that the text is in a curved form. If you look at the left image, you can see the brand Happilo written in a curved fashion, and we were facing some challenges in recognizing it with the attention or CRNN-CTC networks.
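The per-time-step decoding described here can be sketched as a greedy decode over the decoder's softmax outputs (a beam search would keep the k best prefixes instead of just the argmax). The vocabulary and end-of-sequence handling below are illustrative assumptions, not the exact training setup:

```python
import numpy as np

def greedy_decode(logits, vocab, eos="<eos>"):
    """logits: (T, V) raw decoder scores for T time steps over a
    V-sized vocabulary. Take the softmax argmax at each step and
    stop at the end-of-sequence token."""
    # numerically stable softmax; the argmax is unchanged by softmax,
    # but the probabilities are what a beam search would accumulate
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    out = []
    for step in probs:
        token = vocab[int(step.argmax())]
        if token == eos:
            break
        out.append(token)
    return "".join(out)
```

With cross-entropy training, each time step is just a classification over the 92-character vocabulary, which is why this decode loop is all that is needed at inference time in the greedy case.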
We were able to do pretty decently on the detection task, which was achieved through character-level segmentation, but on the recognition side the model was not doing that great. So we went ahead with Spatial Transformer Networks. A Spatial Transformer Network (STN) is nothing but a learnable module whose aim is to increase the spatial invariance of convolutional neural networks in a computationally and parameter-efficient manner. Let's say we have a CNN-LSTM attention-based model; we can plug this STN module in just before the CNN model. These are learnable modules, and basically they try to learn affine transformations. As you can see in the image here, the input image is a curved version: the word "MOON" was previously curved, and after the spatial transformer network has been applied, we have a rectified version of the word "MOON". Once we have the rectified or normalized image, the rest of the process remains the same: we can feed it through a CNN-LSTM attention-based model or a CRNN-CTC based model. So basically, spatial transformer networks help in transforming curved or arbitrarily shaped text images into a normalized version; that is their main advantage. Here I have put down the accuracies of different models on different data sets. Up to ICDAR13, the data sets contain regular-shaped text only; there is not much variation in the test data. But from ICDAR15 onward, highlighted in the last 4 rows, the data sets contain samples with a lot of variation, basically arbitrarily shaped or curved text. When I say accuracy here: let's say the ground truth is "Hello" and the prediction is "Hello".
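An STN has three parts: a localization network that predicts transformation parameters, a grid generator, and a differentiable sampler. The grid-generator-plus-bilinear-sampler half can be sketched in plain NumPy; in a real model this would be `affine_grid`/`grid_sample` in PyTorch, with the 2x3 matrix `theta` predicted by a small localization CNN rather than supplied by hand:

```python
import numpy as np

def affine_grid_sample(img, theta):
    """Warp a single-channel image (H, W) with a 2x3 affine matrix
    `theta` acting on normalized coordinates in [-1, 1], the way an
    STN grid generator + bilinear sampler would."""
    H, W = img.shape
    # target grid in normalized coordinates
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    src = coords @ theta.T                                  # (H, W, 2)
    # map source coordinates back to pixel indices
    sx = (src[..., 0] + 1) * (W - 1) / 2
    sy = (src[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(sx).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(sy).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx = sx - np.floor(sx)
    wy = sy - np.floor(sy)
    # bilinear interpolation keeps the whole module differentiable
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy
            + img[y1, x1] * wx * wy)
```

Because the sampler is differentiable, gradients from the downstream recognition loss flow back into the localization network, so the rectifying transform is learned without any extra supervision.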
If there is an exact match, the score is 1; even if there is a mismatch in a single character, the score is treated as 0. Now, on regular-shaped text, with the CRNN-CTC model on, say, the IIIT 5K data set, the accuracy was 81.6%, and the LSTM attention-based model also gave pretty similar accuracy. But if you go to the ICDAR15 or CUTE data sets, you can see that the CRNN-CTC or CNN-LSTM models without a Spatial Transformer had much lower accuracies, around 65 to 66%. With Spatial Transformer Networks, the accuracy increased by around 20%, to around 85%. The same holds for the ICDAR15 and SVT-Perspective data sets, so you can see that spatial transformers have definitely helped us in recognizing arbitrarily shaped text, and that is clearly evident in the accuracies presented here. Finally, coming to the training and deployment part. As I said earlier, we created 15 million images, and if we loaded everything into memory, that would come to around 690 GB, given that the image size is 128 x 32 x 3 (the 3 meaning it's a color image). So we use data loaders from PyTorch: basically, generators are used, and instead of loading everything into memory, only a single batch is loaded into memory at a time, and the training is done accordingly. The training was done on V100 GPUs, 4 GPUs in parallel. Once the model was trained, we deployed these models on a machine learning platform internal to Walmart. Initially, we had separate deployments for the text detection and recognition tasks, but then we saw that separate API calls needed to be made to extract the final text.
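Two small things from this section can be made concrete: the exact-match accuracy metric, and the back-of-the-envelope memory estimate that motivates batch-wise loading. The float32-pixel assumption below is mine; it is what makes 15 million images of 128 x 32 x 3 land near the quoted ~690 GB:

```python
def exact_match_accuracy(preds, truths):
    """Score 1 only for an exact string match; even a single wrong
    character scores 0, per the evaluation described in the talk."""
    assert len(preds) == len(truths)
    hits = sum(p == t for p, t in zip(preds, truths))
    return hits / len(preds)

def dataset_bytes(n_images, h=32, w=128, channels=3, bytes_per_value=4):
    """Naive in-memory footprint if every image were loaded at once
    (bytes_per_value=4 assumes float32 pixels)."""
    return n_images * h * w * channels * bytes_per_value

# 15M images at 128 x 32 x 3 float32 is roughly 687 GiB,
# hence batch-wise loading with PyTorch DataLoaders
footprint_gib = dataset_bytes(15_000_000) / 2 ** 30
```

A DataLoader sidesteps this entirely by materializing only one batch of images in memory per training step.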
Having 2 API calls is wasteful: it means sending the image twice over the network, and the network latency counts in the overall extraction time. That's why we clubbed both the text detection and recognition models into a single deployment, on V100 GPUs, and the prediction time for each image is roughly around 0.45 seconds. So yeah, that was the training and deployment part. This is the team behind the project: myself, Rajesh Bhat, and my colleagues Pranay, Anirban and Vijay, who are part of this project. I hope you all got an idea of how to tackle the text detection and recognition problem when texts of arbitrary shapes are involved. Given the limited time, I tried to give an overview of the different techniques; I hope that was useful. The sample code for the content of the talk and the PPT are available at the GitHub link here; you can refer to the GitHub link or scan the QR code. And if you have any questions, please go ahead.
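The single-deployment idea, one network call that runs detection, cropping, and recognition server-side, can be sketched as a thin wrapper. `detect` and `recognize` here stand in for the two models and are hypothetical callables, not Walmart's actual API:

```python
def extract_text(image, detect, recognize):
    """One entry point for the combined deployment: run detection
    once, crop each detected box, and recognize every crop, so the
    client sends the image over the network only once."""
    results = []
    for (x0, y0, x1, y1) in detect(image):
        # image as rows of pixels; slice out the detected region
        crop = [row[x0:x1 + 1] for row in image[y0:y1 + 1]]
        results.append(recognize(crop))
    return results
```

Compared with two separate services, this removes one image upload and one round trip per product image, which is where the latency saving comes from.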
Walmart Global Tech India
"Rajesh Shreedhar Bhat is working as a Sr. Data Scientist at Walmart, Bangalore. His work is primarily focused on building reusable machine/deep learning solutions that can be used across various business domains at Walmart. He completed his Bachelor’s degree from PESIT, Bangalore. He has a couple of research publications in the field of NLP and vision, which are published at top-tier conferences such as CoNLL, ASONAM, etc. He is a Kaggle Expert(World Rank 966/122431) with 3 silver and 2 bronze medals and has been a regular speaker at various International and National conferences which includes Data & Spark AI Summit, ODSC, Seoul & Silicon Valley AI Summit, Kaggle Days Meetups, Data Hack Summit, etc. Apart from this, Rajesh is a mentor for Udacity Deep learning & Data Scientist Nanodegree for the past 3 years and has conducted ML & DL workshops in GE Healthcare, IIIT Kancheepuram, and many other places. "