A large Delta Lake frequently includes a mix of structured and unstructured data. Data teams use Apache Spark™ to analyze structured data, but often struggle to apply the same analysis to unstructured, unlabeled data (e.g. images, video). Teams are forced to use expensive, manual processes to transform unstructured data into something more useful: they either pay a third party to label their data, buy a labeled dataset, or narrow the scope of their project to leverage public datasets. If data teams had faster and more cost-effective ways to convert unstructured data into structured data, they could support more advanced use cases built around their companies’ unique, unstructured datasets.
In this talk, we demonstrate how teams can easily prepare unstructured data for AI and analytics in Databricks. We leverage the LabelSpark library (a connector between Databricks and Labelbox) to connect an unstructured dataset to Labelbox, programmatically set up an ontology for labeling, and return the labeled dataset in a Spark DataFrame. Labeling can be done by humans, AI models in Databricks, or a combination of both. We will also show a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels. This process can reduce the amount of unstructured data you need to achieve strong model performance.
Labelbox is a training data platform that allows companies to quickly produce structured data from unstructured data. Combining Databricks and Labelbox gives you an end-to-end environment for unstructured data workflows: a query engine built around Delta Lake, fast annotation tools, and a powerful machine learning compute environment.
To learn more, visit www.labelbox.com/databricks-partner
Nick Lee: Hi everyone. Thank you for joining Productionizing Unstructured Data for AI and Analytics, here at the Data and AI Summit 2021. My name is Nick Lee and I’m an Integration Lead for the LabelSpark project at Labelbox.
Chris Amata: My name’s Chris Amata. I’m a Solutions Engineer and Lead Developer for LabelSpark.
Nick Lee: And today we’re really excited to show you how we help companies take unstructured data and productionize that for machine learning and analytical workflows at scale.
Nick Lee: Many businesses possess a vast treasure trove of data, except this data is unstructured. It looks kind of like this. A trove of images, video and text. All of it about important things going on in the business. The problem is that when you take this data in its current state, unless the data is properly formatted so that the algorithm can understand it, the only thing you get is a confused algorithm. And companies are productionizing this unstructured data for AI and analytics so that they can derive valuable insight from that data and train AI and ML to provide recommendations and predictions based on that data. The benefits are huge. If we can tackle this problem, we can build security cameras that recognize crime or create software that can help doctors identify cancer or other problems in a medical scan. We can reduce manufacturing defects by catching product flaws as they move across the conveyor belt at thousands of products per minute. So this problem, if it can be solved at scale and efficiently, has the potential to really change a lot of things in today’s society.
Nick Lee: Now at Labelbox, we’ve worked with a lot of customers that have done this very thing and they all follow a similar pattern. They take their unstructured data sitting in the cloud, maybe that’s AWS or GCP or Azure, and within their Lakehouse environment, their data team will take that unstructured data and pass it to a training data platform where a team of labelers and subject matter experts can add structure and enrich the data with annotations. From there, the data comes back into the Lakehouse and the team is able to run it through machine learning or pass it along to a team of analysts for insight and analysis. And the more advanced companies are able to take their ML algorithms and apply them to new data as it comes in, pre-labeling data going into the training data platform, as well as giving human labelers the opportunity to audit the algorithm and see how it is labeling data.
Nick Lee: So if a human can inspect the algorithm’s output, they can correct it and you can retrain your algorithm on it. So you get this virtuous cycle there. And today, we’re going to show you how this all looks in action with Databricks and Labelbox. So without further ado, let’s hop into Databricks and see how this works.
Nick Lee: So I’ve logged into my Databricks environment and for this demo, I’ve loaded a table called Unstructured Data. It’s a very simple dataset. I’ve only got 20 rows here but I just want to give you an idea of what it is. It’s a bunch of photos from cameras on a street. And if we take a look at one of the example images. There you go. It’s just photos of people out and about in an urban environment. And today, I’m trying to build a model that can recognize the people in the frame as well as umbrellas and cars, if there are any in the image.
Nick Lee: This is just an example. Your use case could be in manufacturing or medicine, but let’s proceed and see how we do. So right now, the data is unstructured and it’s in a Spark table. So what I’m going to do is create a Labelbox client. And with the LabelSpark library, I can pass my unstructured-data Spark table right to Labelbox. So I’m going to call that command. And with just a few lines of code, I have already registered that dataset in Labelbox. Now I want to point out that the data has not actually been uploaded to Labelbox. It’s still sitting on the data lake. In fact, Labelbox doesn’t need to have a local copy of the data. It can just reference your information right there on the data lake. No need to move it.
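The registration step Nick describes, pointing Labelbox at assets that stay in the data lake, boils down to sending a list of data row payloads whose `row_data` is just a URL reference. This is a minimal sketch; the file names, URLs, and storage paths below are hypothetical placeholders, and the payload shape is an approximation of what Labelbox ingests:

```python
# Sketch: turn rows of (file name, cloud storage URL) from an unstructured-data
# table into reference payloads. No file bytes are uploaded anywhere;
# "row_data" simply points at the asset where it already lives.
# These rows are hypothetical stand-ins for a Spark table's contents.
unstructured_rows = [
    ("StreetView1.jpeg", "https://storage.example.com/street/StreetView1.jpeg"),
    ("StreetView2.jpeg", "https://storage.example.com/street/StreetView2.jpeg"),
]

def to_data_row_payloads(rows):
    """Map (external_id, url) pairs to data-row dicts referencing lake assets."""
    return [{"external_id": name, "row_data": url} for name, url in rows]

payloads = to_data_row_payloads(unstructured_rows)
print(payloads[0])
# {'external_id': 'StreetView1.jpeg', 'row_data': 'https://storage.example.com/street/StreetView1.jpeg'}
```

Because only references move, the data lake remains the single source of truth for the raw assets.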
Nick Lee: So within Labelbox, we have an ontology builder and this allows you to programmatically set up the ontology for your unstructured dataset. An ontology is just a set of questions that you’re going to ask the labeler about this image or video or text. So in this case, I’ve added a bounding box tool that I want the labeler to use to select all the people in the frame. I’ve included a segmentation mask tool, so people will draw over all of the cars and identify the cars that way, and I’ve added a polygon tool to select all of the umbrellas.
Nick Lee: In addition, I’ve included a couple of classification questions. I would like the labeler to identify the weather in the image as well as the time of day, if possible. So just calling ontology.add_tool and add_classification, I can create this ontology and attach it to my project. And now the dataset and the ontology have been programmatically set up in Labelbox and I can call up my friend, Chris, to help me label this data and get me back my labeled dataset.
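The ontology Nick sets up can be expressed as plain JSON before it is handed to an ontology builder. The sketch below mirrors the three object tools and two classification questions from the demo, but the exact key names are an approximation of Labelbox's ontology format, not its authoritative schema:

```python
# Sketch of the demo ontology as a plain dict: three object tools plus two
# classification questions. Key and option names here are illustrative
# approximations, not Labelbox's exact normalized-ontology schema.
ontology = {
    "tools": [
        {"tool": "rectangle",  "name": "person"},    # bounding box for people
        {"tool": "superpixel", "name": "car"},       # segmentation mask for cars
        {"tool": "polygon",    "name": "umbrella"},  # polygon for umbrellas
    ],
    "classifications": [
        {"type": "radio", "instructions": "weather",
         "options": ["sunny", "overcast", "rainy", "unknown"]},
        {"type": "radio", "instructions": "time_of_day",
         "options": ["day", "night", "unknown"]},
    ],
}

print([t["name"] for t in ontology["tools"]])  # ['person', 'car', 'umbrella']
```

Keeping the ontology as data like this is what makes the setup programmatic: the same dict can be versioned, reviewed, and reattached to new projects.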
Nick Lee: So Chris, can I get some help?
Chris: So I’m first going to go in and label this umbrella with the polygon tool. So I’m going to zoom into the image a little bit and start pointing and clicking. And here, I’m going to just be able to come in and label this umbrella here. And we can see that this UI is really made so that labeling teams can work quickly and iteratively. I’ll finish this umbrella, zoom out a little bit, grab a bounding box and start labeling these people. And I’ve now labeled both of my people. And then I’ll go in. And the last thing I want to label in the objects is the cars, so I want to start segmenting this image. So I’ll grab my segmentation tool and I’ll come in and I’ll zoom into these cars. And I want to start really splitting up the image.
Chris: And this is going to take a lot of time. So I can use this as a free hand tool. I can also point and click just like the polygon, but here, I’m going to use our super pixel tool real quick. And I want to start by just labeling these couple of cars back here. And it’s not going to be perfect right away but I’m going to go in and use the eraser and quickly point and click around everything so I can capture all of my masks. So I’m quickly and really easily drawing segmentation masks around all these cars. And anything that I missed, I can come in, zoom in and fill it right in with our pen tool. I can point and click, capture everything. And once I’ve labeled everything that I want to in this image, I can come in and start classifying it so we can look at the weather and we can tell that this is a little rainy because they have the umbrella. There’s some water on the ground.
Chris: And maybe I was confused if this was rain or overcast, I could leave a note for the reviewer and say, “Hey, Nick, is this raining? I couldn’t tell.” And Nick, being the rain subject matter expert, is going to be able to come in and do this review and see if this is rain or overcast so we can make sure it’s structured right. Time of day, this is going to be unknown here. And with Labelbox, once you create the project, the queue will distribute work to all the different labelers. So once I submit this image, I’ll be sent a brand new one right away and I can go in and do the same thing we just saw. I can draw the bounding boxes around the people. I can draw my segmentation masks around some of these cars in the back. I can classify it and do all of this different stuff with the imagery.
Chris: Also, if we’re thinking about all different data types that we want to bring in to put some structure to, we can look at different videos. And in the way of unstructured data, we really want to look at how things are going to move in the frames and how they’re going to be able to adjust and change throughout the video. So within LabelBox, we can go in and we can interpolate over all of these frames. So I’ve drawn a couple of bounding boxes on a few jellyfish here… And I am a little fast. So what we can do is we can control the speed. I’ll bring it back down to normal time. And we can see all of these jellyfish are going to move and they’re going to move fluidly. These bounding boxes are chasing them. And right here, we can see that this one was maybe not perfect through the interpolation. So I’ll come in and make sure it moves a little.
Chris: And then we also know that there’s a lot of unstructured text floating around the world. We’re going to look at some vacation chatbot texts because I know it’s almost summer and things are up and up and I want to go on vacation soon. So we want to look at named entities, where certain things are, and then we want to also start classifying this text. This can be sentiment. This can be vacation types and just anything that we want to add context to in our text chats and our messages and our statements and our PDFs, we can do here on the platform.
Chris: So here, I want to go find the location in Bermuda. I definitely want to go to Bermuda. So go in and highlight this. We can find the date. It’s January 1st and I’ll keep using this named entity recognition and labeling the text. Duration, a week. Once we have some structured data, Nick, things will be moving. We might be able to take that vacation for a week.
Nick Lee: All right. Thanks, Chris.
Nick Lee: So now that my dataset has been labeled, I can pull it into my Lakehouse environment by simply calling the get_annotations method here from LabelSpark. So I’m going to pull in my labeled data and let’s take a look at the first row so we can get an idea of what Labelbox returns. We include some metadata columns here. These are useful if you’re using some of the more advanced features in Labelbox, like Consensus, where you might want multiple people to label the same image in case there is some subjectivity on how to label it. You can actually average their responses and score different data rows based on the consensus. Now we also get information about who created this label, the dataset that it belongs to and the file name here, StreetView1.jpeg. And Labelbox also has an issues and comments concept, where if I identify something wrong in this labeled image, I can flag it with an issue, but it looks like this one has no issues.
Nick Lee: So over here, let’s skip over to the label column. This is the most valuable column here and it includes the annotations for this particular image. And we deliver these labels to you as JSON. It’s an open standard and we don’t lock your annotations into some proprietary format. We give it to you in JSON and you can parse it as you will. So if we just pull apart this JSON here, it looks like we have a couple of classifications for this image. It’s an overcast weather image shot during the day and we have a handful of objects here. Looks like some bounding boxes, mostly. So lots of people here. And if we open up one of these bounding boxes within the Spark DataFrame, we can actually see the X and Y coordinates of that bounding box as well as some information like the height and the width.
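Pulling apart the label JSON described above is plain dictionary work once the column is parsed. The payload below is a trimmed, hypothetical example shaped like the one in the demo (classification answers plus objects with bounding-box geometry); the exact field names in a real export may differ:

```python
import json

# A trimmed, hypothetical label payload mirroring the structure described
# in the talk: classification answers plus objects with bbox geometry.
label_json = """{
  "classifications": [
    {"title": "weather", "answer": {"title": "overcast"}},
    {"title": "time_of_day", "answer": {"title": "day"}}
  ],
  "objects": [
    {"title": "person", "bbox": {"top": 120, "left": 64, "height": 180, "width": 55}}
  ]
}"""

label = json.loads(label_json)

# Pull out each person's bounding box as (left, top, width, height).
people_boxes = [
    (o["bbox"]["left"], o["bbox"]["top"], o["bbox"]["width"], o["bbox"]["height"])
    for o in label["objects"] if o["title"] == "person"
]
print(people_boxes)  # [(64, 120, 55, 180)]
```

Because the export is open JSON rather than a proprietary format, the same parsing works in a notebook, a UDF, or any downstream pipeline.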
Nick Lee: So this already is pretty useful to a lot of data scientists who can process this JSON, but LabelSpark also includes a couple of methods that help make this information a little bit more digestible. So we’re going to run our table flattener to basically pull apart that JSON into separate columns. Some of these columns are a little less useful. They’re more for developers who want to keep track of identifying pieces of information within the Labelbox API. But we also get the array of classification responses or the array of objects in the frame. So if I scroll over here, we get a list of all of the masks that apply to the objects in the frame. So if I choose one of these masks… Let’s just pick this one here at random and load this. It will actually bring up a PNG file with the mask over that object here.
Nick Lee: So this is the file that you can use with machine learning to identify where in an image certain objects occur. You can use this with segmentation masks to teach machine learning how to recognize specific objects at a pixel-perfect level, or you can use bounding boxes like this. So it looks like this one is definitely a bounding box. We have a weather column, the time of day as well as the people, car and umbrella counts. So if you’re interested in finding all of the images with specific weather or time of day, or maybe you’re only concerned about people, you can actually start to write queries against this table and filter down to those specific data rows. So for instance, let’s run this query here to find all of the photos that have people, cars and umbrellas and also rain. So it looks like this SQL query returned one image in this entire dataset.
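A query like the one Nick runs might look like the following. The column names are hypothetical, inferred from the flattened table described above; the pure-Python filter underneath applies the same predicate to in-memory sample rows just to make the logic concrete:

```python
# A Spark SQL version of the filter (table and column names are hypothetical,
# based on the flattened table described in the talk).
query = """
SELECT * FROM labeled_streetview
WHERE person_count > 0 AND car_count > 0
  AND umbrella_count > 0 AND weather = 'rainy'
"""

# The same predicate applied to in-memory sample rows for illustration.
rows = [
    {"file": "StreetView1.jpeg", "person_count": 19, "car_count": 2,
     "umbrella_count": 0, "weather": "overcast"},
    {"file": "StreetView7.jpeg", "person_count": 3, "car_count": 1,
     "umbrella_count": 2, "weather": "rainy"},
]
matches = [r for r in rows
           if r["person_count"] > 0 and r["car_count"] > 0
           and r["umbrella_count"] > 0 and r["weather"] == "rainy"]
print([r["file"] for r in matches])  # ['StreetView7.jpeg']
```

Once the annotations live in ordinary columns like these, the previously unstructured dataset is queryable with the same SQL you would use on any other table.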
Nick Lee: And I can confirm there that yes, all of my conditions are satisfied. And from here, I can actually go over to one of these columns and get some of the objects out of this data row. So here earlier, we had seen an example of a data row with all people. This one actually has some umbrellas in it. So you get the X and Y coordinates of the polygons for each of those umbrellas. And looking at X and Y coordinates and bounding boxes and masks is interesting, but maybe I want to quickly visualize it. I can take this link here and follow it back to that original asset in my data lake. And over here in the last column, I get a link back to that label in Labelbox. And that should actually load all of the bounding boxes, segmentation masks and polygons that were on that frame as well as the classifications there. So imagine you’re a data scientist, you write this query and you want to quickly jump into that image and look at exactly what was highlighted. This is a really easy way to do that.
Nick Lee: So let’s do another query as an example. Time of day, let’s choose something with the daytime and wherever we have more than 10 people. So here we go. Looks like we have a lot of photos with people in them. Let’s take a look at one of the most populated images. So this one has 19 people in it. So I’m going to go over to my column here and view the label.
Nick Lee: Wow. That’s quite a lot of people. So it’s a very easy way to query your previously unstructured datasets, dive in and inspect them, and power some machine learning workflows. So Chris is going to show us the next step, where we actually train a model to recognize people and umbrellas and cars and use it to label an example image that the model has never seen before.
Chris: So in my Databricks notebook here, I’m just going to run through some code and I’m going to import a TensorFlow model that, as Nick said, we trained on that training data of people and cars and umbrellas. And it’s going to go in and it’s going to take in a new unstructured piece of data from our Delta Lake. And it’s going to use this TensorFlow model, and it can be any model. This is your model environment. That’s the real point of the training data platform: having your model be the one that’s automatically learning from and importing this new data, so it’s always being revised and improved upon. So once the model’s loaded, we’ll go in and use this ontology builder again to build the same ontology. I’m including a handbag this time because I want to show the point tool.
Chris: And then we’ll create our project. We have a new piece of data that’s going to have some people and some cars, and we want to start structuring it. So I just created my project here with this new piece of data and called it MAL Demo. If you want to check this out real quick, let me just turn on Model Assisted Labeling. Once we turn this on… Let’s go and check it out real quick. We’ll have an MAL demo ready for us. And you can see a lot of people, a couple of cars, a bunch of umbrellas, and it would be really tedious to put all those bounding boxes on them by hand. So now I want to go and include that TensorFlow model in this training data. So I’ll go back to this and I’ll bring in my ontology. So now I have all of my schemas, so my model can relate to the schemas.
Chris: I’ll put all of these schemas in an NDJSON, which is basically going to take in our model inferences and map them to a Labelbox-ified version. And here I’m going to load all of those inferences into that Labelbox-ready JSON. And it will take a second to run this command. And once this runs, I want to now show the visualization of the model. So we just saw what it looked like unstructured and it’ll take a second. And now we can look at our model output, all these bounding boxes, these certain handbags and these cars. Now I want to bring those into Labelbox. So let’s upload our model inferences from that NDJSON into Labelbox. And now let’s hop back into the platform. And you can see your model inferences in the platform. And so here, we now have that TensorFlow output as training data and we just saw how it would be really difficult to go in and label all of these people manually.
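The NDJSON step Chris describes amounts to emitting one JSON line per model inference, each carrying a unique ID, the data row it belongs to, the ontology schema it maps to, and its geometry. This is a rough sketch of that mapping; the field names approximate the import format described in the talk, and the schema and data-row IDs are hypothetical placeholders:

```python
import json
import uuid

# Sketch: convert raw model detections into NDJSON lines for model-assisted
# labeling. Field names approximate the import format described in the talk;
# the schema and data-row IDs are hypothetical placeholders.
detections = [
    {"class": "person", "box": {"top": 10, "left": 20, "height": 50, "width": 30}},
    {"class": "car",    "box": {"top": 80, "left": 5,  "height": 40, "width": 90}},
]
schema_ids = {"person": "SCHEMA_ID_PERSON", "car": "SCHEMA_ID_CAR"}  # placeholders

def to_ndjson(dets, data_row_id):
    """Emit one JSON line per detection: uuid, schema mapping, target row, bbox."""
    lines = []
    for d in dets:
        lines.append(json.dumps({
            "uuid": str(uuid.uuid4()),           # unique id per annotation
            "schemaId": schema_ids[d["class"]],  # maps the inference to the ontology
            "dataRow": {"id": data_row_id},      # which asset this prediction is for
            "bbox": d["box"],
        }))
    return "\n".join(lines)  # newline-delimited JSON, one annotation per line

ndjson = to_ndjson(detections, "DATA_ROW_ID_PLACEHOLDER")
print(len(ndjson.splitlines()))  # 2
```

Once uploaded, each line shows up as a pre-label that a human can accept or correct, which is the heart of the model-assisted loop.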
Chris: So now we have these people pre-labeled and as a labeler, I’m going to be an iterator now. I’m going to be training my model on the edge cases. So if I zoom in, I see I missed an umbrella and this bounding box captured a few too many people. So I can now come in and revise this. So I’m revising my model inference right now. I’m teaching that model that this is an individual and there was another person here. So I’m improving my model with this new training data. And once I submit this, it’s going to be awesome because I will go right back in. And now we’re right back to where we started in that LabelSpark library, where I’m bringing in my project and displaying all of the annotations that I revised. So I have this iterative, active workflow where I improved my model output, I brought it back in, and now it’s living in a new Delta Lake table for that next cycle of better performance.
Nick Lee: Thanks for the great demo, Chris. What we just saw was an example of a model-assisted labeling workflow, where we took some unstructured data, annotated it, produced a trained model and used that model to pre-label an asset that the model had never seen before. Then we went into Labelbox, corrected the output and from there, we can retrain the model and improve it. We didn’t have time to get to MLflow or Delta Lake today, but if you wanted to take this to the next level, you can imagine a fully automated workflow with MLflow managing a lot of this orchestration and Delta Lake, with time travel, taking snapshots of your training data as well as the corrections that you made to that data in this model-assisted labeling workflow.
Chris: So Nick, we got through a lot of really cool stuff today. And just before we wrap up, what ways can we make this better? How can we take this to the next level? What’s next with LabelSpark?
Nick Lee: Well, I’m glad you brought that up. In the spirit of this year’s conference theme, Open, we are excited to announce that we are releasing the LabelSpark library as an open source library under the Apache 2.0 license.
Nick Lee: Now the Labelbox Python SDK, which you also saw today, is already Apache 2.0 licensed. So go ahead, take it and use it in your products and your projects. And we look forward to seeing what you can do with our technology.
Nick Lee: Visit www.labelbox.com/databricks-partner to get the library and documentation. We welcome contributions from the open source community and we look forward to working with you to help push the boundaries on what we can do with unstructured data, artificial intelligence and machine learning.
Nick Lee: Thank you. And we hope you have a great Data and AI Summit.
Nick Lee is a Senior Customer Success Manager at Labelbox where he helps AI teams solve challenging problems in computer vision and natural language processing. Nick also leads the LabelSpark project,...
Christopher Amata is a Solutions Engineer at Labelbox where he designs and deploys technical solutions for AI teams. He is also a lead developer for the LabelSpark project, a Labelbox initiative to ac...