As organizations launch complex multi-modal models into human-facing applications, data governance becomes both increasingly important and increasingly difficult. Specifically, monitoring the underlying ML models for accuracy and reliability becomes a critical component of any data governance system. When complex data such as image, text, and video is involved, monitoring model performance is particularly problematic given the lack of semantic information. In industries such as health care and automotive, fail-safes are needed for compliant performance and safety, but access to validation data is in short supply or, in some cases, completely absent. However, to date, there have been no widely accessible approaches for monitoring semantic information in a performant manner.
In this talk, we will provide an overview of approximate statistical methods and how they can be used for monitoring and for debugging data pipelines, detecting concept drift and out-of-distribution data in semantic-full data such as images. We will walk through an open source library, whylogs, which combines Apache Spark and novel approaches to semantic data sketching. We will conclude with practical examples equipping ML practitioners with monitoring tools for computer vision and semantic-full models.
Leandro Almeida: All right. Thank you so much for giving me the opportunity to talk at this conference. My name is Leandro Almeida, I'm a senior machine learning engineer here at WhyLabs, and I'm introducing an awesome open-source tool that we've been working on called WhyLogs, and specifically an awesome feature that we added recently around image logging. It's about understanding what we can log in terms of semantic information for images, which is very important when we're creating our machine learning models. One additional tool is an integration with MLflow, which makes it much easier for you to experiment and compare your models during an experiment, in production, and so forth.
So what are the main things that we're going to be discussing? The first is: why do we need logging with approximate statistics, specifically around images? And that is really about scaling properly to real-world datasets, real sizes, real scales, when you are doing deployments. Then we go back and ask: why are we logging in the first place? What do we need to log? What are the things we're trying to keep track of when we have real machine learning deployments? And then we ask: what are we going to be logging? That's very important information when it comes specifically to images, because there's a lot of information we can collect. Images are not as straightforward as a series of numbers. There are interpretations going on, and our models may make predictions that are not expected, or beyond the scope of what we intended the model to do.
So the first step is talking about approximate statistics, and how they're going to help us scale to real-world sized datasets. You may already be familiar with min, max, averages, and standard deviations of your distributions, but there are other things we can collect, which is really great. With approximate distributions, you can actually get a full distribution of your data in a streamable manner. It's not an exact distribution, but it allows you to at least understand how your data is distributed and compare it at different points in your pipeline.
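To make the "streamable, constant-memory" idea concrete, here is a minimal sketch of the general technique (an illustration only, not WhyLogs internals): Welford's online algorithm updates min, max, mean, and standard deviation one value at a time, never holding the full dataset in memory.

```python
class StreamingStats:
    """Constant-memory running min/max/mean/std via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                 # running sum of squared deviations
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x):
        # One pass per value: memory use stays constant regardless of stream length.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def std(self):
        # Population standard deviation of everything seen so far.
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0


stats = StreamingStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

The same update pattern generalizes to quantile and frequency sketches, which is what makes these summaries usable on streams of arbitrary size.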
But WhyLogs makes that even easier. You give it a specific feature and it will compute all the quantiles associated with that feature, the approximate distribution, standard deviation, counts, type counts, and top frequencies. And not only that: if the data is changing over time, we can keep track of the number of nulls, or whether the type of the data has changed. If you expected floats and all of a sudden there are strings, or a specific type like Boolean, or there are nulls associated with it, we keep counts of that, and it's very easy for you to go back and read it again. One of the awesome things about these approximate statistical methods is that they have a constant memory footprint. And one of the great things about the WhyLogs tool is that it allows you to merge different profiles together, which means you can collect this information in a very distributed way.
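To illustrate what "mergeable" buys you (a toy sketch with made-up field names, not the actual WhyLogs data sketches): each worker builds a small summary of its own partition, and the summaries combine without ever revisiting the raw data.

```python
def summarize(values):
    """Build a tiny mergeable profile of one data partition: counts, nulls, sum, min, max."""
    nums = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(nums),
        "sum": sum(nums),
        "min": min(nums, default=float("inf")),
        "max": max(nums, default=float("-inf")),
    }


def merge(a, b):
    """Combine two profiles into one, without touching the original data."""
    return {
        "count": a["count"] + b["count"],
        "nulls": a["nulls"] + b["nulls"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


# Two workers profile their own shards; the results merge into one global view.
left = summarize([1.0, 2.0, None])
right = summarize([10.0, 4.0])
combined = merge(left, right)
```

This is the property that makes distributed collection (e.g. across Spark executors) cheap: the merge is over summaries, not rows.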
One of the main philosophies behind WhyLogs is to keep the setup as minimal as possible, whatever use case you have in mind. The package itself is open source and available right now on GitHub, and we're constantly releasing new features. One in particular that we're going to talk about today, shown here on the screen, is logging images. You can log images directly from a path or as a NumPy array, and there are tons of feature transformations of the image itself that you can log. Some are already included within WhyLogs, and others you can add as custom information. These are some of the things we're going to be talking about today.
And it's even easier when you're starting to do experiments. If you're doing experiments with MLflow, you can just enable logging with WhyLogs, and it allows you to log that information within your experiments, which lets you easily compare between experiments, or between your production test set and the streaming data your deployed model sees. And like I said, these methods really are built for real-world sized datasets.
Even when you look at specific datasets you may find online, like the Lending Club dataset, New York tickets, or the pain pills dataset, which are very large in size, the memory consumption is constant, because it really depends only on the number of features you're logging. The output size is also small, and if you compress it further because you need to send it to a monitoring solution, it requires very low bandwidth.
So what again are the steps? One was: how do we scale? That's approximate statistics, already included in WhyLogs, which you can use right off the bat. So that's how we're going to be logging. The question now becomes: why do we need to log? And then: what are we going to be logging in the case of images?
So why do we want to log? I think one lesson that we've all learned over the past decade or more is that testing doesn't stop at your test set. When you deploy your model, you may encounter problems that you didn't expect at the very beginning, because of changes in the dataset or changes in the targets you're looking at. So it's very important to keep those in mind when you have a deployed model and may not have access to the validation data.
When it comes to monitoring deployments these days, there are a lot of names for the changes you need to watch for, the things that may be affecting your model. They could be called data drift, model drift, concept drift, domain drift, or something I like to call head-to-tail drift. And if you look at your model over time, these may appear as a sudden shift, a gradual one, or even with a seasonal dependence, due either to the problem you're looking at or to some other behavior you're not aware of.
Now, there isn't a one-to-one relationship between these names and their causes, but usually data drift is associated with the input data being inherently different. Model drift and concept drift are effects on the targets themselves: their behavior is modified, with the targets changing over time, perhaps because you used a specific window when you did your training or your production data testing. Then there are domain drifts and head-to-tail drifts. Domain drift is associated with your training set actually being biased compared to what you started with. Head-to-tail drift is where the task you're looking at ends up depending only on outliers: you only had the head of the distribution for training and production testing, and when you go to deployment, all you have to work with are the outliers, so your dataset is inherently different from the one you started with.
So what do we want to log? This is very important, and we're going to focus mostly on images. But in terms of your overall model, the easy answer is: you want to log everything, right? You want to log every single step of your pipeline, the data coming in and out. Potentially you also want to track specific task metrics: your model may be doing something specific, but the task you want to achieve is much greater than that. Usually people call these business metrics, because you're dealing with models deployed for a specific business goal. Those are easy, because they are indicators of whether your model is serving its purpose. That's very important.
But ideally, the measure with the best fit to your task is performance itself: how well is the model performing? And that requires validation data, which you may not have during your deployment.
So when it comes to inputs and outputs, what kind of information can we collect for images? The great thing about images is that they are files, and they may contain a lot of metadata. That metadata can give you a lot of insight into the specific semantics of the information. You may be part of a specific regulatory process, or tied to a very specific device to which your model is deployed. You can check this in the metadata, or make sure it gets written there at capture time. Your pipeline may also depend on sources of data beyond your control, so understanding the encoding, the raw resolution, and the aspect ratios is very important. Sometimes your pipeline is resizing to a specific size that is different from the one your data comes in at. Sometimes you're downsizing the image, and that's okay; you're losing some information there. But sometimes you're upsizing the image, and then you're adding information that's based on a specific interpolation model.
The aspect ratios may change, and they may differ from the aspect ratio your model is expecting. The great thing about images is also that you can look at different features beyond the pixel data itself. There are image-based quality metrics based on some kind of reference set that you use. There may be engineered features you want to look at, and outputs. The outputs may be image-based or non-image-based, in case you're doing segmentation, or detection in the case of bounding boxes or contours, or other information like key points, or latent variables in your model that you're passing on to another model to use, which contain some semantic information about your images.
In the case of file metadata, that could again be device, encoding, raw resolution, aspect ratio. In WhyLogs now, you can collect all the EXIF information from an image file, and we're constantly trying to add more information from different file formats. As you may know, image formats come in their own specific varieties, and they contain very specific metadata based on who created the file to begin with, and sometimes the metadata associated with them can change. It's very important, if you have a strict pipeline, to make sure the device is the same, especially if you're dealing with microscopy data. You want to make sure that the resolution, size, and zoom of the image are the same. Even though the image size and the aspect ratio may be the same, the zoom of the image may have changed, and that information may be contained in the metadata.
And if it's not, it may be a good idea to point that out to someone further back in the pipeline and include that information, to make sure your object sizes stay fairly constant, or at least relative to the same scale, especially when you're doing object detection in microscopy images, for example. But there's a lot of information beyond that which can be very useful when you're trying to debug a process and understand what has happened to your data.
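As one concrete illustration of checking raw resolution at the file level (a stdlib-only sketch of the idea; WhyLogs reads image metadata such as EXIF through its own tooling), here is how width and height can be pulled straight out of a PNG's IHDR chunk:

```python
import struct

def png_resolution(data: bytes):
    """Read (width, height) from a PNG's IHDR chunk, the first chunk after the 8-byte signature."""
    signature = b"\x89PNG\r\n\x1a\n"
    if not data.startswith(signature):
        raise ValueError("not a PNG file")
    # Bytes 8-12 hold the chunk length, 12-16 the chunk type; IHDR must come first.
    if data[12:16] != b"IHDR":
        raise ValueError("IHDR chunk not found")
    # Width and height follow as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# A hand-built PNG signature + IHDR header for a 640x480 image (CRC omitted for brevity).
header = (b"\x89PNG\r\n\x1a\n"
          + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">IIBBBBB", 640, 480, 8, 2, 0, 0, 0))
print(png_resolution(header))  # (640, 480)
```

Logging a value like this per file lets you alert when an upstream device or re-encoding step silently changes the raw resolution.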
The other category is feature distributions. These could be image quality assessment metrics, where you have some kind of reference set you refer to; engineered features that you compute, related to some specific property of the image, like blur, or saturation to see whether you're looking at landscapes, or key points. They can be learned features that you obtained through the model, latent values like embeddings, for example. Or they could be outputs: in semantic segmentation, you actually have an image of the segmentation itself that you could process in some way, or the outputs themselves, like the contour sizes of the objects, or classifications and scores, for example.
So for these features and distributions, in image quality assessment you usually have a reference set that defines the quality you expect the images to have. This is very common in the medical and microscopy industries, where you have a reference set defining how good the image quality should be. Then you take your image, compute a feature based on the comparison between your reference set and the image, and that is the information you log.
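A toy sketch of that reference-set pattern (the function names and the brightness feature here are hypothetical illustrations, not WhyLogs API): compute a feature per image, summarize it over the known-good reference set, then score incoming images by their distance from the reference.

```python
def brightness(pixels):
    """Mean intensity (0-255) of a flattened grayscale image."""
    return sum(pixels) / len(pixels)


def quality_score(image_pixels, reference_brightness):
    """Distance from the reference brightness; smaller means closer to reference quality."""
    return abs(brightness(image_pixels) - reference_brightness)


# Reference set: average brightness over known-good images (made-up pixel values).
reference_images = [[120, 130, 125], [118, 122, 126]]
reference = sum(brightness(img) for img in reference_images) / len(reference_images)

incoming = [40, 50, 45]   # a suspiciously dark incoming image
score = quality_score(incoming, reference)
```

In practice you would log the distribution of these scores over time, so a drift toward darker or blurrier inputs shows up as a shift in that distribution.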
One of the great things we added in WhyLogs is the ability to create custom functions. That means you can not only log your image, but also create custom functions with which you can transform the image, or compute whatever image quality assessment metrics you want. We're constantly adding new metrics that are, I would say, out of the box within WhyLogs, but it's very easy to add your own custom one if you have something specific in mind, or if, for example, you want to add a model output to your logging.
These features could be engineered, learned, or outputs. Even for features that aren't image quality metrics against a reference image, you may have a reference dataset you want to compare to. You log these separately from your image, and then, by looking at the approximate distributions from the set you currently have, you can compare distributions and see whether there was a shift. You can then use whatever statistical test you want: it could be a KS test, where you can see whether there's an actual difference between the distributions themselves; it could be a KL divergence, which is more tied to the specific information you're getting out of your model; or it could simply be a Hellinger distance, where you're comparing the means and shapes of your distributions.
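For reference, here are minimal, dependency-free versions of two of those comparisons (illustrative implementations of the standard definitions, not what WhyLogs ships): the two-sample Kolmogorov-Smirnov statistic and the Hellinger distance between binned distributions.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def cdf(samples, x):
        # Fraction of samples <= x in a sorted list.
        return bisect.bisect_right(samples, x) / len(samples)

    # The maximum gap is attained at one of the sample points.
    return max(abs(cdf(a, x) - cdf(b, x)) for x in a + b)


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same bins
    (0 = identical, 1 = completely disjoint)."""
    return (sum((pi ** 0.5 - qi ** 0.5) ** 2 for pi, qi in zip(p, q)) / 2) ** 0.5


# Feature values from training vs. production: a shift shows up as a large KS statistic.
drift = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Either statistic can be computed directly from the approximate distributions you logged, since both only need the distributions themselves, not the raw data.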
And again, all of these can be custom-created based on your needs, and you can daisy-chain them. The API is very similar to torchvision. The inputs don't even have to be images; they can be tensors or NumPy arrays, as long as your transforms handle them, and then you can log that information and literally use it as a reference point.
Also, one of the great things we added to the API is that you can just point it at a folder, and we'll try to log all of the images in it, as long as we can read them. And if we can't read them, well, that's the great thing about an open source project: we can easily add tools to read those formats if they're not available. You can also pass a whole list of features to compute, you know, brightness, saturation. These are just simple examples of functions that you can either create yourself or that are already defined within WhyLogs.
When it comes to segmentation information, you're usually looking at embeddings, for example. One thing that's really useful for understanding what your model is capturing in terms of semantic information is looking at the boundaries between semantic categories. One thing you can do in WhyLogs is look at those distributions: how far each point is from those boundaries, as a measure over your distribution. Usually when we look at embeddings, we tend to project them onto some kind of unit sphere, where we can easily compare them on a zero-to-one scale: objects that are semantically close, in terms of metric learning, have distances close to zero, while objects that are far away sit at a constant scale, say one in this case, the diameter of your n-dimensional sphere.
There are many variables you can compute. One really cool one is the pairwise distance: a one-dimensional variable you can compute across your dataset. That could be over your entire dataset, if you have the resources, or it could be per cluster. That means that as you're training, you can understand whether the semantic information is shifting or changing over time, which gives you a better understanding of how your semantic information is actually distributed, comparing, for example, training to production. Or you can keep even simpler one-dimensional statistics, not two-point correlations but one-point correlations: the distance of each embedding from the center of the cluster or category you define. You look at those distributions and see how they change.
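The per-cluster, one-point version of that statistic can be sketched like this (a plain-Python illustration; the cluster labels and embedding values are made up):

```python
def centroid(points):
    """Geometric center of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid_distances(embeddings_by_cluster):
    """One-point statistic: distance of each embedding to its own cluster centroid."""
    out = {}
    for label, embeddings in embeddings_by_cluster.items():
        center = centroid(embeddings)
        out[label] = [dist(e, center) for e in embeddings]
    return out


clusters = {
    "cats": [[0.0, 0.0], [2.0, 0.0]],
    "dogs": [[5.0, 5.0], [5.0, 7.0]],
}
distances = centroid_distances(clusters)
```

Logging the distribution of these distances per cluster, run after run, is what lets you see a semantic category spreading out or drifting away from its center over time.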
And those cluster centers could be abstract centers that are purely geometrical, or they could be the closest actual concrete example to that center. That gives you something you can carry from training to training, experiment to experiment, or potentially to deployment, which is what you want.
And lastly, one thing you can definitely log is the outputs of your model. As your model produces outputs, they're going to change over time, and it's very important to look at them to understand why the model is performing the way it is. Are they changing over time? Is my confidence constantly low? Could that be due to something down the pipeline, or to the performance of the model itself, or just to the input data that's coming in?
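One simple way to watch for that kind of sustained confidence drop (a hypothetical monitor for illustration, not a WhyLogs feature) is a rolling window over the model's output scores:

```python
from collections import deque

class ConfidenceMonitor:
    """Track a rolling mean of model confidence scores and flag sustained drops."""

    def __init__(self, window=100, threshold=0.5):
        self.scores = deque(maxlen=window)   # old scores fall off automatically
        self.threshold = threshold

    def record(self, confidence):
        self.scores.append(confidence)

    def rolling_mean(self):
        return sum(self.scores) / len(self.scores) if self.scores else None

    def is_degraded(self):
        # Only alert once the window is full, to avoid noisy early alarms.
        return (len(self.scores) == self.scores.maxlen
                and self.rolling_mean() < self.threshold)


monitor = ConfidenceMonitor(window=4, threshold=0.5)
for confidence in [0.9, 0.4, 0.3, 0.2]:
    monitor.record(confidence)
```

The same windowed pattern works for any scalar output: class scores, detection counts, contour sizes, and so on.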
And if you look at the drifts in the potential inputs, in the case of images, you can try to figure out exactly what's happening and debug the process as you go along, [inaudible 00:00:17:05] for example, look at the size of the objects you're detecting. If all of a sudden the tail of the distribution of sizes starts getting smaller, like you're getting fewer and fewer large objects, maybe something has happened. Maybe the zoom of your image has changed. Maybe there are semantic differences in the image, and maybe it's looking at something completely different from what you started with. These are the kinds of things you can quickly log with WhyLogs over time and quickly compare distributions, so you can easily see whether something has changed, shifted, or drifted over time.
Again, just to go over the steps we discussed: the goal for us is to use approximate statistics to really scale to these large datasets, which is very common with images, especially since each image can be quite large, not only in memory size but also in the data it contains. Approximate statistics, which we went over, are collected automatically by WhyLogs for your features, including images. And we're adding more and more tools for other types of semantic information, like audio and text. So feel free to come join us and contribute.
Why are we logging? We went over that. It's very important to detect potential drifts in your data and models when they're deployed; the models may not stay the same, and the data may not stay the same. Logging semantic information is particularly important because there's information beyond the pixel values.
Again, it's a scalable process. Constant memory consumption is very important when you have a deployment with data constantly coming in, and you may not want to keep all that data, especially as you're logging. It's available right now; this is a short link to the GitHub repo I showed before. Try it today, help us contribute, and really help build a standard for what we want to log and how we want to log it. I really hope you can join us.
Thank you so much. I really appreciate the support from WhyLabs and this whole conference for giving me the opportunity to speak. If you have any questions, feel free to drop them in the GitHub repo, or just send me a message. Thank you.
Leandro Almeida leads the data science team at WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Leandro helped build one of the ...