In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring, and it can work with different modes of data operation, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large-scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.
Andy Dang: Hello everyone. My name is Andy Dang. I’m the lead engineer and a co-founder at WhyLabs. Today, I’m going to talk about a new approach to data monitoring with Apache Spark and an open source library called Whylogs. This is an overview of the talk. We’ll start by discussing the challenges of ML data monitoring, then we’ll deep dive into some lightweight profiling techniques that can be applied to big data pipelines, and finally I’ll end by discussing the open source Whylogs library, with which we’re building a standard for data logging. This is the bigger picture of the ML lifecycle, from Google Cloud AI. As you can see, it contains a lot of components, and it is a slight evolution from where we came from in the DevOps world. Nowadays, when you deploy an ML model, you have various components, from data pipelines and infrastructure to metadata management.
The anatomy of an ML model pipeline is much more complex because of these components. There are many steps in it, and it involves a lot of data and metadata, such as hyperparameters. Because of this, we have a gap in the solutions for deploying models effectively, which we call the data tooling gap. There are solutions out there to address parts of this gap, such as feature stores, model serving, and model registries. However, when it comes to monitoring, we still apply DevOps monitoring frameworks to machine learning pipelines, where monitoring individual metric points often fails to provide adequate insight into machine learning data because of its scale and high dimensionality. Complex data types, such as images, audio, and text, make things even more complicated.
These are some of the issues, or a subset of the issues, that we have seen in production, and they’re a small sample. One thing to notice is that a good number of these originate not from the production stage but from pre-production stages, and then they cause actual issues once the model is deployed. A majority of them, perhaps surprisingly, come from data. Not unlike DevOps, where bad code will break your production application, bad data can break your models. However, we often don’t give data quality control the same weight as we give code review, for example. This is something I believe we in the field of machine learning and data engineering need to change, and to those of you who work with big data pipelines and the ML practitioners out there, I implore you to think about how you can start to control your data quality and your model performance from an end-to-end perspective.
How do we address this? We do not want to reinvent the wheel. There have been a lot of lessons learned in the DevOps world, and we can take a few of them into the data world. Just like how programmers start debugging programs with print statements, we can do something similar with data logging, if we can figure out how to do it in a lightweight manner that captures adequate insight and information, so that you can do this regardless of where your model runs. Models now run on very complex infrastructure, from a Spark ML pipeline to things like your smartwatch or IoT devices. You really need a framework that works across these deployment infrastructures while sharing a common language underneath, so that we can build processes and tooling around it. In the next slides, I’ll discuss why data profiling is a powerful technique when it comes to monitoring data, and how it applies as a data logging solution.
Besides data profiling, another approach we’ve seen in production is sampling, where people take a sample of the production data set and run analysis on top of that. This is really a workaround for the fact that deployed data is big and messy, and therefore very unwieldy, or very expensive, for a lot of companies to analyze in full. Profiling, if done right, achieves both scalability and a lightweight footprint, and you can detect outlier events independent of the distribution of the data. Sampling is basically a coin flip: if you get the right sampling strategy for your data, you get a representative sample, but otherwise the sample just turns into noise. Another point I want to make is that when you sample data, you’re storing individual data points, and that can come with privacy and security concerns if you don’t have adequate protections in place, whereas profiling works in an aggregate manner. You look at overall statistics of the whole collection of data points rather than at individual data points.
It also comes with the benefit of not actually storing your users’ or your model’s data, and is therefore friendlier to the privacy, security, and compliance aspects of modern ML business requirements. Another thing about profiling is that when you apply techniques such as HyperLogLog, which I’ll discuss a bit later, you can achieve really high accuracy compared to sampling, while again keeping the performance overhead and the memory footprint small. We ran some analysis comparing sampling with the profiling techniques implemented in Whylogs and showed the difference in accuracy here. As for the lightweight nature and the scalability: we have a lot of big data nowadays, and this is a benchmark of Whylogs, which implements various profiling strategies. We demonstrate here that the profiling technique runs with bounded memory as well as bounded IO.
What does that mean? It means that you can run the same library across a very large data set and still get a fixed-size output. The reason is that what we store is not the actual data points but, again, the statistics on top of them. Did I mention streaming? I want to emphasize here that you’re able to log your data regardless of whether it’s a batch data set, a streaming data set, or even deployed on IoT infrastructure. When combined with a generic distributed processing engine like Spark, you can imagine how this approach easily scales to terabytes of data.
Speaking of terabytes of data, I want to shift the focus a bit to how we think about this problem at scale. There are four key paradigms we want to call out here; they overlap, but touch on different angles of the problem. First, we want to focus on approximations rather than collecting exact results. We want to collect these in a lightweight manner, the results should be additive, and the framework or infrastructure around this should support both batch processing (think of the Apache Spark use case) and streaming (again Spark, or Flink or Kafka).
We built an entity called a profile, which is basically a collection of lightweight metrics and metadata that satisfies these qualities. Another benefit on top of this list is privacy and data security, because a profile doesn’t actually contain your data, which can be very expensive, very sensitive, and difficult to handle. If you operate in an environment with high regulatory requirements, interacting with profiles is a much safer, more privacy-oriented way to work on behalf of your end users. First, the lightweight aspect: what does that mean in the context of profiling? In the traditional approach, when you want to compute metrics, you basically have to store the data in a data lake or data warehouse, whether you do it via sampling or via an ELT or ETL data pipeline; you have to store it somewhere, and then you run queries on top of that.
This is fine if you already have a pipeline for it, but maintaining that pipeline is non-trivial, and the cost of running such queries is high as well. The new approach is that, instead of doing this in a centralized way, you start collecting data statistics as you stream through your data, while you process it, within the same process, so you don’t have to add additional infrastructure. You just add a library, a sidecar process, or an additional process on your machine, and then you can run profiling. The result is not the data itself but the profiles, which are much smaller in size, and they can then be aggregated and collected into a store such as an S3 bucket or a GCS bucket for further analysis.
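To make that in-process idea concrete, here is a minimal sketch in plain Python (illustrative only, not the whylogs implementation) of a single-pass collector in the spirit of what a profiler gathers as records stream by: Welford's online algorithm, which tracks mean and variance in constant memory.

```python
# Toy single-pass statistics collector (illustrative, not whylogs code).
# Welford's online algorithm: mean and variance in O(1) memory per column.
class RunningStats:
    def __init__(self) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for records streaming by
    stats.update(value)
print(round(stats.mean, 6), round(stats.variance, 6))  # 5.0 4.0
```

The collector never stores the raw values, only a handful of numbers, which is what keeps the memory footprint independent of data volume.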
But this is only possible if you can do it fast and with a very small memory footprint. The additive nature of profiles matters because you’re running this in a very distributed manner, whether in a Spark job or in a distributed real-time inference service: you can add these results together directly without having to do expensive IO operations. What I mean by that is, for example, calculating the median of a stream of numbers. That is a very common operation that is often relied on to detect data drift. In the traditional Spark way, you would have to sort through all the data to figure out where the median is, and that requires a shuffle, which means IO cost in Spark. If you have billions and billions of data points, that is non-trivial in terms of computation and cost, because sorting can be expensive.
Whereas, with the profiling approach, since these profiles are additive, you can just add them together at the end, after collecting data independently, and this data doesn’t have to be available in the same pipeline; it can come from different sources, but you can still get a global estimated median. Because of these properties, you can apply this technique in both batch and streaming infrastructure. What that means is that you can run profiling within Spark or Hive or Kafka, and then get the global profiles with very little added cost. If you think about a streaming use case with windowing operations, you don’t need to recompute the result across different batches of data; rather, you can reuse previous results in a much more effective manner, because, again, these profiles are additive, so you can just merge them or, in Spark’s language, aggregate them easily.
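The additivity property can be sketched in a few lines of plain Python. This is a toy stand-in for a real profile (which would also carry quantile and cardinality sketches): two profiles built independently, say on different partitions or different days, merge into one without ever touching the raw data again.

```python
from dataclasses import dataclass

# Toy additive profile for one numeric column (illustrative only).
@dataclass
class ColumnProfile:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        # Merging is cheap arithmetic on summaries: no shuffle, no raw data.
        return ColumnProfile(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

batch_a, batch_b = ColumnProfile(), ColumnProfile()
for v in [1.0, 2.0, 3.0]:   # e.g. yesterday's batch
    batch_a.track(v)
for v in [10.0, 20.0]:      # e.g. today's batch
    batch_b.track(v)

merged = batch_a.merge(batch_b)
print(merged.count, merged.minimum, merged.maximum)  # 5 1.0 20.0
```

Because merge is associative and commutative, the batches can arrive in any order and from any source, which is exactly what makes windowed streaming aggregation cheap.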
Finally, for the approximate statistics, we basically model this as a stochastic process. We use Apache DataSketches, the open source implementation of these algorithms. They gave a very good talk on this topic, deep diving into the algorithms, at the Data + AI Summit in 2020. Using these statistics, we focus on histograms, frequent items, and cardinality. They provide really high fidelity into the nature of your data. We found that when applied to real-world machine learning problems, as well as data pipeline management problems, we can detect data drift and model issues very well, and even more complex problems such as bias. Now, let’s move on to discussing the Whylogs library, which is the implementation that encapsulates these learnings. Why Whylogs? Whylogs aims to be the data logging library, and what we mean by that is that we want to build it for both data engineering and data science workflows. Therefore, out of the box, we support Python and Java with a single common underlying format, so that you can run data collection in Java and analysis in Python, where you have the notebooks, for example, and where the analysis tooling frameworks are much more extensive.
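DataSketches provides production-grade versions of these algorithms. As a toy illustration of how a frequent-items sketch bounds memory (not the algorithm DataSketches uses internally), here is the classic Misra-Gries algorithm: it keeps at most k - 1 counters no matter how long the stream is, and each surviving estimate undercounts the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """Toy frequent-items sketch: at most k - 1 counters at any time."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Any item occurring more than n/k times is guaranteed to survive.
print(misra_gries(["a", "a", "a", "b", "b", "c"], k=3))  # {'a': 2, 'b': 1}
```

The same bounded-memory idea is what lets a profiler track frequent items over billions of rows without ever holding the rows themselves.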
On top of that, we support complex data types: we have image support at the moment, and we are extending the Whylogs library to support text, video, audio, and embeddings in the future. We’re also working to make Whylogs available in various integrations, including things like Kafka, AWS SageMaker, MLflow, and obviously Apache Spark. Basically, Whylogs is meant to be a drop-in library, similar to how people interact with log4j in Java or the logging module in Python. But instead of focusing on program or application logging, we focus on data logging.
Whylogs in Python is a very common use case. This is a very easy way to try out the library, because you only need a few lines of code to start logging, and the nature of Python is that you have the Jupyter Notebook environment, in which you can just pip install it, plug it in, and get a sense of how the library operates. It comes with out-of-the-box visualization utilities so that you can quickly gain insight into your data. Now for the Apache Spark integration, which I want to talk a little bit more about here. Again, I mentioned this before, but when you scan your data, you have various partitions distributed in a disparate manner across a Spark cluster. What Whylogs implements is a custom aggregator on top of this, hooking into your DataFrame.
We actually look at your schema as well as various other information, such as the group-by key, and then we build this collection of metadata. Within that metadata, we also build what we call sketches, and other metrics as well, so that for each partition you get a single profile. In the end, because it is a custom aggregator, the reduce step merges all of these profiles into a global profile without causing the data to shuffle. In Spark, this is extremely efficient, and we have really good performance results from it. The final result contains information about your data, such as metadata and schema. You can even break it down by group-by key, so that you get a slice-and-dice view of your data as well, and the final result can be stored either in Whylogs’ native binary form or in Apache Parquet, because the output is a Spark DataFrame object.
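Conceptually, the aggregation is a map step that profiles each partition independently, followed by a reduce step that merges the small profile objects. A plain-Python sketch of that shape (hypothetical data, with collections.Counter standing in for a mergeable profile):

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for one Spark partition (hypothetical data).
partitions = [
    ["cat", "dog", "cat"],
    ["dog", "dog", "fish"],
    ["cat", "fish"],
]

# Map: profile each partition independently, next to the data.
partition_profiles = [Counter(p) for p in partitions]

# Reduce: merge the profiles pairwise. Only the tiny profile objects
# move between workers, which is why no data shuffle is required.
global_profile = reduce(lambda a, b: a + b, partition_profiles)
print(dict(global_profile))  # {'cat': 3, 'dog': 3, 'fish': 2}
```

In Spark terms, the map step runs where the partitions live and the reduce step only ships summaries, so the cost stays near-constant as the data grows.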
We take advantage of Spark’s Scala API in building our own API. With a couple of lines of code, you can import the Whylogs utilities, which extend the Dataset API so that you can add various metadata, run the aggregation, and output a set of Whylogs profiles. Then you can choose to store them in S3 or your favorite storage. Once you do that, you can run the analysis in Python. Sometimes people want to use PySpark directly without having to work in Scala, so we have also built a PySpark bridge that lets you run this analysis directly in a Jupyter notebook. If you have a backing Spark cluster you can use that, or you can run it against a local Spark instance using the PySpark setup.
This also reflects the data-science-workflow-oriented API thinking that we have in Whylogs. Finally, because of the streaming nature of Whylogs, it can also run in accumulator mode. If you don’t want to double-scan the data, you can imagine that just before you store the data to, say, S3, you put a Whylogs accumulator in before the write to S3 as Parquet, and Whylogs will collect the statistics there; finally, we have an API to extract the data out of that. Once the data is collected in a store, you can apply various analyses in Jupyter notebooks. For example, here’s how a user might catch distribution drift by looking at the quantile distribution over time, by parsing the data and visualizing it with Whylogs. Check out our example notebooks in our repository for more information.
Another thing about Whylogs is that because it knows about the data schema, in Spark for example, or in Python and Java, we perform type detection, and because profiling each feature is really lightweight, we can afford to profile all of the features across your data set, so you don’t have to configure it. This is unlike traditional DevOps monitoring, where you have to manually configure things such as which metrics to collect. With Whylogs, you aggressively collect all the metrics you want because of the lightweight nature of the library, and then you can store them and run more advanced analysis in, say, a Jupyter Notebook environment, where you have access to a lot of libraries for things like data drift detection.
Finally, this is just a very high-level architecture diagram of what the monitoring layer of an ML application looks like. Again, you start with machine learning telemetry collection, and you run Whylogs wherever you run your machine learning and your data pipelines. That means in live inference, when you deploy your model with SageMaker or MLflow model serving; in batch inference mode, if you run batch inference, say, against SageMaker and Apache Spark; when you build a model during offline training and testing or within your Jupyter Notebook; and even in more constrained deployment modes, such as IoT sensors, where memory is much more limited. You can imagine storing these Whylogs profiles on an SD card and then shipping them back once you have an internet connection, for example.
That’s actually a common use case we’ve seen when deploying Whylogs to more performance- and network-sensitive devices. Once these profiles are collected, you want to store them in a storage system such as S3, or ideally something like a Hadoop or Parquet dataset. By tagging and partitioning them in the right way, you can then run a monitoring, alerting, and visualization layer on top of this. When you have alerting and monitoring, you can feed the signal back into your ML pipeline and trigger, say, retraining when the data in production looks very different from the data in training, or you can hook it into, say, a Slack webhook so that your team members or your data engineers get an alert about abnormal data behavior in production and can diagnose further.
Our vision is to make it really seamless to drop Whylogs in, regardless of where your pipeline runs. We’re starting with data collection at the moment, but we are building out various integrations and workflows on top of this. We would love feedback from machine learning practitioners and data engineers so we can build this in a truly end-to-end manner, so that you can really have full observability over your data quality. What’s next? The next steps include complex data type support. We talked about embeddings and NLP, and we would love to support even more complex deep learning models. We want to simplify integration with various data processing and training frameworks; think of it as a single line of code to hook Whylogs into each framework. We have an integration with MLflow at the moment, but there are a lot of new frameworks popping up, and we would love to extend Whylogs everywhere. Extending Whylogs into workflow integrations, such as MLflow and Airflow, is very important.
We’re actively working on that, as well as real-time model serving. We have that for MLflow, but there are other technologies out there, such as KFServing, TensorFlow Serving, and TorchServe, and we would love to integrate more broadly. Finally, because Whylogs works with aggregate statistics, we have a lot of opportunity to improve data logging from the angle of encryption and privacy. We have been thinking about things like homomorphic encryption, or two-way encryption support within Whylogs so that users can plug in their own custom keys. That is something we’re actively exploring as well, and we would love your feedback to figure out the workflows in general. Check out the Whylogs library on GitHub. We would love to have feedback as well as contributions from users. We also hang out on our Slack channel, which is linked on the GitHub page, or you can find me on Twitter. That concludes my talk. I hope you all have a good day, and I’ll hang around for a few Q&A questions.
Andy Dang is the co-founder and head of engineering at WhyLabs, the AI Observability company on a mission to build the interface between AI and human operators. Prior to WhyLabs, Andy spent half a de...