As the importance of data grows and its connection to business value becomes more direct, data engineering teams are increasingly adopting service level agreements (SLAs) for how they deliver data, covering new factors like data freshness, completeness, and accuracy.
In this session we’ll discuss how to use Deequ, a data quality library that’s purpose-built for Spark, to develop a data monitoring and QA system that will enable you to meet the SLAs you guarantee to your analysts, data scientists, and other business stakeholders. We’ll cover how to use Deequ to create quality checks that report metrics and enforce rules on data arrivals, schemas, distributions, and custom metrics. We’ll cover how to visualize, trend, and alert on those metrics using pipeline observability tools. And we’ll discuss common challenges that teams face when setting up data quality logging infrastructure, as well as best practices for adoption.
We’ll use common examples such as machine learning, data transformation, and replication pipelines (such as moving data from S3 to Delta Lake).
With these tools, you’ll be able to create more stable, reliable pipelines that your business can depend on.
Josh Benamram: Thanks very much to everyone for joining our session. We’re excited to present to you and share our perspectives on this really interesting, and we think really important, topic for data ops: how to guarantee that you’ll meet your data quality SLAs. SLAs are an important tool for improving the quality of your data products. In this presentation, we want to cover, first of all, some context on what data SLAs are and why they’re important. We’re going to describe an example data pipeline and discuss where we use two tools, Deequ and Databand, to measure and monitor those SLAs and, ultimately, deliver more reliable data products. So, first of all, I’ll start with some background on who we are. I’m Josh, CEO and one of the co-founders of Databand AI. I’m joined by my teammate Harper, a data solutions architect at Databand. Our company Databand provides data pipeline observability systems that help data teams deliver reliable data products.
In my own background, I’ve jumped between a lot of different kinds of roles, always with a really common theme in data. I co-founded Databand to help data engineers solve a lot of the issues that I struggled with in my own previous roles. Before Databand, I was a product manager at a big data analytics company called Sisense, where I built features in our product for data engineers and analytics engineers. And prior to that, I was an analyst in the finance world, working first at a hedge fund and later at a venture capital firm, always very focused on the data landscape. I’ll pass it off to Harper to tell you a bit about his background as well.
Michael Harper: Thanks, Josh. Everyone, my name is Michael Harper; as Josh alluded to, I go by Harper because there are simply too many Michaels in this world. My background is in mathematics, economics, and accounting. I studied those different disciplines for a few years and kept feeling like something was missing, and I realized about five years ago that it was data. All of those disciplines really revolve around data, but they don’t look at the way that it’s modeled; they look at the way that it’s recorded. And ever since I discovered data warehousing, modeling, and data engineering, I just kind of fell in love with it. I started off in the business intelligence space, working in data warehousing, mostly in the Microsoft space. From there, I moved on to a company called Cvent, where I was working as a quality engineer on our integrations team.
We were the only team that was really looking at data and how to test data, and it really opened my eyes to the lack of tooling and the state of the data engineering experience. From there, I moved on to consulting, working mostly in the natural language processing and data ops spaces for the contracts that I was on. And again, our clients constantly came back to us about cataloging: how do we take care of these things, and how do we ensure that we’re going to deliver the data we expect and that the product’s going to be as good as we expect it to be? When I talked to Josh about what Databand was trying to do and the product that we were building, I realized that this is the tool that I’d been lacking for five years. So I’m really excited to work here and to help our company provide a better developer experience to data engineers, because they really deserve it.
Josh Benamram: Thanks, Harper. Okay, moving into the real topic of discussion today, I want to start by defining our terms. So what is an SLA? An SLA, or service level agreement, is a contract that sets the standards for how a service is delivered. This is very common in the enterprise software world: when a company delivers its software to a customer or group of consumers, you’ll find an SLA that guarantees certain stipulations of how that software needs to be delivered and the standards that it needs to meet. A common example is uptime for a cloud application, right? An SLA will guarantee 99.999% uptime. At the end of the day, what these guarantees do is help create trust between a service provider and a consumer. As the data economy grows, the need for SLAs is becoming more and more critical for data teams who are producing and creating products from their data.
So, as more organizations rely on data and build products from their data, we’re finding that more and more teams are actually incorporating data SLAs for their data products. Whereas software SLAs obviously make guarantees around metrics for software products, data SLAs need to guarantee metrics around how data quality is measured, and that’s really something that is in the process of being defined across many data organizations today. As with software products, data SLAs are about creating trusted data, and we think that this is a really crucial ingredient to creating a high-performing data team and, ultimately, a data-driven organization. We think this is really critical because it sets a clear bar for how a data team needs to operate.
So, data SLAs are needed for a high-performing data team. Now, the question is, what does the SLA actually consist of? For teams that are really defining this for the first time, we do have some guidance on the main pillars that we like to look at when creating these kinds of standards. I’ll walk through a few of those main pillars and example metrics that map back to them. The first pillar that we like to look at is your uptime, which answers the question of whether your data is up to date and accessible. So, if you have a table in Snowflake or a file in S3 that analysts depend on daily, is that data arriving on time to that location when it should? A sample metric that we want to watch here would be the actual landing time of the data.
So if that file is expected every morning at 6:00 AM before business arrives, is it getting there on time to that location? Another main metric that we’d like to watch would be the durations of jobs that are actually writing the data. So the pipelines themselves that are producing this data. The reason why that’s important to look at is if the duration of a pipeline exceeds some normal boundary, the data’s never going to get written to that location on time, right? And we want that kind of information because it can give us really good leading indicators to a potential SLA miss of uptime. Next thing that we’d like to look at is the completeness of the data. So is all the data arriving in the right format? We know data is getting there on time. Next question is, is all the right data getting there?
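As a concrete illustration, the landing-time check just described can be reduced to a very small sketch. The deadline and helper name here are hypothetical, invented for this example; they are not part of any product mentioned in the talk:

```python
from datetime import datetime, time

def landed_on_time(landing_time: datetime, deadline: time = time(6, 0)) -> bool:
    """True if the daily file arrived at or before the SLA deadline (e.g. 6:00 AM)."""
    return landing_time.time() <= deadline

# A file landing at 05:42 meets a 06:00 AM uptime SLA; one at 06:15 misses it.
print(landed_on_time(datetime(2021, 5, 3, 5, 42)))   # True
print(landed_on_time(datetime(2021, 5, 3, 6, 15)))   # False
```

In practice, the same comparison would run against the modification time of the S3 object or warehouse table, and a miss (or a duration trend pointing toward one) would trigger an alert.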
The kinds of metrics that we’ll look at for completeness would be things like the schema of your data, the counts of your data, and the null counts within a dataset, making sure that those are within an expected boundary, say 90% or 95%, depending on what standards you create for your team. After we know that all the data is being delivered on time and in its full entirety, the question moves to fidelity, which is to say: is the data that’s arriving accurate? Is it true? Does it actually reflect reality? The kinds of metrics that we’ll look at here would be things like distributions, skews, and potential outliers that surface within the dataset and indicate whether or not it is accurate, or whether it might have some accuracy issues that we want to take a look at.
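The null-count style of completeness check mentioned above needs no framework at all; a minimal sketch, with made-up column names, looks like this:

```python
def completeness(rows: list, column: str) -> float:
    """Fraction of rows where `column` has a non-null value."""
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)

rows = [
    {"symbol": "IBM",  "short_ratio": "0.5"},
    {"symbol": "AAPL", "short_ratio": None},
    {"symbol": "MSFT", "short_ratio": "1.1"},
    {"symbol": "GOOG", "short_ratio": "0.9"},
]
score = completeness(rows, "short_ratio")  # 0.75
print(score >= 0.95)  # False: this batch fails a 95% completeness standard
```

Deequ computes exactly this kind of ratio for you at Spark scale, which is what the later sections demonstrate.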
Finally, the last pillar of our SLA, after uptime, completeness, and fidelity, is remediation. If there are lapses in our SLA, if we miss any of the standards that we put on uptime or these other categories, how do we quickly recover? How do we find and fix the problem that’s come up? A great KPI to look at here is tracking how many SLA issues were discovered and resolved before our data delivery time, because that helps us get a sense of how effective we can be at solving issues before they even impact our downstream data consumers.
So, if we expect that a table should be written into Snowflake or our data lake at 6:00 AM before business arrives, and we start to see that a pipeline is missing a duration milestone for a task early on, let’s say at 4:00 AM, do we have a good process for resolving that problem, restarting the pipeline, and making sure that the dataset is still delivered on time? Solving those kinds of potential SLA misses before they actually impact our users downstream. So, those are the four main pillars that we like to look at for meeting our SLAs, and with that, I’ll hand it over to Harper to dive into our actual use case.
Michael Harper: Thanks, Josh. One thing I wanted to highlight with SLAs is that I’ve talked to a lot of different data teams, and there tends to be a lot of anxiety or nervousness around committing to an SLA when it comes to their data. One thing that I would point out is, for me, it’s an opportunity to really show what we want to focus on and how we can deliver for our client. It really can help grow and mature a team. It can help define things like a definition of done, and get those really core values of a team working in the right way, so that you end up doing the work you need to do without having to think about the SLA, because you’re just writing great code. To demonstrate that, we decided to make a small sample pipeline, so we can show how we use Deequ to track the metrics that give us visibility into data quality.
And when it comes to Deequ, we really focus on those two middle pillars, completeness and fidelity. In combination with Deequ, we’ll be using Databand as a way to monitor this information, and Databand does a great job of letting us look into the uptime of our data through the execution of our pipelines and some freshness metrics that we’re capturing in the background. But for the pipeline that we built today, I mean, everyone knows that there are five easy steps of data engineering, right? You get the data, store the data, transform the data, sprinkle on some secret sauce, and then you make a profit. It’s just that easy. In order to get there, it felt like the best thing in 2021 to make that profit would be to do some gambling, I mean, investing in the stock market. So, we built a small pipeline where we’re going to grab some API data from Alpha Vantage.
We’re going to drop that raw data into S3 in Parquet format. Then we’re going to transform that data, put it into our warehouse, and allow it to be analyzed. This is a little visual representation of the pipeline that we’ve built out. It’s probably more broken apart than it needs to be; there are some pieces that could be brought together, but we really wanted to isolate each item, make it really atomic, and make the different steps really clear, so that we can talk about where we have the opportunity to enhance our pipeline to support our SLAs and improve data quality. Because at the end of the day, when we have a new dataset, how do we ensure that the data we receive is the data that we expect? And when it’s not the data that we expect, how are we going to create a data quality framework that takes minimal effort, delivers maximum value, and also provides accessible metrics that will support our SLA?
Not just give us information that’s kind of interesting to have. For us, Deequ is the answer to that question. First, we’ll talk about Deequ, and then we’ll talk about why we’re using Databand. Deequ is a library built by AWS Labs. It’s written in Scala and works natively within Spark, so it’s built to be performant. It’s also really designed for that unit testing of your data, at either the column or row level, that is otherwise really difficult to achieve. There are a lot of really nice features that come out of the box, and so we really think that it captures these metrics in the easiest way possible, with a nice little code integration (we won’t call it no-code). The one thing with Deequ, though, is that it returns all of these really rich metadata objects that you then have to extract information from.
Deequ also provides in-memory repositories that allow you to do some really interesting incremental testing of data. But at the end of the day, it’s just another thing that gives you logs, and anybody who has worked on a data engineering team knows how many logs there can be to walk through to understand what the problem is. So, for us, what’s really important is to have a good monitoring tool, and that’s where Databand comes into play. We’re going to use Deequ to grab this information, and then Databand is going to be our metadata repository. Databand is a pipeline observability platform. We have a lot of open source integrations with Airflow, Python, Java, and Scala, and an open source SDK that you’re welcome to go use today. We’ll have a link in our sources later; just implement it straight into your code, and you can start logging the metrics that you think are important in your pipeline.
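The instrumentation pattern described here, sprinkling metric-logging calls through pipeline code so that they feed a metadata repository, can be sketched in a dependency-free way. This is a toy stand-in to show the shape of the pattern, not the actual API of Databand’s dbnd SDK; see their docs for the real integration:

```python
import time

METRIC_STORE = []  # stand-in for the metadata repository a real SDK would write to

def log_metric(name: str, value) -> None:
    """Record a timestamped metric; an observability SDK ships this to a backend."""
    METRIC_STORE.append({"name": name, "value": value, "ts": time.time()})

# Inside a pipeline task, log whatever your SLA cares about:
log_metric("row_count", 57)
log_metric("symbol_null_ratio", 0.0)
print([m["name"] for m in METRIC_STORE])  # ['row_count', 'symbol_null_ratio']
```

The value comes from calling this at every interface of the pipeline, so the backend can trend each metric over time.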
You store that into your metadata repository, and then you can access that repository through the Databand web application and view those metrics as they change over time. From there, we can set up alerts and do things like anomaly detection to really give us insights into how our data and the profile of our data are changing, as well as alert us when something is outside of the norm, so that we can react quickly and really get ahead of those remediation efforts. So, again, Databand here is really helping us with that uptime information, while Deequ is capturing the completeness and fidelity of what’s going on. Here’s a quick diagram of how Deequ is built. You can see, up at the top, you are interacting with the data quality constraints. The constraint suggestion and constraint verification modules are the main items that will be used for creating a data quality rule set that can run any time a change in your dataset occurs, to ensure that the data is what you expect it to be.
Another two modules that we’ll look at are the profiler that Deequ offers right out of the box, as well as its analyzer tools. As you can see, Deequ wraps all of this information, runs natively on Spark, and then sends the data quality reports and those metrics directly over to Databand. Now, the question we have to ask is, “Where is the best place to integrate Deequ with our pipeline? Why do we want to do it at certain places?” As you can see here, we added these gates, essentially, at each step of the pipeline. The reason we’ve done this is that we want to make sure we target the areas in which our data is going to change, because we need to ensure that as our data changes, it changes in the way that we expect, and not only when it changes, but also when we receive it or when we deliver it.
So really it’s the interfaces that occur between the API and S3, or taking data from S3 and writing it back to S3, or moving it over to Snowflake. Those interfaces are what we need greater observability around, and that’s where we’re going to capture the information with Deequ and then put it into Databand. So we’re going to talk about Deequ a lot here, but what exactly does it mean to use Deequ? What are we actually looking at? The snapshot here on the right shows how simple it can be to insert Deequ into your system. There are really four different methods that you call, and these are going to be the different rules that are applied to your dataset. You can examine schemas and distributions, and mutual information that occurs between columns; you can even check correlation if that’s something you’re interested in; and there are things like pattern matching and whatnot. Deequ runs all of this and gathers it into those really rich metadata objects like I mentioned.
And then you have these log results that you have to evaluate to understand what is actionable. That’s what we’re using Databand for: to take those log results and turn them into something that’s actionable, by giving us a platform where we’re able to monitor that change and provide that information in an accessible manner, not only to the data engineering team, but to any data consumer within our organization. So, with that said, let’s take a second here and look at how this actually runs and what this would look like in your environment if you were to start today. I’m going to go ahead and share my entire desktop so we can switch between a couple of different screens.
And here’s the sample code that we wrote up for the pipeline today. As you can see, we’re using PyDeequ, which is the Python wrapper around the Deequ implementation in Scala. It’s a really new package, I think opened up in November, but the parity is there: all of those features are available. And even if you come across something that isn’t quite available, you can directly interact with the JVM and the Java objects through getter methods in Python. So it’s really extensible, and it’s really nice to be able to get the custom information you need. With this setup here, all we’re going to need to do is create a Spark session. And then, for us, we’re going to be reading from S3, so we’ve already instantiated the Spark session and the S3 connection, and we set up the data frame ahead of time to show a little bit of information.
So we can see here that we’ve got a data frame that is pretty wide, 60 columns, but our dataset’s not going to be huge for our purposes this time. If we take a look over at Databand, we can actually take a peek at the information that we want to look at. If we look at the create-data-frame task, we can take a look at the schema right in Databand. And here we can see the different columns as well as the data types that are coming through. What’s really interesting is that everything comes through as a string when it’s logged here. Now, this is going to be the format for the data frame that’s returned by AWS Wrangler, which we’re using to interface with Parquet, but you’ll notice that a lot of these are actually what appear to be numeric values, right?
52 week high, short ratio, outstanding float. So, [inaudible] make sure we take care of those whenever we move through the pipeline. We can also take a look at the data frame itself and get a preview of the data and some sample values. We see symbol and name; those are going to be unique to each of the tickers that come through, the companies that we’re looking at. And then we have a description, as well as some information about the exchange and the currency. And as we can see, these appear to be the numeric properties we would expect, so we’ll have to manage those in the future. At the end of this, we have both latest quarter and fiscal year end. Those are going to be the partitions that we’re going to work with in the future. So, let’s hop back over to the code example and see how we might get started once we have this data frame set up in our environment.
The first thing you’re going to want to do is work with the profiler that’s available out of the box in Deequ. The only thing this requires is to take your data frame and pass it to the ColumnProfilerRunner with a Spark context, and that’s going to provide you with a metadata object that has the summary statistics for each column that’s in your dataset. I’ve taken the opportunity to massage this data into a nicer format, and what that looks like is this right here. As we can see, each of the columns that has been passed through from the dataset gets an evaluation: it talks about completeness, how many rows actually have a value in them, and the data [inaudible] that we see here. Interestingly enough, this is a fractional; we saw earlier that these were strings, so there’s something odd with the data types that are being passed between Spark and Deequ and what’s coming in through AWS Wrangler.
So, we need to make sure we flag that for anybody who’s going to be working on this data. Additionally, you’ve got things like the mean, max, and minimum. This is going to be good information to understand over time as your data changes, to really give you the ability to get out in front of any potential data issues as soon as they come into your system. Along with the profiler, Deequ also offers what are known as analyzers for the calculation of metrics. Analyzers are objects that allow you to validate, or explore, the assumptions that you have about your dataset.
This is really a nice tool to immediately jump in and do some quick data exploration, some quick data analysis. I want to know the size of the dataset; I know this is my primary key, so let’s make sure it’s distinct, and do I get a record for every single one that comes through? Another item that’s interesting to look at is completeness, because, like I said, these are our partition columns, and we need to make sure that every record has one in order to maintain our partitions properly. The other two that are interesting for us are the compliance rule, which lets us set a custom rule that applies to a particular column (here, we’re looking to make sure that one particular value exists), and the unique value ratio, which allows us to look at the number of values across multiple columns that don’t get repeated. Another way of saying that is, if you have a compound primary key, you can check the uniqueness of that compound primary key by using this unique value ratio.
So, once we take our data frame and pass it into the analysis runner, we give it a couple of seconds here to run the Spark stages. We can extract a data frame directly from that analysis runner, and it’s as easy as running dataframe.show... oh, we ran the same thing twice, didn’t we? Let’s try this one more time. Once we’ve pulled that data frame out of the analysis runner, it’s just as easy as running dataframe.show to get a good picture of the information coming back from this analysis. So let’s take a quick second to match what we wrote to what Deequ is showing us. We can see here that the size of the data frame gets captured: we’ve got 57 records. We also see the count of the symbol column, which is our tickers, also 57.
That’s a good sign, right? We expect those to match. The other columns we see are returning a value of one, and this is an important thing to understand when it comes to the analyzers and constraints of Deequ: Deequ wants to return you a value that describes the percentage of records that match the requirement you provided. So, in this example, for the common-stock-only rule that we set up, which says that the asset type column should always equal Common Stock, we see that a hundred percent of the values actually meet that rule. That’s what we want to see. If we had records that were in violation of that rule, we would see a value like .98 or .86, and this would be the percent of records that are passing, with the opposite of that being what’s failing. So, now that we’ve looked at this from a tabular perspective, why don’t we take a peek into Databand and see what we can see over time when we track these metrics?
So, here in Databand, we’re looking at the time series analysis of these metrics as they come in for each execution. Each of these bars is going to correlate to an execution of our pipeline, and for each of those executions, we’re going to see the metric for that particular value that we want to understand as it changes. The two on the right here correlate directly to the size check that we were doing, as well as the count check on the symbol column that we were looking at. So we can see that the very last column here matches the size that we were looking at before, and on the positive side, we see that the dataset size and the symbol count matched perfectly from a graph perspective. That’s always showing us that we’re getting the data that we expect, in the format that we expect to see it. What is a little concerning are these empty spaces, where we have executions that are reading a partition that exists on S3 but not reporting any data back.
So, we’ll probably want to look into that a little bit later. But let’s take a second here to understand how our metrics that are helping us understand completeness and fidelity actually can tie into the uptime that we were talking about earlier. So if we take a look at the duration, again, this is time series of the executions over time, we see that as we move along, we suddenly have a sharp increase in the amount of execution time that occurred for our pipeline. And if you take a look at the analyzer size and the analyzer column symbol graph, there are two bars that are highlighted, which correlate with this particular execution. Now I would expect there to be a lot of data if we have a large execution time, but we’re seeing the opposite here. So, let’s take a second to look into this task and see what might be going on. At first glance, we don’t have any issues, but it looks like there is an error that occurred.
Click on that task, and if we check the details, we can actually see the error message coming directly from AWS Wrangler, which is the package we’re using to grab Parquet from S3. It’s saying that it can’t find the file in our S3 bucket. Another nice thing about Databand is that it automatically grabs the user parameters for a particular task, so we can see that it’s looking at the partition for January 31st of this year, and then looking at a fiscal year end in July. Okay, so let’s go take a look in S3 and see if we can’t understand why it wasn’t able to identify this particular partition. We can see here in the company overview that we actually do have a quarter for January 31st of this year, and we do have the partition for July, and there’s even a Parquet file as well. That’s odd. I guess we’ll have to dig into this a little bit later, but for now it’s good to know that we can see exactly what may be causing the issue, which gives us greater insight through looking at the metrics that are in Databand.
Coming back over, we can actually set up an alert on any of the items that we’re tracking in Databand. We can set them up against a static value, greater than or less than, but one thing that we particularly like is to set up anomaly alerts. What this does is allow us to look at the most recent history of a particular metric and then let us know if a value that comes in is outside of an expected range. So, as we see here, we’re looking at duration for the last 10 runs, which is usually occurring here between about 17 and 21 seconds, I guess we’ll call it. And the sensitivity slider up here allows us to choose how tight of a window we want to apply to this particular anomaly detection. If we want it to be particularly sensitive, because we need this pipeline to always execute in about 10 seconds every single time, we can set it super high. If it’s not that big of a deal, we can set it a little bit lower and give ourselves a little bit more leeway with the duration.
This can also be applied to the data set size or the symbol count that we were looking at before. If we take a look at the anomaly, we immediately see that the most recent execution is much more than the ones that we’ve seen over the past 10 runs. And we saw that immediately as well, but if we have this anomaly detection set up, we would have been immediately notified of this as soon as the execution occurred, which would have let us know as soon as the data came in, that we should investigate this and make sure it matches what we expect to see from the dataset.
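The alerting logic just described, a band around recent history whose width a sensitivity slider effectively controls, can be sketched in a few lines. This is our own toy illustration of the idea, not Databand’s actual detection algorithm:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, k=3.0):
    """Flag `latest` if it lies outside mean +/- k standard deviations of `history`.

    A smaller k means a tighter band and more alerts; a sensitivity slider
    can be thought of as shrinking or widening k.
    """
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * sigma

durations = [17, 19, 21, 18, 20, 19, 17, 21, 18, 20]  # last 10 run times, seconds
print(is_anomalous(durations, 22))  # False: still near the usual 17-21s band
print(is_anomalous(durations, 60))  # True: a sharp duration spike, so alert
```

Applied to dataset size or symbol count instead of duration, the same band would have flagged those empty executions as soon as they landed.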
Josh Benamram: I think something also useful to point out, related to our SLAs, is that having the time series measurement of all these different metrics can actually help us set those SLAs in the first place. If we see these metrics coming in and we get a sense of what the trends look like and what the outliers for those trends are, then we can come up with better expectations for what we actually want to guarantee around our data quality. So if we look at a mismatch between the size and the distinct [inaudible] on our main key, and we see that it’s typically within a one-to-one range, or maybe sometimes there’s a slight deviation, we’ll be able to quickly pick that up from these graphs, remove any outliers, and come up with a fair guarantee that we want to set as a standard for our data deliveries. So measuring these over time also gives us a sense of what kind of expectations and what bar we want to set for the standards we put on our process.
Michael Harper: Absolutely. And that’s one of the things that’s difficult when you talk about data quality to begin with: you need to measure metrics at a point in time, but that only really helps us with completeness and fidelity, like we talked about with Deequ. Tracking this as a time series allows us to understand uptime, and allows us to measure how these metrics can affect uptime as well. Ultimately, with these processes we build, we want to build in fault tolerance and make sure that we have the ability to adapt and react to any bad data that’s coming into our system before the client even needs to ask us for remediation. And while Databand is very much excited to offer observability and help any data engineer come in and understand the interfaces that exist in their platform, it’s really our ultimate goal to empower you to build out an automated process, so that you don’t have to be in front of a dashboard all the time, because you have a good grasp of the health of your pipelines at all times.
And you can talk to anybody about what may be a potential issue as soon as it comes up through an alert and things of that nature. The way that we can get there is by collecting the metrics available to us from platforms like Deequ. The reason that we really fall into the Deequ camp is because of the way that it’s been built and optimized for performance, and the way that it has a lot of out-of-the-box features that give us immediate value from the metrics that are being captured. We want to empower our users to capture those metrics and then use a monitoring system that really lets them look into that repository, not just collect it. We’ve got a couple of resources here that we want to provide: the various GitHub links, as well as the main links to our website and our GitHub repository, where you can get our open SDK. If you have any questions, feel free to reach out to us; we’re super excited to talk data all the time. But with that, I’ll pass it on over to Josh.
Josh Benamram: Thanks, Harper. Yeah, so if anyone has any questions about our use of Databand, our use of Deequ, or data SLAs in general, we’d love to hear from you. We really appreciate everyone joining the call today, and thanks very much for the time.
Josh comes from a varied background with a common thread of data obsession. He started in the finance world, working first as an analyst at a quant investment firm, then at Bessemer Venture Partners w...
After years of studying Accounting, Mathematics, and Economics, Harper stumbled into the world of Big Data and has never looked back. Most recently, Harper has led Data Engineering teams in the NLP an...