Modern data lakehouses have enabled scalable data engineering that brings together more data than ever. But many organizations are discovering that more data doesn’t mean better data. In fact, data quality and trust issues become more prevalent and harder to solve as the volume of data increases.
Enter continuous, self-service data quality, powered by Collibra Data Quality (formerly OwlDQ), which leverages Spark parallel processing across large and diverse data sources. By combining this solution with Databricks, organizations can create end-to-end high-quality data pipelines for scalable and trusted analytics and AI.
In this deep dive, you’ll learn:
Speaker 2: Today, I’d like to talk to you about the importance of a scalable, accessible and self-driving data quality program in order to bring trust to the AI and business intelligence and analytics programs that organizations often put in place to monetize their data.
We’ll start off with a quick introduction to Collibra DQ and the bigger picture of data governance that Collibra offers. But we’re going to spend most of our time talking about what it’s like to actually implement some of these concepts in order to bring value to the data that organizations like to monetize. We’ll also talk specifically about the incredible synergy between Collibra DQ and Databricks and specifically, Delta Lake.
Collibra DQ is part of a larger ecosystem that starts with an end to end data governance platform and brings data privacy and ultimately leverages catalog lineage and data quality in order to provide additional insights and improve time to trusted insights on data, that ultimately drives all of those business benefits like revenue, operational efficiency, risk mitigation.
And so, one of the major differentiators of Collibra DQ is that it addresses some of the key pain points that organizations grapple with today. The three main ones seem to be, one, there are simply too many rules to write across an enterprise collection of data. There could be tens of thousands of rules or could be more. And once you write the rules, you then have to maintain them. This often requires a large team of people to do nothing but chase the movement in the data to make sure that the rules reflect that movement and are still able to capture the anomalies that those rules are meant to detect. The second problem is that there is simply too much data. Lots of tool sets are simply not able to scale up and to address the volume of data that an organization needs to be able to cover, to have a robust data quality program.
And finally, data quality is just not accessible to the people that really understand the data and how to monetize it. And what inconsistencies in the quality of the data actually mean to the bottom line of an organization. This is often because it is actually IT that’s responsible for implementing and maintaining the program while the business analysts and decision makers that use the data often end up simply receiving the ancillary benefits, right? Hoping that the data ends up being trusted if the data quality programs put into place. It’s hard for them to play an active role, typically because of the way the tool sets are structured.
Collibra Data Quality addresses these concerns in several ways. First, the volume of the rules that need to be written. There are always going to be some rules that require human input. There are some eccentricities in the data that simply requires the internal knowledge from people within the organization. So, to apply in order to make sure that certain aspects of the data meet certain quality standards, however, there are vast amounts of rules that can simply be discovered from within the data itself, which is, is your data normal based on what it has been in the past? And the Collibra DQ is able to detect those basic baselines within the data, as it streams in and as it is brought in to the data set over time and then automatically, not only apply those rules, but also adjust the ranges that those rules are checking for as the data changes and drifts over time.
The second major differentiator for Collibra DQ and the way that it addresses the second pain point is it will simply scale up to the size of any data set as long as there is compute available. This, by the way, is one of the ways in which there’s great synergy with Databricks and Collibra DQ in that Databricks can scale up to just about anything. It’s just a matter of bringing in more compute and Collibra DQ is uniquely positioned to simply write on top of the compute power that data brings, can bring to bear and process any amount of data concurrently to be able to scale up to data quality program of any size.
And then finally, because of the user-friendly nature of DQ and the fact that it hides some of this complexity, it allows business analysts to come in and directly be involved in the data quality program itself. Setting up their quality checks or simply reviewing checks of data that went through a pipeline to be able to provide feedback to engineers or set up their own checks and then help them make the data quality program more robust, provide faster time to feedback, faster time to market and putting in place the robust data quality program that organizations need in order to be able to trust their data in important decision-making and ultimately drive those business benefits we talked about earlier. And then as you go beyond simply data quality, Collibra is able to bring to bear the other parts of data governance like metadata management with the catalog, lineage to understand where your data came from and address requirements of regulations like GDPR and so forth and so on.
So, with that out of the way, let’s actually talk about what it means to create a data quality program on top of a modern data lake house and the synergies and benefits that that brings and see how we can actually take the really technical aspects of that like the ETL and perhaps actual programming of ETL processes, but then actually leverage the power of DQ to help drive improved efficiency and quality within the data as a moves from bronze to silver to gold and ultimately drives decision making in the organization.
So, as we talked about the pain points, one of the big ones, as we imagined during our mergers and acquisitions process, a data engineer spent a whole lot of time trying to manage the quality of the data as it’s being ingested or merged between organizations. You can imagine a scenario where an analyst looks at their dashboard and says, “We know we have this machine learning program in place. And there are models that are telling me certain decisions that I should be making in certain scenarios. But I know I happen to know that this suggestion is just not right and I’ve seen too many of these today.”
And then the data scientists might say, “We haven’t deployed a new model. Nothing’s changed. I can take another look but I don’t think anything changed.” So, then we say, “Well, what happened? Is there something wrong with the data?” And the data engineer might say, “All right. Let me go back and check 300 rules to see which one broke or why,” or, “This data set has a thousand columns. Let me go figure out where the problem is actually coming from.” So, if you combine the power of Databricks and Delta Lake to ingest the data and move it through this process, of course, that’s what will to feed the machine learning program, right? The data science team is going to use that data to create models. We need to make sure that data is quality.
And so, the ideal approach is to actually apply quality checks to the data as it’s moving through the ETL pipeline. So, right as it’s being ingested and moved to bronze, silver, gold, we should be checking it in between those stages. Now, Delta Lake does a great job of doing things like enforcing schema. There’s a whole lot of value there but then beyond that, what happens when an ETL load is rejected for unacceptable schema evolution, for example. How do we actually go back and remediate that? How do we figure out, what are the things that actually went wrong in order to figure out what we need to do upstream to fix that problem, as an example, and this is what we’ll show in a demo.
So, this is the architecture of the demo. We’ll be referencing Collibra Data Quality and there will be a web application user interface that we’ll be looking at. There is any number of ways to set up that process but you can have engineers do the ETL pipeline and actually see what rules or discovered rules or anomalies DQ detected as the data was moving and then go fix it themselves. Or they can be notified even by business analysts who comes in, goes to their BI tool, or before looking in their BI tool, checks there to see if the data load was successful. Not just from a functional, right, but from an actual quality perspective. Is the data normal based on what we saw in the past? And then might say, “Hey, something broke today. I don’t like the quality score. Could you take a look at the ETL process? And here’s what I saw.”
It could be set up in any number of ways. But basically, that’s where the dashboard will be. And we will actually be looking where, in this case, the data engineer will set up a pipeline in the Databricks notebook that will kick off the ETL process. It will pull some data from an S3 bucket and it will begin to write it into the Delta Lake and then we will have various scenarios that will come up and we will use the combination of Databricks and Collibra DQ to address and ultimately, help bring the value to the business that they need from the data. So, let’s take a look at how this actually works.
So, here we have… It’s a little catalog of some existing data sets. What we’re going to do now is move directly to the Databricks notebook and what we have set up here is a simulation of data being loaded day by day. So, we’ll load New York stock exchange data and end of day positions for January 1st, 2018, second, third and so forth. And let’s see how this load goes and if we run into any issues along the way. So, using DQ within a day within an ETL pipeline, specifically with Spark, is as simple as simply taking the DQ core bits and putting them on the Databricks cluster. Just installing them as a jar. And now you have access to all of the DQ configuration parameters and then actually kicking off, you’ll see a little bit further down the DQ context that allows the logic for data quality to be applied in the same memory space and processing space that the ETL transformations are running. And so, this is maximum efficiency. You don’t have to first land the data and then quality check it. You can do it right here in the pipeline.
So, here we have some configuration set up. We’ll talk about what some of these are a little bit later, but basically, this is all it takes to set up DQ to quality check within an ETL pipeline. And then here we’re actually simulating the load. We’ll just be pulling right from this S3 bucket for the date that we’re loading. And then these are just outputs and you could see down here, we’re actually writing this out into a Delta Lake staging area. So, let’s see how this goes. We’ll start loading on January 1st, 2018.
So, now the ETL process is running and we’re going to jump over to our dashboard. See if we could spot any consistencies in the data as it picks up. And we’ll just wait for this to pop up in a second as the quality check actually ramps up and the ETL process gets going. We could see the process has now begun. Quality check has started. So, we should be seeing our data set pop up and there it is. Let’s drill into it.
So, there we go. The first day has loaded. January 1st and we could see we’ve already picked up a shape issue here. Shape issue in DQ just refers to format. So, for example, here, we’ve detected that there is a field called symbol, as in the ticker symbol. And typically, you would expect this to just be a set of characters, just alphas. But in this case, we can actually see that there is a dot in the ticker name and that is not typical. So, DQ automatically picks that up but it’s not affecting the quality score too badly just yet. Let’s let this go for now and we’ll see if we have to come back to it. Let’s just keep an eye on that for now.
Now we’ve loaded day two and day three. And it’s important to note that as it’s loading here, DQ was actually learning what is normal for the data as this is going along. So, after we move past the learning window, the model will actually be ready to start making some decisions about whether the data is normal today. And here we can see we’ve actually hit a problem here. It looks like there’s been an issue that’s big enough to fail the quality. So, before we were seeing maybe a point deducted or so, but this time there was a real problem. So, let’s pop back over to our ETL process. But before we do that, let’s just take a quick look. It looks like there’s been a major data type shift in this when we were loading this day and we can actually open one of these up and see…
Yeah. There’s typically not any movement in this data in terms of data types. And typically we see things like doubles in this field but really, we’re actually seeing some integers along with some doubles. And so, something has happened to the schema in this field, it looks like. In this load. And then if we come back over to our ETL process, sure enough, we got this analysis exception and that’s because Delta Lake actually kicked us out. It said, “I’m not going to allow this process to go through because I’ve learned that this trade date field is a timestamp.” That’s the schema that’s been applied to it. “But today it’s a string. I’m not going to allow this to go through.” So, this is extremely useful and really highlights the synergy between Collibra DQ and Delta Lake. The only thing is now we have to go remediate this data and this is where… There’s nine columns here but there could be hundreds or even thousands of columns in a data set, some really big ones, some really wide ones.
We’d have to know what to go fix. It’s unlikely that it’s just that column that broke. Right? So, now if we could actually come down and look at the preview, we could see that actually, everything is a string now, which doesn’t sound right based on what Delta Lake is telling us. So, something clearly went wrong with this ETL process and you can actually see DQ pick that up just by learning. I haven’t added any rules. You could see. I’ll show the rule engine in a bit but there are no rules added on this dataset, no manual rules. This is all just being driven by what DQ has learned is normal for this data set. So, let’s go back to our ETL process to see if we could fix that problem.
And actually, if we take a closer look here, we could see that somebody has turned off the first schema because we’re pulling this out of a raw bucket. It’s just CSV files. And if we turn off schema and firm it, then we’re just going to get strings. That seems to be our whole no problem here. So, let’s just get rid of that. And you know what? While we’re at it, remember that shape issue that we saw? Let’s actually fix that too. So, here, we’re actually going to address that shape issue and fill it in with characters because there shouldn’t be a dot. So, let’s fix that one as well. And then I don’t need to rerun the first three days. Those went pretty well. So, let’s just rerun as of the day that went badly. So, we’ve started back up. Let’s pop back over to our dashboard.
You could see January 4th has disappeared and it’s being reloaded now. There it is popping back up. If I just double click to refresh here, you could see the problem is gone. So, we’ve identified it. Delta Lake saved us from ourselves, in that case. And we were able to identify the problem using DQ you and fix it.
So, now we’ve loaded day five and now we’ve hit another problem. And let’s see what this is all about. So, we’ll just come in and let’s stop the process. Let’s see what happened here. So, we drew in on January 5th. DQ is identifying… Though this time, Delta Lake let us go through. The load went. We wrote into the silver staging area. But if we look, it looks like there’s a column that’s entirely null. DQ is detecting that this column typically has zero nulls in it at all.
Today, it’s a hundred percent null. Now, it’s not automatically bad to have nulls. Some columns have a normal amount of, or even a range of null values in terms of percentage of the whole that exist. And that might be okay. But this particular column, DQ actually has learned that it should have no nulls in it at all. Today, it’s a hundred percent. We can see it here in the preview as well. Something clearly went wrong. Let’s go see if we can identify the issue this time. If we scroll down and look through our ETL process, we can actually see that there’s a box transformation here.
The whole column is being sentinel. So, let’s fix that ETL code. And the fourth was loaded okay this time. So, let’s just only load the fifth this time. And let’s see if that fixes the issue for us. So, we kicked the ETL process off again. So, this time it looks like everything’s moving again. And there’s the fifth and this time it looks like it’s loading okay. So, this is going to run through and we’re skipping over a weekend here. Now we’ve loaded the ninth and it looks like there’s a major issue on the ninth. Something really went wrong here because you could see it’s zero. Something is really bad. So, let’s stop this process again and see if we can fix the problem.
So, if we drill in on the ninth, we could see everything’s wrong. Pretty much just about every column has an issue. But if we take a look, there’s no nulls or anything but this time, we’re getting cardinality rules being broken. Again, rules that were not written by anyone. DQ just learned that typically this column should have a cardinality of 2,417, 2,404 and so forth. But today, for high, for example, the cardinality is 326. This is clearly not right. It’s way outside what you would expect. And so, we have major score deductions for that. So, if we look closer though, we see that we’re also seeing a major drop in row count.
And DQ is identifying that as the root cause of the rest of these. It’s actually pointing it out to us. So, if we go back to our ETL pipeline, it looks like… Now I’m representing this as a sample but you can imagine how there could be partial data sent on that day. The complete set of data that normally comes in was not sent. There’s a major drop in the number of rows that should have come in and that does trigger an alert. So, let’s just imagine we you spoke to our team that sends us the data from our stream and we got it fixed. So, now we have the complete data set. So, we’ll just take that out. We would expect the load to go. Now, as you could probably imagine, when I hit go here, this will clear. You could see the 10th started to load and it’s clean right now and no rules fired. There wasn’t anything that we saw in the data that are learned rules.
But what if there are other types of anomalies that are more esoteric than what you could typically write a rule for? Like categorical outliers or dupes or something along those lines. So, DQ has additional capability to use machine learning, statistical techniques to identify anomalies like that as well. So, let’s turn that on as well while we fix, what we think we fixed, the major issue that we had. But now let’s see if we can detect some of the smaller anomalies in the data. So, we will actually just come here and you can see some of these more advanced features. We will turn on outlier detection and we’ll segment by symbol and make sure that it came in and it’s being listed on the right exchange in our data. And then we’ll also turn on duplicate detection. Since this is end of day data, we should really only have a single entry per symbol.
All right. So, now that we’ve set that and let’s kick this off one more time. And actually before we do that, we don’t need to rerun the days that went okay. The eighth went okay. The fifth okay. So, now we’ll just reload the ninth and load the 10th. So, the ETL process has been kicked off again. Now that we’ve turned on some of the more advanced features and fixed the major anomaly problem with the data. Jump back in. We could see that the ninth has now loaded and cleared. And I would see if there’s anything new that we can pick up on the 10th. Just take a second to run. Just check back into our process here. We can see it’s still running. We’re doing some slightly more advanced processing now and using some pretty advanced statistical techniques. And there we go.
So, the process is finished at this point. The 10th, let’s drill into that, has loaded successfully. We don’t have any major issues or rules firing here. We don’t see anything big breaking or anything like that. Everything seems to be passing for the rules that we’ve learned. There’s no new rules that have been added. So, everything looks okay but it looks like we found some esoteric issues. So, first we found a categorical outlier. If we drill into it, you can actually see that the symbol AAN, typically shows up the NYC exchange. But today, you see we’re detecting as an outlier, it’s showing up on the NASDAQ exchange. And based on history, DQ was actually looking back. So, we’re loading the current day and also looking back five days and identifying that based on that history, this symbol should not be listed on the NASDAQ exchange.
And we’re raising that as an outlier. Now, our scoring, we can actually change the value of the bounce score that we assign to this type of issue. It’s not really killing our data set here but this is something that a business analyst might come in and say, “Maybe it’s not worth reloading the whole thing but this is going to do me no good if I put this in a report. If I start trying to make decisions but indices or something like that… This is no good. And also, you found some dupes.” Now, this is position data. So, in a given day, there should only be one entry per symbol. So, high, low, open, close and so forth. But here, they’re saying we found three instances of the symbol in this day.
And again, we’re not assigning a huge score to it. So, we didn’t fail the run but this is something that could be an issue as it could skew aggregates downstream. Particularly, on aggregate reports. So, maybe this is something that we need to address. And this is where Delta can help us, once again. Delta Lake actually has the ability to do updates and deletes directly on data that’s already been written. And so, that would be a huge help to maybe avoid having to reload the whole thing and maybe just change the data that’s already been loaded and then just fix it upstream for the next day of loading. Yet, another place where there’s great synergy between the Delta Lake, Databricks and Collibra DQ.
So, that’s just an overview of how an organization can implement a scalable, self-driving and accessible data quality program that can greatly increase the effectiveness and value by adding additional trust and peace of mind to decision makers for the data that they use to make major decisions for the business. This is the key differentiator that Collibra brings to the table along with an entire suite of products that improves time to insight and complete peace of mind that you can trust the insight that you’re seeing.
If you like what you saw today, contact us. We would love to do a data assessment. We could schedule that. You can also request a live demo and talk about the specific use case that you’re interested in. We’re really excited to hear from you and thank you very much.
Vadim is a Principal Architect at Collibra focusing on autonomous data quality and cloud native architecture. He previously led Research and Development at OwlDQ, responsible for enabling data quality...