Building Data Quality Audit Framework using Delta Lake at Cerner

Download Slides

Cerner needs to know what assets it owns, where they are located, and the status of those assets. A configuration management system is an inventory of IT assets and IT things like servers, network devices, storage arrays, and software licenses. There was a need to bring all the data sources into one place so that Cerner has a single source of truth for configuration. This gave birth to a data platform called Beacon. Bad data quality has a significant business costs in time, effort and accuracy. Poor-quality data is often pegged as the source of operational snafus, inaccurate analytics and ill-conceived business strategies. In our case since configuration data is largely used in making decisions about security, incident management, cost analysis etc it caused downstream impact due to gaps in data. To handle data quality issues, Databricks and Delta Lake was introduced at the helm of the data pipeline architecture. In this talk we’ll describe the journey behind building an end to end pipeline conformed to CI/CD standards of the industry from data ingestion, processing, reporting to machine learning and how Delta Lake plays a vital role in not only catching data issues but make it scalable and re-usable for other teams. We’ll talk about the challenges faced in between and lessons learned from it.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hello everyone. My name is Madhav Agni and welcome to our Spark Summit 2020. I work for Cerner company and today I’m gonna present on building data quality audit framework using Delta at Cerner. So a little background about me. So I started my career in traditional data warehousing, and data mining and reporting, and then moved on to Big Data, Hadoop and MapReduce. And then, currently, I’m working in the cloud, building data pipelines end to end. So I’m gonna start with introducing what Beacon is. Beacon is my team and also the data platform. Then I’m gonna talk about Beacon’s architecture, and then I’m gonna switch gears and talk about data quality, need for data quality audit. And then, talk about Delta Lake, some of its features and how that fits into everything. And lastly, I’m gonna present a demo with showcasing the framework from end to end.

Cerner today

So Cerner is a healthcare IT company. So we sit at the intersection of health and IT, and it’s in that place that we use emerging technology to engage individuals in their own health. Well, I mean, clinicians with data lake empowers them to deliver smarter care instantly. So we support our clients by providing data-driven insights that enable them to make better decisions for better healthcare management. So Cerner is spread across the entire globe in more than 35 countries. And then we have 6,000 and more hospitals that use Cerner’s systems. We have traditional EHR systems, cloud-based EHR systems, Radiology Information Systems, et cetera.

So what is Beacon? Beacon is an inventory of configuration items and assets. So when I say configuration item, I’m talking about a physical server or virtual machine, network device, a storage array, a laptop, as well as the power resources. So it doesn’t matter if it’s AWS, Azure, our own private cloud, our resources, we are responsible for maintaining the data for all of these, so we capture the configuration item itself as well as its relationships and attributes. So this configuration management system is what is called Beacon.

So why do we need Beacon? So I don’t think we have multiple tools at Cerner with asset and configuration management. So those are all disparate pools managed by different teams. So it makes it very hard to leverage the information and Beacon brings it all together under management of one team, are responsible, it mainly lies in data and it’s data quality. So we have different use cases that are needed in order to function Beacon. One of the main thing is security. So to protect Cerner, we need to protect the perimeter. So to do that, we need to know where the perimeter is. So for example, next, any day, if there’s a virus that affects Windows Server, the first thing that you want to know is how many Windows Servers there are. And then the second thing would be kind of attributes and data related to those servers. So to do that, we need to have all that information in one place. The next thing is costing and licensing. So the doubt, basically, doing those estimates for licensing, and we need to have that data in one place so that people can make decisions on top of that. The most important part is operations. So now they deliver the operations in configuration management, incident management, service management, this data is used.

So this is Beacon’s technical architecture. So Beacon has about 25 different data sources, and they are presented at the top. So we use a push and a pull model. So in push, we push the data from on-prem into the cloud using open-source tool called as Rundeck. And in the cloud, we use orchestration service called as Data Factory. So Data Factory has multiple connectors. For example, if you wanna connect to like an Oracle DB, SQL Server, FTP Server, we mainly need to set up that connection in Data Factory, and it will start pulling that data. So once we gather all the data, we store in Data Lake Store, which is in Microsoft Azure. Or when we want to process the data, we use Databricks as well as MLflow for our machine learning use cases. And along with that, we transfer some of that data into a SQL DB as well for some traditional reporting tools. And then we have Power BI which sits on top of Data Lake and Databricks that we can directly use to report data out of it. And all of this is controlled by a central CI/CD pipeline, where we, at the commit of a button, we push our code into cloud directly.

What is Data Quality

So what is data quality? And where does data stand among all of these things? So you can regard data as the foundation for a hierarchy where data is at the bottom level. On top of data, you have information being data in context. Further up, we have knowledge that is seen as actionable information. And on top level, there is wisdom. So as you can see, if you have bad data quality, you will not have good information quality. And with bad information quality, you will lack the actionable knowledge that you require in business operations, and so on. Usually, it’s not hard to get everyone in accordance and agree that we need some data quality framework to solve all these problems. And, really, that’s where you need to showcase the problem that lies in data quality capturing.

Why Data Quality is Important?

So some of the reasons why data quality is important is because, without that employees are not gonna work with their BI applications because they will not trust the data. And also, internet data leads to false facts and bad decisions. It is very important that we reduce the data silos that are present in companies because of multiple tools having data in different places. That was one of the most important reason why Beacon was born. Also, for stricter compliance requirements like GDPR, you need data quality. Otherwise, there will be multiple issues.

So this represents a perfect illustration of the problems in data. And many companies are still afraid of data optimization projects, such as data quality and master data management initiatives, the organizational procedure and technological adjustment that have to be made, and seem too complex and incalculable. On other hands, companies that successfully render data quality and master data management initiatives achieve their success because they simply took the decision to launch their initiative and then evolve it step by step. Don’t forget data is now the most valuable resource we have in the world.

So let’s switch gears and talk about Delta Lake. So if you don’t know already, Delta Lake is an open-source storage layer that brings ACID transactions in Apache Spark and big data workloads. So it works seamlessly with your streaming as well as your best use cases, because you don’t need that Lambda framework in order to split your data workloads. It all works together under one platform. And then you can also bring your own data lake to it. So for example, if you have it in Azure, or AWS, or traditional Hadoop, it works everything in accordance.

So some of the features of Delta Lake, the main one being ACID transactions, but I’m gonna focus on some other features like time travel. So it’s basically a data versioning system where you can go back in time and figure out what the data looked like, basically a snapshot of data. So I’m gonna talk a bit about it in my demo later. The other one is audit history. So similar to time travel, you can pretty much see the full trail of audit on all of your data sets that you have enabled Delta for. And the main thing I’m gonna talk about are your updates and deletes. So, you know, when you do traditional warehousing, you can use the mode-statements to do those upsurges, and Delta Lake makes it easy to make those mode-statements happen.

So this is the overall audit framework that I’m gonna talk about. So first thing we have is the Azure Data Factory. So that is the orchestration service that we use to do all the data flows from end to end. So we use Data Factory to pick off the Databricks notebook, which is in this case, will be a Delta Lake notebook. It will do the audit that we set up in the notebook, and then all of the cables that will get refreshed will be used in Power BI for some after-the-fact reporting. And based on that, we have set up alerts on some of the audits so that we don’t actually have to go into reporting to look at the data issues, but the alerting system would alert us on when an audit would fail based on the parameters that you set up. So, one thing that is important is that you can switch other tools that you use along with Delta Lake. Making sure that the integration works fine among those, it’s not like you have to use this exact same workflow in your company.

Impact of the framework

So what was the impact of this data quality audit framework? We saw that we, there was an increase in pre-prod defects, when I’m saying pre-prod that’s, those are the defects that your dev teams log before you go into production. So you’ll want to make sure that that ratio of pre-prod defect is much higher. Also, SIE test executions of developers right at a site as plants produce some data validations. So instead of writing those individual test cases, they were able to leverage this framework, so that made them spend less time in writing new test cases every time, but just focus on enhancing this audit framework. One more advantage was that every time we used to do a new data audit before, that involved a code change. So a code change, meaning we had to submit a change request every time we do a release. But after moving to this updated framework, it was just a configuration change for us to add a new audit every time. So that caused a reduction in the change request as well. The main thing was we did not have to depend upon, you know, my stakeholders or data analyst to find defects in data. We were able to find them ourselves before it reaches to the stakeholders. So that made that our overall Prod defects had reduced by almost 25%. So data rollbacks is something that was very new to us. A lot of times when we were sending data to external system, there used to be data issues. So we used to do code rollbacks, but that was not a good way of doing data rollbacks. So I’m gonna talk about it in my demo as well, but it was the ability of just going back in time and sending the good data back as a rollback. So with that, I’m gonna jump into my demo. I’ll talk about this entire framework from end to end.

Let’s start with Azure Data Factory. So we use Data Factory as our orchestration service to organize all our data flows and its dependencies. So that is from data ingestion, data processing, running ML way of proofs. We use Data Factory to orchestrate everything. And for this demo, we’ll use an example of a computer system dataset. So the use case over here is that, internally, after our processing is done, we generate a computer system data set and create a Delta way, comparing it to an external system. And then, we use audit to capture any data issues related to that Delta, and if the data Delta looks good, we send that Delta to the external system using an API. So we’re here, I’m gonna talk about one of these pipelines over here. So if I double click, I see that I have three notebooks over here. And then the audit notebook is the one that gets kicked off after my staging table sketch generated. So I’m gonna talk about the audit data book in this tab over here. So this is Azure Databricks, or internal workmate. So we’re using Delta, I don’t need to do anything specific, it is included in the run task itself. So we’re here, I’m just trying to import all the libraries that I need for downstream processing. Creating the data in partitions. And this method is to connect to Data Lake Store. So we use Data Lake Store to store all our data in terms of files of different types.

And then for this audit demo, I have audit query file data storing Data Lake Store. So for this, I have created a method to connect to Delta Lake store, so it will use my credentials to connect to Delta Lake store, generate a token, and then connect to it. The other method is to collect to our SQL DB. So, as I mentioned, I need my staging tables. That will generate credit from the pipelines, and try to find data issues in those. So I’m connecting to that SQL DB using this method, and then I’m defining a couple of more methods. I want to read the table and then create a temporary view out of that table. This is my main function. I’m passing all my credentials and my table names that I need, and then it will try to connect to DB and read all the tables, cache those, and then create data views on top of that. So this is the main part. So the first step I’m doing is, I need to create a dimension table to capture my audit and that dimension table, I am creating using Delta. So this is create table if not exists. I’m specifying all the attributes that I need. And then, for using Delta, I’m just specifying the Delta keyword over here, and then I’m partitioning my table by year, month, and day.

Similar to this, I’m also creating a fact table to capture some of the counts. So after all of this is done, I’ll also try to do some trend analysis in Power BI. So for that purpose, I’m creating a fact table using a count, and then this is also a Delta table that is partitioned by year, month, and day. And then, for all the issues that we’re trying to capture, I’m creating a specific table for that. And that, what that will host is just the main attributes that I need, and what is the data issue that it belongs to. That is also a Delta table over here, and then I generate this table every day.

So the main part of, main advantage of having Delta table is you can run these merge commands. So for the people who have used SQL store procedures before, these commands would seem similar and you can, with the same syntax that we already know of. So it’s merged into your target table, which is in this case, my dimension table using my source table, which is the sys table, and then I’m defining my key for my upsurge. So in my case, I’m using, two ID columns that form the key. And then I’m also using, since everything is date-partitioned, I’m using that date partition as part of the key. And what I’m doing is I’m saying if it matches my target table, then just update my timestamp. Otherwise, insert those rows into my target tables. I’m inserting all my attributes over here and specifying the values. So this is a standard syntax with Delta. You can specify multiple conditions over here. I’m just using a standard of set condition. You can also do deletes, if you need to, in my case, I don’t have a use case for it, so I’m just using the absurd method.

So once I do my absurd, I’m gonna come back to this command a bit later. So, what, the main thing I need is once I have all the data in my dimension tables, I need to generate audits. So when I say audit, it can be a data analysis audit. So when you know something is wrong, you try to create a query for that and then insert that query into my audit table. I’m going to also show an example where I can run an ML model on top of my Delta notebook and the audit that it captures, I’m gonna show is part of my audit query as well. So to make this reusable and having not to do any code change in the future, I’m storing all my queries in Data Lake Store as a CSV file. So this way, whenever I want to add a new audit, I don’t have to modify this code to generate that audit. I simply have to do a config change, edit that, add a new query and save that file. So what this does is it will point to this part and then it will create a Pandas dataframe, convert the query file into a list. And then it will, in the for loop, it will run those queries one by one. So to know what it looks like, I’m gonna jump into Data Lake Store. So this is the path it was referring to. So this is just nothing but a CSV file with all the queries that I need in my audit framework. So I’ve downloaded that file. So it kinda looks like this, so instead of running insert queries into the notebook, I’m just creating a text file. And then this will capture anything new that you add, and that will automatically be added as part of my audit framework.

You can define the type of audit based on your team’s names. So in this example, I’m trying to capture some missing information in my data. So for my use case, if my data has any of these three attributes missing, this will fill my API call. So what I’m saying is I’m selecting those rows where either of three, these three columns are null, and I’m inserting that into my audit table. Similarly, another common type of audit is duplicate. So in this case, the Delta that is generated, I’m trying to find if any of those rows have duplicates in those, so I’m trying to run this query based on the ID, where on the naturally key, it does not match. I’m going to flag it as a duplicate entry. So once I have those duplicate entries, I’m inserting those into my audit table. So these are good examples when you know that something might be wrong with the data. There are other examples you might not know if something is wrong with your data. So a classic case is trying to find anomalies, different types of anomalies in your data. So in my case, I’m trying to find out if all the serial numbers that I have in my data sets are correct. So for this, I’m gonna use ML processing to calculate z-score for my serial number attribute. So this is part of MLflow framework instead of Databricks. So for this, I don’t have to do anything separate, but MLflow is included in Databricks. So I will generate those, generate that ML analysis and then create an audit on top of that to find out what are the anomalous serial numbers in my dataset. So for this, I’ll jump back for Databricks. And then, so this is the command that is doing some pre-processing and then in this scribe, calculate z-score for one of the attributes. So once that z-score is calculated, I’m essentially saying that, for all the records that have a z-score of greater than, it need to tell us something that we have defined. I’ll capture those and then create a temporary view. And then I’m gonna use that temporary view in my audit table over here saying, for all those anomalous records that you find insert that into my audit table.

One other important thing that I talked about during my presentation was that the ability to sort of go back in time using the time travel option in new rubrics Delta, and then doing a data rollback. So what I mean by that is that using the history command for a Delta table, I can figure out the entire history of that table since inception to the current state. And it will tell me the version number, the timestamp when it was created and what was the operation. So what I can do is if I need, I can go back in time and then reload the data as off that particular version. So for example, one time we encountered data issues

when we sent our Delta into the external target system. So we wanted to essentially do a rollback. Now, you can do a rollback easily for code but not data. If you’re trying to do a rollback for data, it’s mostly a manual operation where you’re trying to figure out what all things you inserted and updated, and either do a delete of all those operations or manually create queries to sort of undo all those operations. Now, the great thing about Delta is that I can just select all the records that I need and specify a previous version. So in this case, I’m saying pull all the records as of version 140. Now, what this will do is, it will look at your version

and then basically store all that data into this query,

which then you can use to load into your external system. Once all my data audits have been populated, I’m going to use Power BI to see and report on that data. So I’m using Power BI Spark JDBC connector to connect to my caster, and then specified the then table that I created just now, and then load all that data and refresh it into a Power BI dashboard. So this is a good way for our data analyst or even your team to look at all the audits that you created in the previous step and do some reporting on top of this. And also, you can send alerts based on this. So if I want to alert my team in Microsoft Teams or any other platform, you can do that using a Power BI refresh. And then this is kind of what it looks like, in that it will send an alert whenever there is audit running and it will capture counts, or if you wanna do a separate type of audit alert, you can do that as well. My main thoughts are, so you don’t have to use the same services that I used in this demo. If Databricks is your central processing framework, you can use the frameworks that are part of your X stack. If you are working out of AWS, you can use S3 in place of Data Lake Store and use their orchestration service, and it will work fine with all those services as well. That concludes my demo. Thank you for your time. Let me know if there’re any questions.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Madhav Agni


Madhav Agni is the Lead Software Engineer on Cerner's configuration management team working with technologies such as Data Lake Store, Databricks,Kafka, Power BI. Madhav's primary area of focus is technical and data architecture and leading teams in US and India in building the data platform. He also serves as a key member on Architecture governance board inside the organization. Madhav previously worked as an ETL lead at USAA specializing in data warehousing using technologies such as Datastage and Peoplesoft.