Automated Metadata Management in Data Lake – A CI/CD Driven Approach

May 26, 2021 05:00 PM (PT)

Download Slides

We as data engineers are aware of trade off’s between development speed, metadata governance and schema evolution (or restriction) in rapidly evolving organization. Our day to day activities involve adding/removing/updating tables, protecting PII Information, curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is equal increase in our downstream consumers. Struggle is to maintain balance between quickly promoting metadata changes with robust validation for downstream systems stability. In relational world DDL, DML changes can be managed through numerous options available for every kind of database from the vendor or 3rd party. As engineers we developed a tool which uses centralized git managed repository of data schemas in yml structure with ci/cd capabilities which maintains stability of our data lake and downstream systems.

In this presentation Northwestern Mutual Engineers, will discuss how they designed and developed new end-to-end ci/cd driven metadata management tool to make introduction of new tables/views, managing access requests etc in a more robust, maintainable and scalable way, all with only checking in yml files. This tool can be used by people who have no or minimal knowledge of spark.   

Key focus will be: 

  • Need for metadata management tool in a data lake
  • Architecture and Design of the tool 
  • Maintaining information on databases/tables/views like schema, owner, PII, description etc in simple to understand yml structure 
  • Live demo of creating a new table with CI/CD promotion to production
In this session watch:
Josh Reilly, Developer, Northwestern Mutual
Keyuri Shah, Lead Software Engineer, Northwestern Mutual



Keyuri Shah: Hello, everyone. I hope you all are having a good time at the summit. Welcome to our talk: Automated Metadata Management in Data Lake, a CI/CD Driven Approach. In this talk will show you how easily and effectively you can manage and maintain schema in your own data lake environment. I am Keyuri Shah, lead engineer at Northwestern Mutual. I have been with NM for the past 10 years, and part of me is innovation projects. My most recent one, was to create a big data platform for our field reporting space. I am accompanied with my colleague, Josh here.

Josh Reilly: Hi, I’m Josh Riley. I am also a lead engineer at Northwestern Mutual. I’ve worked for the company for 9 years. My background is mostly in web development. However, I recently had the opportunity to work on an exciting new project using Spark.

Keyuri Shah: Thank you Josh. Before we get through the agenda here, let’s do a quick introduction on who we are and our legacy. Northwestern Mutual is A 160 + year old fortune 102 company that provides financial security to more than 4.6 million people. We deliver this through a holistic approach that brings together a trusted advisor and a personalized financial plan that includes insurance and investments powered by technology, to drive a seamless digital experience for our clients. Technology accelerates our business and company strategy to deliver this digital experience to our clients. It also helps our 10,000 + financial representatives throughout the country to better serve our clients. We have been on a digital transformation journey for several years now, and the pace is accelerating more rapidly than ever before to meet the needs of our consumers today. We have recently gone through a migration effort to transform our BI space to a big data platform, which uses Databricks as ELT tool and data tables as our storage for the data lake.
There is a talk during the summit, titled: Northwestern Mutual’s Journey-Transform BI Space to Cloud, by myself and Madhu Kotian. If interested, I would recommend attending that. Well, coming back to our agenda here, we did go through a quick introduction. We will now go through why need data management, do an architecture review of our tool, and then Josh will deep dive into the demo, which is pretty exciting. So let’s start here.
Need for metadata management. Let’s start by asking two basic questions. What is metadata and why do we need to manage it? Well, if you are a user or a developer or an analyst of any traditional database, like MySQL, SQL Server, et cetera, you will be aware of impacts that can occur if there are changes any kind of metadata. And when I say metadata, I basically mean table names, database names, column names, real names, et cetera, et cetera, you all know about it.
Sometimes the impact is as simple as I just need to update my query to accommodate the changes, or it becomes as big as bringing all the downstream systems down with multi-vent events across the organization. And hence we need a way to govern our changes, especially around metadata, for our tables, columns, databases, et cetera in our data space. That’s pretty much metadata management. To manage the metadata, to govern it, to simplify it, there are different flavors of managing metadata. It may sometimes vary from organization to organization, and it may also vary from different teams in the organization based on what capacity you need to do it. Sometimes it may mean you just manage through emails to the whole group. You just send an email saying, “Hey, I’m updating this column name”. Or you may have to rely on more automated systems depending on how your team, it may or may not have impacted the downstream systems.
Well, there is one part to it, but another part is taking approvals. So, while informing other teams about changes is one part, taking approvals is another. For some, it may not be approval from your team lead, but for others, it may mean approval from different downstream systems, team leads you may be working with. While you are working on governance and approvals, making metadata changes to protection systems, needs to be more automated. In order to protect your production data lake, you would want all the changes going in through an automated process, hence eliminating need for manual intervention. This makes their systems more reliable. It also builds a more robust, mature and reputable process.
Another important thing, is usually overlooked, is easy to maintain. While you are building this process that can help you with governance that can help you with the approvals. It needs to be easy to use and easy to maintain. It shouldn’t take you a long time or maybe more time than you actually do to develop, to maintain your metadata. In that case, it will not be widely adopted and adoption is the key here. So always keep in mind something like this. When you want to make it more widely usable, it needs to be easy to maintain.
Now, what would be the different use cases that you would want? Something like metadata management at the more automated and [inaudible] level. But one obvious case is if you have an enterprise-wide data lake or a data warehouse, you would want to manage your data because there are so many downstream systems, a lot of people using your data, it just becomes more important to govern your metadata, to protect your metadata. The more number of people or teams you have using your data, it’s just more critical it becomes and transparent about the changes.
Now let’s talk about what we have built here at Northwestern Mutual, for a field reporting team in terms of metadata management. We have a lot of downstream systems and hence it was important to put some governance and approval in place. I briefly touched last time on, you need it to be easy to maintain and easy to use. Hence, we have adopted the philosophy of making a config driven approach. Well, it’s basically a YML file, which obviously you can make it or create it very easily. It’s easy to read, even for people who are not developers. Business analysts, product managers, anybody can come and read these files. And hence, we want it to be easy, so it can be widely available. That’s the whole philosophy this tool is based on.
In coming slides we will deep dive into what config file means and different options on the config files. You could use this tool to manage your schema at the table level. So let’s say you have ID field, that’s an integer, name, that’s a string, payment, that’s a decimal. All of those would be managed at the schema level file. So any changes that you may need to do, is basically updating the file, checking in and the tools will automatically create to make the updates available to your data breaks or the data environment. It is also used to do access management. Now, what I really mean by access management is, like I said, we have all these different things, accessing our data, accessing our databases, different views. We want it easy to say, well, we have a new team coming in now, who gets access, how do they request for this access?
Well at this point in time, it means as simple as, you make the config changes, you add your AD group and then the tools goes in and there’s all different kinds of grants to give you access to all the tables and databases that you need to. In addition to that, what we also do is our PII data management. Security is the first [inaudible] at Northwestern Mutual. We think security first. We need to protect that PII PHI information. And again, because it’s so important, it needs to be easy to be identified and inputted. And hence, what we have done is, in our metadata tool, whenever anybody is creating a table, they would identify what’s PII and PHI information, and the tool will go ahead and identify those at the table properties levels. And then again, access management, we would say, well, who has access to PII information? And then the tool goes in and creates the grants accordingly. So they will have appropriate knowledge about it.
Now, it’s not just important from creation point of view, it’s also important when we want to, after a year down the line, we want to know who has access to our PII information. At that point in time, it means you look at the config files and from there on you can see, okay, these users have access to PII information.
So that’s all about our metadata tool. Now from development stack perspective, we use Python as a core programming language and use Git-ci as our CI tool. Our data lake is mostly on the Delta tables here, due to all the different benefits it gives. Hence for our talk here, we are going to limit this to managing our metadata to the Delta tables.
So let’s talk about our design here, on how our metadata tool is created. Over to the left on the CI/CD box, if you guys look at that, there are two steps to our process here. One is the plan job and one is the applied job. If you guys use Giro form or something like that, you would already be aware of what something like this is doing. But in a nutshell, plan job will tell you what it is it will be doing when you apply a job. So it’s not like it will always go in and make the changes. Let’s say you are creating the table. It will spit out, create table scripts. If you are updating a column, it will spit out update table columns, and once you verify that everything looks good, you can then go ahead and hit the apply button and that will actually move all the things to the Databricks environment.
So our step one here over to the top, going to the S3 bucket, as soon as you make changes to the template file, and I think this will be more clear when Josh goes through the demo here, but our step one is to upload the config file to our S3. We use S3 as our storage for all of our config files. Our step two here, over in the middle, is, runs and wait for the Jobs via Job API. So we use Jobs API to do lot of our analysis here. It goes into our metadata framework job, which is on the right. And our step three here, is it reads the plan output from the plan job via the JDBC using Databricks Delta. So that’s mostly how our plan job works.
Now on the apply job, you basically take in the Job API via the Jobs API. It will actually apply the changes that it has already built using the plan job. Over to the right bottom, on the metadata framework job, the job has two stages like we talked about. A plan stage and apply stage. We have different planning modules in a metadata framework. A Database planner, a table planner, and a view planner, that’s basically all the different options that we provide right now through our metadata framework.
Moving on, we talked a lot about what are different configuration files. We talked about how this whole framework is driven by the configuration files. Let’s deep dive, what does these config file means? So starting from the database, as I said, these are all easy to read YML files. So I think from the left here, at the database level, what information would you give when you are checking in the database file? You would start with the name, owning team, a description, what kind of access, like I talked about different AD groups, who has PII versus non PII, type, which is data. And then coming to your middle here, what are different table options here? So we would start with, what is the name of the table? What is the description of the table, provide the schema for the table, what are different encrypted columns versus masked columns? And the way it’s organized is, in the database, you have all the files for the tables and the views so it kind of knows that this database I’m supposed to create all these tables.
Now what does encrypted columns and masked columns really turn into? Well, they are basically table properties and our ELT job then uses these table properties to encrypt all of this column into a column level encryption. The reason for me saying this is, you can make it to use whatever you want or whatever makes sense for your organization. Maybe it makes sense to put owner in, maybe it makes sense to put team in. It can be any kind of key value pair. So the limit is the sky, the sky is the limit here. You can use it with however configuration you want at the end, [inaudible] as a key value pair in a table properties.
But pretty similar, what does it mean on the views? You have the name in, you have the database in, you have description and you have a query in. These are the config files, the major config files and sort of like a 10,000 field view. We have a lot of different intricate details here, but like I said, it can be configured to what your organization needs. I think we talked enough on how it would work. So we’ll do a quick demo. We will create a database, a table, and a few config file, check in to Git, we’ll run the CI/CD pipeline and we’ll see, how does the plan look like, and we would apply it to our int environment, verify the database table and view in Databricks And then we would actually also update the schema and show you guys how that all works. So Josh, over to you.

Josh Reilly: Thanks, Keyuri. I’m going to walk us through what this looks like from the developer’s perspective. This is the repository where we keep all of our config for this framework. Inside the config folder, we have a folder for every database in our data lake. We have the stats database. This database has the table that the metadata framework uses to store its plan. I’ll open up the database config for that. And here’s the metadata plan table. For our demo, we will be creating an AI data summit database along with a couple of tables and a view. So that’ll be its own folder. For the database config, we’ll have two databases, one to hold the tables and the other to hold the views and to have access granted. Now we’ll pop over to Databricks. Our Databricks environment does not have the AI data summit databases yet. Let’s see, it’s mostly empty.
We also have a table for attendees, and if we had columns that needed encryption, you could add them here. And we also have a table for organization, and we also have a custom view for all attendees. The schema for these tables, as well as the query behind the view are kept in the same folder. So you can see schema attendees, schema organization. I’ve committed these config files already, so the next step is to run our CI/CD. And our CI/CD has a job for each stage of the framework, plan and apply. Give it a second while it’s tuning up.
Okay, let’s run the plan. The plan job has a bunch of steps, including flip installs. So I’ll fast forward through that. All right, here it is. Merging our config files together and uploading them to S3. And then at the bottom of the screen, we have triggered the framework to run through the web API on the plan stage. And right now we’re just waiting for it to finish. While the job is running, we’re pulling the web API for status changes on the job. Okay. So the CI/CD has printed out the status of the job as well as the plan for the changes that the framework will make during the apply stage. It read these out of the table, using JWC.
It will create an AI data summit database. It will create a folder and initialize the attendees table. It will create a folder and initialize the organization table. It will create the AI data summit view database. It will create our custom view for all attendees and it will create a view for each of the tables. Remember that we wanted to grant permissions to the view database rather than the database with the tables. So we have customized our CI/CD and merge config step so that it publishes a config that has all of the tables generated as views for us.
Now that we’ve reviewed the plan and it looks okay, we can go ahead and run the apply job. This job needs to do a PIP install as well. So I’ll fast forward through that again. Here it is triggering the framework for the web API apply stage. The CI/CD printed out the job ran successfully so let’s go take a look in Databricks. We now have our two databases that we expected. The AI data summit database has the two tables that we wanted and the AI data summit view database has the three views that we wanted.
Now let’s take a look at an update scenario. To update the schema on one of our tables, we can edit the schema file. So here I’ll open the schema of attendees, and I will add a created timestamp to our attendance table. I’ll push that into Git. Now we can run our CI/CD jobs again. Starting with plan. Plan shows that the attendees table is going to have a column added. This looks good. It also is going to recreate the view so that it has the created column in it for attendings. This looks okay. So we’ll go ahead and run our apply. And the applied job shows that it completed successfully. So if we jump back into Databricks, we should see that the table has a new column created. Yep. And we should see that our attendance view also has the created timestamp column. And it does. Back to you Keyuri.

Keyuri Shah: Thank you, Josh. That was an amazing demo. So we did see how easy it was to manage our metadata here. It wasn’t as complicated as we have to go in and change 10 different files in 10 different systems. We can just manage it all within using a single config level YML files, at any level you want not just make it automated, but we can also put in some fine-grained access control, so you all can tighten your security more. We can put up rules in place. Now that it’s all in a single source control, however you want to, either your team leader approves, or your director approves or approvals from various downstream systems, all that can be baked in with respect to your source code approvals. We also saw how easy it was to just catalog your PII/ PHI fields.
A little bit out of extension to it, what we do is we have a similar ELD framework where we now consume this PII/ PHI fields and we put more of a column level encryption on top of it. There is a call scheduled in this summit to talk about our ELT framework. So if this felt interesting to you, I’m sure you guys would like that. So do attend that or watch that on YouTube whenever you have time. Not just that, but doing this whole thing, what it means overall is less outages and less impact to our downstream systems. Wrapping up, looking forward to any question and answers you have. You can hit us on LinkedIn, here are our URLs, and please don’t forget to rate and review this session. Thank you everyone. And have a good day.

Josh Reilly

Josh Reilly is a Lead Software Engineer at Northwestern Mutual. His role is to provide architectural direction as well as enable his teams to be successful through mentoring and the creation of librar...
Read more

Keyuri Shah

Keyuri Shah

Keyuri Shah has 13+ years of good experience in IT, with diversified companies. Her heart and mind expertise in designing, prototyping, building and deploying scalable data processing pipelines on dis...
Read more