The volume of available data is growing by the second (to an estimated 175 zetabytes by 2025), and it is becoming increasingly granular in its information. With that change every organization is moving towards building a data driven culture. We at Northwestern Mutual share similar story of driving towards making data driven decisions to improve both efficiency and effectiveness. Legacy system analysis revealed bottlenecks, excesses, duplications etc. Based on ever growing need to analyze more data our BI Team decided to make a move to more modern, scalable, cost effective data platform. As a financial company, data security is as important as ingestion of data. In addition to fast ingestion and compute we would need a solution that can support column level encryption, Role based access to different teams from our datalake.
In this talk we describe our journey to move 100’s of ELT jobs from current MSBI stack to Databricks and building a datalake (using Lakehouse). How we reduced our daily data load time from 7 hours to 2 hours with capability to ingest more data. Share our experience, challenges, learning, architecture and design patterns used while undertaking this huge migration effort. Different sets of tools/frameworks built by our engineers to help ease the learning curve that our non-Apache Spark engineers would have to go through during this migration. You will leave this session with more understand on what it would mean for you and your organization if you are thinking about migrating to Apache Spark/Databricks.
Madhu Kotian: `Hey! Good day and welcome everyone. My name is Madhu Kotian. Me and my team are responsible for managing CRM, reporting, and related application. I’m also responsible for managing and maintaining our investment product system data org. I’ve been with the company for 15 years, and prior to Northwestern Mutual, I was in different companies working on data architecture and data engineering. I’m also accompanied by my colleague here Keyuri Shah, lead data engineer at Northwestern Mutual. Keyuri, quick intro.
Keyuri Shah: Hello, everyone. Hope you all are having a good time in the Spark Summit. I am Keyuri Shah, engineer at Northwestern Mutual. I have been with Northwestern Mutual for the past 10 years and part of various innovation projects. My most recent one was to move our field reporting BI space to big platform, and that’s what we are going to talk about today.
Madhu Kotian: We are excited to talk to you about Northwestern Mutual’s journey to transform our BI and reporting space. In this talk, we will describe our journey to move hundreds of ETL jobs from our current MSBI stack to Databricks and build a datalake using Delta, our approach for reducing our data load window, we’ll also share our experience, our challenges, our learning architecture, and describe patterns used for undertaking this huge migration effort. We’ll describe the data’s different sets of tools and framework we used for ingestion and building and scheduling, which also in a way helped our engineers with their onboarding process. You will leave the session with some takeaways on what it would mean for you and your organization if you’re thinking about migrating to AWS Spark and Databricks. Before we get through the agenda, a quick overview on who we are, who Northwestern Mutual is.
Northwestern Mutual is 160-year-old, Fortune #102 company that provides financial security to more than 4.6 million people. We delivered this through holistic approach that brings together a trusted advisor, a personalized financial plan that includes insurance and investment products, again, powered by technology to drive this seamless digital experience for our clients. Now, technology accelerates our business and the company strategy to deliver this digital experience to our clients. It also helps our 10,000+ financial representatives throughout the country to better serve our clients. We have been in the digital journey or transformation for a couple of years, and the pace is accelerating more rapidly than ever before to meet the need of our consumers today. We are based in Milwaukee; our headquarters is in Milwaukee with a stint of 4,600+ employees. So, these are the four pillars of Northwestern Mutual. I’ll go really quick on this.
When it comes to Commitment to Mutuality, Northwestern Mutual is a mutual company, meaning we aren’t reporting to shareholders or Wall Street, only to our policy holders and our beneficiaries. This allows us to plan for a long-term benefit for our client and think differently the way we manage the finances and investments. Financial Strength is paramount for the value we provide to our policy owner. Now again, this is affirmed by different rating agencies. We are only two companies in the US with the highest financial strength rating; Microsoft is the other. So, when it comes to Exclusive Distribution, it really relates to our financial advisor. It’s about providing this human-to-human interaction, which is more about building trust while providing this financial advice to our policy owners. All this is not possible if we don’t provide best-in-class service and a long-term product value. That’s a quick word on what we do and what Northwestern Mutual is.
So, what I’m going to do is quickly go through the agenda and then give an introduction about what our team does from a reporting and a BI platform perspective. Keyuri’s going to go through a quick overview of our before and after migration strategy. She’s going to go through some of the framework approach. I’m going to come back and talk about the migration approach, what we did, and then key takeaways, the lesson learned. This is going to be something which you can take back to your respective organization if you are planning to transform your respective space into AWS and Databricks.
Our team. A quick overview on our team. This is a team of 30+ engineers who are responsible for building, managing, and reporting platform that collects data from different heterogeneous sources, our dashboard system. We collect the data, curate them, we do a lot of aggregation. We do also build a dimension model, so that we can render this to report our dashboard needs. Our goal is to provide the scan, and on-demand reports are two different insights with different insights to our end users. We also work with different home office business partners, like our in-house users, and provide them an environment to perform this adhoc analysis. But, that’s pretty high-level on what our team does, so let’s hear our world before migration from Keyuri.
Keyuri Shah: So, what was our role before migration here? Like Madhu said, we are a reporting platform. We get data from all different kinds of sources, and then, we get data from almost 30 to 40 different databases, which are all different from UDP to MySQL to SQL Server, also in just a lot of flat-files, in terms of XML files, CSB files. Over to the left, in the big picture here, we can see all the different data sources we get from our on-prem system and our cloud systems. Before migration, we used to use SSIS integration services, and it just didn’t curate all of our data. Once the data is curated, we store them over to our next layer here, which is the data storage. We used to use analysis services and sequel databases in Azure to store all of our data. Our data size is almost two to three terabytes at that point in time. From there, we use Power BI and SSRS to do our visualization here; already typical MSBIs, Microsoft MSBI stack we use.
Now a few quick stats. How would they be before? We had almost 300 ETLs in our SSRS that we used to run overnight. It used to run for approximately seven hours nightly. Any new feature that we would get would take around five to six weeks to put it out to the market, which means the time we get a requirement on an average. We have to get new tables, curate them, put it into Power BI/SSRS, and then we ship it out to our production and our end users, so that timeframe was approximately five to six weeks, which was a long time for our product and our business to rate. So, tacking onto that, what were our pain points in that before infrastructure or our data space? Well, everybody in the organization, and even at your organization, at your team level, you would have seen this trend in increased data volume.
This was increasing the granularity of the data that’s available, increasing more data available, so we have to start ingesting all of this new data because our business wants different reports, different analytics, out of this huge humongous data. Well, with increase in data volume, we increase our latency. When we started, we were running at two hours of bad cycle, but we kept on growing and growing and growing, and now we were at seven to eight hours of bad cycle. It’s going to keep growing; we are not stopping here. So, it’s a strictly near latency; our latency increases with the new data load.
Also with that, what increased was our data inconsistency and data sprawl. We had to duplicate data at different places. Like most of the industry standards, we have the reporting database, we have a data warehouse, we have a data for analytics use, we have a database for APIs. Because of all of that, there’s an increase in data redundancy. That adds onto all different kinds of problems saying when data doesn’t match that and things like that. Another biggest pain point from our consumption standpoint was for our analysts. Our business users, our analysts did not have an integrated data to work with, so as part of their analysis, what they had to do was they had to bring data from all of these different databases at one place, and then start doing that analysis. It wasn’t the ideal place for them to do their job. Then, of course, within increase in all of this, it increases cost. New, more database increasing costs, more computing fees and costs.
Our other main reason was because we were a typical data warehouse with increased storage, we had to increase our compute too. So, the cost keeps on growing exponentially here, and then, security. As a financial company, security is most important to us. It was just getting challenging to manage that highest level of security with all the other problems that we had going on. Keeping this in mind, we decided that it was time to move to a more modern, more scalable, more secure platform. When we decided to move to a new platform, what were our different considerations? Well, our bread and butter: how do we do ELP? And how do we schedule all of our jobs? Performance. That was a major thing that we were looking for. Seven hours, and it kept growing. That was not at all ideal.
Second important thing was any new tool we use should be easy to maintain, easy to use, and easy to learn. We wouldn’t want any developer to go through intensive one year, two years worth of training, and then start contributing in our agile world. We had to pivot quickly. We had to learn quickly, and we still had to keep producing our business results. Another main concentration was when we compute our scale. When we scale our compute and our storage, we want to manage our costs accordingly. We keep on getting more data doesn’t mean our cost has to increase exponentially. We had lots of complicated dependency. We get data from 30 to 40 different sources and managing complexity at the table level, at the job level. It becomes a huge spider web at this point in time, but that’s not something we can avoid. What we wanted to do is make it simplistic for our developers to put those complexities in place. For anybody who is debugging, make it easier for them to understand all of these complicated dependencies.
Moving onto the middle here: our datalake, our storage. That’s another main pillar when we talk about our new platform. Our datalake had to be governed, so every metadata we put in a new table, a new changing column, all of that had to go through a good governance process. We have so many downstream consumers that use our data; we don’t want them to come down when we are making or improving our system. That was an important factor when we consider our new platform.
Support asset operations. Typical insert, create, updates, delete, all those should be possible at the table-level. Ability to look at the past data. What that really means is every night we do a nightly job, but we also want to go back past three days, four days, and see what was the state of our data at that point in time. It’s important when we try to do time-bound reporting and things like that. Then, data validations and data quality. Every night, when we load our data next morning, when our field users come and log into their report, we want to make sure that the data that they are receiving is of the highest quality.
Moving on to our third pillar here: security. As you can see, it has a section on its own, which means it’s of utmost importance. While we do encryption at rest, encryption in transit with all of our implementation, column level encryption is a major requirement to protect our PII and PHI information. We don’t just want to protect it against hackers, but we also want to protect it from all of the people that are accessing our data: developers, testers, and admins. Only those people who are authorized to see the PII/PHI data should be able to see it. That kind of moves me to our next point, which is role based access to databases, tables, and views. We want to manage who gets access to what data easily, which means one group can get access to a certain set of views, and another group can get access to another set of views. Everything should be Active Directory-enabled and Active Directory-controlled. Azure Active Directory is one of the major grouping how we do at our organization at all different levels, so it’s very important that whatever tools we choose has the capability to integrate with active directory.
While doing those considerations, I want you guys to take a quick look at what does our world after migration look like. Obviously, on the left, our sources did not change. Another decision, what we made on our right, our visualization layer did not change. So, how are reports, Power BI, SSRS reports were rendering. Nothing on that changed. What changed is our middle box here. Instead of ingesting data through SSIS, we are now using Databricks and Airflow for scheduling. We have built custom frameworks on top of our Databricks and Airflow to make sure everything is easy to use and easy to manage for our team, for our organization. For data storage, instead of SQL Server, we moved all of our data to a datalake, and we use Delta tables. Then, that table supports all the asset operations, time travels, which was the features that we were all looking for in our datalake.
We also have MySQL Instance, which we manage for our high latency/low latency use case. I put the stats in the same comparison as our before slide. How did we do after our migration? Instead of ETF files, now we are counting the config files, and we can go through in a little bit detail what does that mean.
We are now over 500+ config files in our frameworks. Our huge difference is what we saw with our batch cycle times. Instead of taking seven hours, it now takes only two hours. Mind you, we are now ingesting more data than we used to do before we made some changes in our ELT because there was always need for more data. Now that we can ingest it with more data, our cycle time came down; that was a huge win. Another good is time to market. We used to take approximately five to six weeks, but average level requirement is now practically down to one to two weeks with all the approvals, with all the CICB processes we put in. Just by the nature of tools and technology, we can now put new requirements in for one to two weeks. So, that was a amazing change.
But how did we do all of this? Like I touched a little bit on the previous slide: the config files. What does that really mean? We picked various development frameworks to help facilitate our developers. First of them is our ELT framework. It sits on our Databricks and uses [inaudible], owned by Spark. Basically what it is, it takes some config files. They are simple to read, simple to apply JS ON files. It has different options, like you are trying to read from a DB2 database versus you are trying to read from a CSV files that developers can put in, and the destination level. How do you want this curated data to be placed at? That helps our developer not write any complicated by spark for every ETL. Every ETL now basically gets converted to a config file, which developers check in. It really makes it simple to use not only developers level, but even from support perspective and deep working perspective. We basically go in and we use those JS ON files.
It’s also CI-CD approach. So, that helps with the approvals and all. Another thing important that we have put in our framework is called level encryption. How we identify PII/PHI columns, and then the framework takes care of encrypting at column level. Yesterday, there was a talk on the same topic, so if you guys are interested, please watch the recording on it.
Moving to the middle here on the metadata framework. Pretty much same philosophy. It uses configurable YML file this time, but it’s basically schema. You provide column names, databases names, access requests, everything through a config file, and the framework goes in and takes care of creating all the databases automatically, creating all the views, applying all the access controls, all of that. So, there’s a talk scheduled today from 5:00PM to 5:30PM. If you guys are interested, please join us. Similar lines over to our airflow framework. We have a very complicated dependency management, and we didn’t want our developers to write hundreds and hundreds of lines of code to manage all the dependency at every table level. We abstracted all of that down to a config file, so now what developers have to do is put a schedule, put all the dependencies for the table, and then the framework takes care of creating a DAG [inaudible].
If anybody has more interest in knowing what that does or how it does, feel free to connect with us on LinkedIn, and we can get together and get more information for that. Now, how did we do all of this? I will hand it over to Madhu to talk about our migration approach here.
Madhu Kotian: When it came to our migration approach, the first thing was looking at the team setup. We started with a small group. You may all have different scrum teams to manage your respective data analytical platform and manage different business priorities. What we did in this case was pulled a couple of engineers from different teams, our scrum teams, to form this cohort group. We provided a sense of autonomy to them to perform these incremental experiments and POC and incrementally mature the platform capability. The key here is to use your existing skillset, existing team, because they understand the business rules and the logic behind it. Bringing someone from external, you can work through it, but sometimes it can delay your progress. So, that was a learning here.
With respect to code migration, when we talk about incrementally maturing the platform capability, we started with one end-to-end flow with an intent to make it production ready. This gave us an opportunity to enhance our framework as we moved along. It gave us a good view on what that end-to-end will look like and what the shortcoming of the frameworks were. We incrementally imploded as we moved along. Also, for a migration, we did not consider lift and shift, and we did not use any of the accelerators to fast track the project. Now, there is benefit of using accelerator. In our case, we had a lot of redundancy in our model, in our ingestion process, so we wanted to re-look at all that. We look at the entire structure and then build everything new. So, that’s the reason we ended up not taking accelerated because that would have basically taken us back a little bit in terms of our timeline.
Now when it came to the experience, Keyuri touched on that. We did not change our reporting and dashboard layer. Our goal was not to impact that end user experience. We eliminated the dependency the MSBI infrastructure application had and moved the scheduling and data ingestion and the backend infrastructure without impacting the front end aspect. Pretty much changing everything behind the scene is what we did. In essence, our end users, basically, were oblivious of all the change happening on the backend. From a validation and a support perspective, we had the option to keep our Azure environment up and running while we were incrementally building our AWS environment. This way you can go about building continuously your AWS structure, but you can also, at the same time, validate the data and compare what you see AWS. That way you know you can match the parity between them and see what’s missing. If you remember me talking about each incremental build with the validation and verification place, we also post the feature outdoor user for any feedback on performance or data quality.
When it came to key takeaways, something you can gather from our experience here. The first thing we looked at, our first priority, was to look at a business product and a security need. We approached our business with this idea. Obviously for them, they should be on the same page around what this really means to them. There should be a common understanding of the pain points we are trying to address. We ended up working with them, understand their pain point, so as you know, from a business perspective, there are numerous pain points when they are trying to deal with different data needs. They worry about dangling data. They worry about the data quality. They have different adhoc analysis and needs, so obviously they don’t get access to those different production systems. It’s very distinctive.
They ended up building their own channel ID, bring all the data together in their own environment. In essence, it becomes stale. The version of truth isn’t articulate, so there are numerous pain points the business team has been running with. We basically captured those pain points. It was a good input for our business case, as we build the justification around why we want to really migrate to this new platform. Other critical thing we did upfront was security. Again, this was something Keyuri touched on. Security is first-class citizen, and then being a financial company, we are subject to numerous audits and security reviews. We made sure that solution architecture wise, we have everything in place from an encryption perspective. We wanted to make sure that encryption in transit, at rest, and even column level encryption to product PII and PHI information. That is basically implementing the process and how to really promote access to different users to come in the system?
My philosophy of whatall is not to restrict people from getting the access, but you can restrict on what they can see. If you don’t want them to see PII/PHI information, you can put specific controls around it. It’s more about prioritizing the data, having that available to anyone who wants to consume it. For us, the key was to make sure the security controls are built into the framework. Before we enter into any POC or work around building the framework, we wanted to make sure those security controls are in place.
From a product perspective, along with business, we have digital product partners we work with, and the goal is to meet those business priority. They worry about the velocity of the team. By carving out the team’s velocity, upfront work on this feature step was very critical for us. As we conformed on the experience, we were quickly able to onboard the rest of the team and adjust that velocity. They take on additional innovation plus deliver on some business priority. That’s what we did with our business product and security team. The other aspect is also showing and proving your progress, so as you’re incrementally learning through the process, deploy this incremental change and get immediate feedback from our stake holders. Also communicate. I think communication is very critical. Cannot stress enough. Don’t use this as your secret project, right? You’re innovating. You don’t want to share with many people, but I would say share it because you will be surprised to get a fair amount of feedback from them, which is going to, in the end, benefit your product.
The other thing we did very right was the plan to onboard our engineers, so as we looked at the reason why we build a framework, all our developers, our MSBI stack engineers. Now, again, it’s overwhelming for them to really venture into Python, Spark, and Databricks and with an intent to deliver everything in a year and a half. It’s a daunting task, so it would have been a huge lift for them actually. Building the framework was very critical for them because one they can understand, from a data engineering perspective, what is needed and the framework gave them a pathway so they can use this configurative one approach, and then, in the end, also use the SQL knowledge in transforming into this new platform. We were very meticulous in that radicalizing the training program around what we want our engineers to learn. We develop some sort of Python beginner’s training, the how-to guide. We created some sort of a workshop so that they can navigate through the process and understand what those pitfalls are, and they can learn from it.
Plus, I think the beauty was since we carved out this separate team to work on it, they were basically sent back to their individual team. They were basically the mentors placed on the team as they were onboarding.
As I mentioned earlier, we kept the fallback option open, which is Azure, so that if we ever had a scenario where AWS was not working as we expected, we always had an option to revert back and move to Azure. We didn’t have to do that, but it was a good validation for us. The other key point here is you’re working on a cool new technology, leveraging the power of distributed computing, but your downswing system is not ready for you. Please consider them and consider using appropriate amount of cluster when you’re hitting your downstream legacy system. You can bring them down. We did it, but we quickly recovered. That was a big “A-ha” and a learning moment for us. We ended up basically optimizing what level of cluster we need to open when we are elevating the downstream system.
This is pretty much our talk in closing. We are not done here. This is just the beginning of a journey. We are working on setting apart two instances, I can’t read, for our business to put their hands on SQL analytics and build on the lakehouse pattern. I hope the journey we shared here in our presentation today was insightful and hope this can benefit your journey ahead. Thank you for all listening in and looking forward for any questions you may have. We can answer those. Thank you.
Madhu Kotian is a Sr. Director Engineering at Northwestern Mutual. He is responsible for leading the mission, vision and implementation of different applications along with data and reporting capabili...
Keyuri Shah has 13+ years of good experience in IT, with diversified companies. Her heart and mind expertise in designing, prototyping, building and deploying scalable data processing pipelines on dis...