A recent Gartner report found that 62% of data teams consider overcoming siloed data use among business areas the most challenging aspect of data and analytics governance. Eisai was facing this precise hurdle, and with enterprise-wide reliance on sensitive data analytics, its data team needed a holistic, scalable solution.
Join Sean Jacobs of Eisai and Matt Vogt of Immuta to learn how they solved it.
Matt Vogt: Thanks for coming to our session. We're going to talk about Databricks Lakehouse and SQL Analytics as they relate to access control in Eisai's use of Databricks and the lakehouse. We'll run through a little bit of who we are and who you're hearing from, a little bit of the use case at Eisai, the needs and requirements that fit the opportunity, and then we'll walk through the solution and the results, basically the architecture and a demonstration of that solution, with time for questions at the end. I'm Matt Vogt, Director of Solution Architecture here at Immuta, and Sean Jacobs is my counterpart. Sean, why don't you introduce yourself?
Sean Jacobs: Thanks, Matt, and thanks to everyone for coming today. My name is Sean Jacobs. I'm the Director of R&D IT Architecture Services at Eisai. I'm responsible largely for our data lake effort at Eisai, where we have now adopted the lakehouse paradigm using Databricks Delta Lake specifically, and we have put an entire governance platform around it using Immuta. First off, we wanted to talk a little bit about Eisai. Eisai's mission is called human health care. What that means is that we interact deeply with the patients we work with whenever we have the opportunity and it's appropriate, and we believe we have an obligation to society to consider the perspectives of patients, their families, and the global community overall. That comes through strongly in our actions and in our daily work creating drugs and therapies for use in Alzheimer's and cancer treatments.
It's one of the most important things I've ever been involved with, and we take it extremely seriously. So first off, at Eisai we've had many different opportunities to look at questions such as: what is a big data repository, and what are we going to do with all of our information? We ran into the same initial problems everyone runs into at companies, where data is often very siloed, and therefore you don't see a lot of sharing among business groups. We have specific groups that are set up that way simply because of the historical way the IT architecture was designed. So we decided we were going to break down those silos. We were going to create a big data repository with a global data catalog, one that could handle both textual data and medical images, specifically DICOM.
That means CT scans, PET images, MRIs, et cetera, so that we could glean new information and accelerate our discovery processes and the analysis of our clinical trials, which in the future could help make the world a better place, following again our human health care mission. So what we were looking for was a unified data science platform that would let us analyze the big data in the repository but also share that data across the different silos in a governed fashion. How do you get there? Well, we started doing deep dives: is it in Azure? Is it in AWS? We decided that the path for us, given the tools AWS offered, was AWS, and we looked at the opportunities that things like SageMaker brought, all the capabilities it had, and how we would utilize them.
And we eventually graduated to looking at Databricks. Databricks brought us a whole bunch of things: it gave us a single point of entry for a BI analytics platform to analyze and report on the data after ETL and data science were done, with the various experimentation we wanted to accomplish. We also had the ability to put security around the available data sets in this environment, but it wasn't quite enough. We needed not only to break down the silo walls, but also, in some cases, to maintain them, because certain data is only allowed to be seen by certain people in the company. When you put everything in the data lake, it can become unwieldy. You can end up with a data swamp if you're not careful about all the things you're trying to do.
So we were looking for the ability to get this data in, get it secured, and provide a highly detailed audit trail of all the activities within the platform, whether that was people creating new notebooks, creating new data sets, granting access to those data sets, or creating new groups. Everything we could combine with what was natively offered in the AWS platform wasn't quite enough. So we ended up speaking to Databricks about how to secure our Delta Lakehouse in a better way, and Immuta was the first thing that came to mind once we had done our research and evaluated our options. Among other things, it allowed us, in the cloud, to reproduce results and data science experiments at a point in time, to see who has queried the data, and to allow for time travel across those data sets.
So that brings us to the needs and opportunities, right? We identified many opportunities for our existing systems: we wanted to maintain our data's original integrity, conform it to a primary record, and use primary records to produce output from inference and data science models. So we followed the medallion paradigm: bronze data lands, we consume that bronze data and turn it into silver data, a final record we can use as a reference point, and bring that into aggregated data in gold. And then, if you're going to work with data science models, training and retraining them, we end up with the ability to run inference on those models, and we call that output platinum data. So how do we control all of this data?
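The bronze-to-silver-to-gold progression Sean describes can be sketched in a few lines. This is a conceptual illustration only; the table names, fields, and validation rules are made up for the sketch, not Eisai's actual pipeline (which runs on Databricks Delta Lake, not plain Python):

```python
# Conceptual sketch of the medallion flow: raw records land as bronze,
# are cleaned and conformed into silver "primary records", and are
# aggregated into gold for reporting. All names here are illustrative.
from collections import defaultdict

bronze = [  # raw landing data, exactly as ingested
    {"patient": " p-001 ", "site": "OH", "score": "7"},
    {"patient": "p-002", "site": "oh", "score": "9"},
    {"patient": "p-003", "site": "MN", "score": None},  # fails validation
]

def to_silver(rows):
    """Conform bronze rows into clean primary records."""
    out = []
    for r in rows:
        if r["score"] is None:  # drop records that fail validation
            continue
        out.append({
            "patient": r["patient"].strip(),
            "site": r["site"].upper(),
            "score": int(r["score"]),
        })
    return out

def to_gold(rows):
    """Aggregate silver records per site for reporting."""
    agg = defaultdict(list)
    for r in rows:
        agg[r["site"]].append(r["score"])
    return {site: sum(s) / len(s) for site, s in agg.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'OH': 8.0}
```

The platinum tier Sean mentions would sit one step further along: model inference outputs derived from gold data, governed the same way.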
Immuta has the capabilities for us to take all of our data sets and enforce row- and cell-level integrity in a controlled way that also allows for role-based access and governance. All of these attributes combined gave us many different things, but the nice thing was that you can join data together in projects, with the ability to aggregate and govern data across multiple data sources. And it doesn't just have to be the data lake: it could be many different sources. You could have a MySQL database, a Postgres database, data in the cloud, data on-premises, and the way it all fit together was something we really liked: a unified data science platform where we could offer full data science and data analysis as a service throughout the organization.
The last thing, which is very important for pharma in general, and really for everyone, is to be able to audit all access to the data: what queries were run, who performed them, when a user was given access to a data set, and how they used the data once they got it back. You also have the ability to shut off access to that data. And it's all integrated with Databricks, so we have a data science platform with visualization tools built in, in the form of SQL Analytics.
But we also have the ability to leverage other reporting, dashboarding, or visualization tools, like Spotfire or Tableau or whatever we want, all through the same governance platform, using Immuta as our gateway into the Databricks clusters to query the data. This one-stop shopping, including things like MLflow and the ability to time travel across the data, has given the data science users and community at Eisai a unified, well-governed platform of a kind we've never had before. Matt, I'm going to pass off to you now.
Matt Vogt: Okay. Thanks, Sean. So stepping back, I want to talk about a couple of challenges that a lot of organizations face in building the type of model Eisai has, across three main pillars. First is centralized governance: without something that can enforce control across architectures, both lakehouse architectures and on-premises and cloud-based architectures, you end up with a proliferation of roles and data copies. You've probably heard the term "role bloat" in the industry: needing a bunch of different roles, and even a bunch of different rules, in order to control data. That proliferation of roles leads to complexity, which is both difficult to manage and difficult to secure. Consistency, layered on top of that, becomes a big challenge, in that certain technologies might have certain capabilities at the row, column, or cell level for applying security and granular control.
So being able to make sure that the right data, the same kinds of data, are protected in the same ways is a pretty large challenge with this approach. And in a migration, if you're taking this data and moving to a data lakehouse architecture, you have to be able to move it securely. This bridges back to the middle pillar: it's not just about having sufficient security controls both on-premises and in the cloud; you want the same types of controls, with the data protected the same way once you move it into the cloud. One of the things Sean talked about as well is creating this data-as-a-service model, allowing Sean's customers to see, get access to, and subscribe to the data that's appropriate for them, for who they are, and for the function they're fulfilling. So, Sean, do you want to talk quickly about the architecture you created here?
Sean Jacobs: Sure. Thanks, Matt. What we have is Databricks running as our compute engine in Amazon specifically, using S3 object storage as is required there. Layered on top of that, we didn't have to, but we chose to leverage the Delta Lake functionality, the Delta Engine, so that we could, again, time travel across data. For those of you who aren't aware, that's the ability to roll back to a particular data set at a point in time. You can roll back fully, or you can simply query at different points in time, which is really important when you're working with data science models. You might be producing some type of information that will be submitted to the FDA or another governing body, where you want to say, "At this point in time, this is what the data looked like when I generated these results."
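The time-travel idea Sean describes can be illustrated with a toy versioned table. Delta Lake actually implements this through its transaction log (e.g. `SELECT ... VERSION AS OF n` in Spark SQL); this plain-Python sketch just keeps immutable snapshots so older versions remain queryable, and every name in it is made up:

```python
# Toy illustration of time travel: each write produces a new immutable
# version, and reads can target the latest version or any earlier one.
class VersionedTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def write(self, rows):
        # append-only: a write creates a new snapshot, old ones are kept
        self._versions.append(list(self._versions[-1]) + list(rows))

    def read(self, version=None):
        # latest by default, or "as of version n" semantics
        v = len(self._versions) - 1 if version is None else version
        return list(self._versions[v])

t = VersionedTable()
t.write([{"subject": "s1", "result": 0.91}])  # creates version 1
t.write([{"subject": "s2", "result": 0.47}])  # creates version 2
print(len(t.read()))           # 2 rows at the latest version
print(len(t.read(version=1)))  # 1 row as of version 1, e.g. at submission time
```

The point of the FDA scenario is exactly this: `read(version=1)` reproduces the data as it stood when the results were generated, even after later writes.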
And that was critical to us. On top of all of this, again, we have the policies and audit trails that Immuta brings to help us maintain things like our GDPR and HIPAA standards. Those are paramount, especially when we're talking about the global platform we're building. If we're not maintaining those standards, we don't have the ability to see who has had access to data sets, when those data sets may need to expire, or how we can delete them from our data lake at the appropriate time, if that's required by contract or by laws and government stipulations. So binding all of this together, with the tools from AWS, Databricks, and Immuta combined, gives us an entire data engineering platform that we can use and share with our data users.
Matt Vogt: Fantastic. So what we're going to show in the demonstration is the combination of Databricks' best-in-class Unified Data Analytics Platform, including both the notebook, your traditional way to work with data in Databricks, Spark clusters, et cetera, and SQL Analytics, with a unified governance platform across both ways of consuming and managing data in the lakehouse. Then we'll have a few minutes at the end for Q&A.
Okay. So let's jump into the demo of how this is set up to satisfy Eisai's requirements for access control. We're going to start in Databricks, looking at this real-world evidence data set. This is my user, Matt, logging in and running some analytics against it in the traditional workspace notebook.
Now, I have another user, Sally, and she also has access to the workspace, but she's going to work out of the SQL Analytics environment, maybe build some visualizations against this data. There are a couple of concerns with this data set. One, it's obviously very sensitive in that it contains health care information. Two, it covers a lot of different states. So there are a couple of dimensions of policy we need to apply. One: users should only be allowed to see the states in which their office is located, or maybe the regions their clinics or medical centers are associated with. Two: we obviously need to protect the privacy of this data set, so we're going to employ some obfuscation and masking techniques on the data. Now, this row-level filtering between different states would typically be very difficult in a traditional role-based access control system.
I would have to create a role for every single state. And for any combination of states, like a region, I'd have to create a role for that region, assign users to it, and then create a rule for that role, for every possible combination, which gets out of hand into role bloat very quickly and hurts both manageability and security. So we're going to do this in Immuta through attributes, something called attribute-based access control. If we come to Immuta, we're going to author these policies against this data set. Here I have this real-world evidence data set, and if I go into it in Immuta, I'm going to quickly look at what's called the data dictionary. This is Immuta's data catalog. One of the things Sean talked about was having this central repository where users can go search for and subscribe to data sets, see what's out there, and where we can now also apply the rules.
Immuta has a classification engine, so we're actually taking samples of data and creating tags, metadata, based on what we discover inside the data sets. You can see these tags here: we discovered gender, birth dates, location information, and here we discovered that this column looks like it contains state information. So rather than applying a policy against this one data set that says, "Hey, filter this data set by state," I'm going to apply a global policy that applies this row-filtering policy to all data. I'll come over to the policy section in Immuta, and you can see I have this state filter already staged; it's not activated yet, so my governance team can come in and author these policies before they go into effect. And this is what a policy in Immuta looks like. We talk about usability and how it aids security and compliance at scale.
Users can come in and see the policies that govern data, and the governance team, non-technical teams, can come in without having to read a bunch of SQL statements or policy definitions to understand what's going on. They can see the business logic right here, in what we call explainable policy. So if we go back to Databricks as my user: I'm authorized in a number of Midwest regions, so I can see Minnesota and Iowa and, I think, Ohio. These are the states where I can see data. Now, when Sally comes back in and reruns her analytic, the rules will apply differently, because she has different attributes, not a specific role, but attributes that govern which states she's authorized to see. In her case Florida, Ohio, and, I think, North and South Carolina. [inaudible] South Carolina data in here.
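The attribute-based filtering Matt demonstrates can be sketched as a single policy evaluated against per-user attributes, instead of one role per state combination. This is a conceptual sketch only; the user names, attribute shapes, and data are invented for illustration and are not Immuta's API:

```python
# Illustrative attribute-based row filter: one global policy, applied
# against each user's "authorized states" attribute. No per-state roles.
users = {
    "matt":  {"states": {"MN", "IA", "OH", "WI"}},
    "sally": {"states": {"FL", "OH", "NC", "SC"}},
}

rows = [
    {"patient": "p-001", "state": "OH"},
    {"patient": "p-002", "state": "MN"},
    {"patient": "p-003", "state": "FL"},
]

def query(user):
    """One policy for everyone: keep only rows whose state is in the
    querying user's attribute set."""
    allowed = users[user]["states"]
    return [r for r in rows if r["state"] in allowed]

print([r["state"] for r in query("matt")])   # ['OH', 'MN']
print([r["state"] for r in query("sally")])  # ['OH', 'FL']
```

Adding a new region means editing one attribute set, not authoring a new role plus a new rule, which is the role-bloat problem the transcript describes.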
Those are the regions of data she's allowed to see. Now, to the obfuscation: there's a whole bunch of data in this data set that might be considered sensitive, names, even dates of birth if you're talking about HIPAA and expert determination, addresses, et cetera. So we're going to obfuscate this data. Immuta comes out of the box with a couple of templates covering different regulatory enforcement of policies on data. Those are essentially our out-of-the-box templates, but I've prebuilt this other, more generalized data policy. I'm going to activate it as well, and we'll take a look at it.
In this instance, again, we're going to use data tags. Anywhere we find a social security number, wherever Immuta has discovered one, we're going to make it null; we're going to redact names; we're going to hash email addresses. And you'll notice the way you author policies in Immuta: think of them like defaults. You say, "Do this as the default," and then, "What are the exceptions to that default?" It's a much simpler and more secure way to prevent data leakage than saying, "Do this for these people." You're saying, "Do it for everyone; this is the default. Now what are the exceptions?"
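That default-plus-exceptions model can be sketched as a tag-driven masking function: mask by default, and reveal only for an explicitly listed exception. Everything here, tag names, the "sensitive" clearance level, the specific transforms, is an assumption made for the sketch, not Immuta's actual policy engine:

```python
# Illustrative tag-driven masking: null SSNs for everyone, redact names
# by default (exception: users with "sensitive" clearance), hash emails.
import hashlib

def mask_row(row, tags, clearance):
    out = {}
    for col, val in row.items():
        tag = tags.get(col)
        if tag == "ssn":
            out[col] = None  # default for everyone, no exceptions
        elif tag == "name" and clearance != "sensitive":
            out[col] = "REDACTED"  # exception: "sensitive" users see names
        elif tag == "email":
            out[col] = hashlib.sha256(val.encode()).hexdigest()[:12]
        else:
            out[col] = val  # untagged columns pass through unchanged
    return out

tags = {"ssn": "ssn", "name": "name", "email": "email"}
row = {"ssn": "123-45-6789", "name": "Ada Smith", "email": "ada@example.com"}

print(mask_row(row, tags, clearance="sensitive")["name"])  # Ada Smith
print(mask_row(row, tags, clearance="public")["name"])     # REDACTED
print(mask_row(row, tags, clearance="public")["ssn"])      # None
```

Because the policy keys off tags rather than specific columns, any newly discovered SSN or email column is masked automatically, which is why masking-by-default leaks less than grant-by-list.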
So when I come back to my workspace in Databricks as Matt, you'll notice one of the policies says we're going to redact names unless the user has a specific clearance level. You might have eight security levels, where levels one through three can see certain bits of data; basically, different elements might be revealed based on someone's clearance level. I have a clearance level of "sensitive" and Sally has "public," so she can only see data that's been cleared for public consumption. When I rerun this query, I still see names, but birth dates are rounded to the nearest month.
And if I look at another data set, like this HR data set that has social security numbers in it, those should all be nulled for me, and email addresses have been masked. Here's an encrypted data set, this client data list; we're applying policies to it as well: email addresses, credit card numbers, different bits of information being protected. And when Sally runs these queries herself, she does not have the same clearance level that I do, so she cannot see names, and birth dates, et cetera, are still rounded for her.
Now let's say Matt and Sally need to work together on a project. We need to prevent data leakage; I need to make sure that Matt and Sally don't share too much information with each other. This is where the concept of Immuta projects comes into play. I have this clinical trial; think of a project as a reason people are accessing data, like any other project. I'm going to add Sally to this project, and then I'm going to do what's called equalizing the project. That says anybody in this project sees data at the same level. I have a compliance notification here that says Sally needs to acknowledge the purpose of the project. Projects themselves can come with legal language and purposes; think of these like data sharing agreements, the legal reason someone is able to use data for this project.
So Sally gets a notification in her Immuta: "Hey, Sally, I added you to that project. You have to acknowledge the purpose statement." Sally has to come in and acknowledge, "When I use data for this purpose, I promise to use it for the reasons I just accepted." Now Sally can switch her project context to that clinical trial, and I'm going to do the same with my user. So now Immuta knows all the rules on the data, and all the information about the users: what attributes they have, what groups they're in, et cetera, and we're going to equalize access. Now when I run this query, because Immuta knows I'm working with Sally, and Sally is not allowed to see names, I can't see names either. And for state information, Sally can't see information outside of Ohio, because that's our overlap.
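Conceptually, equalization reduces every project member to the most restrictive view they all share: the intersection of their authorized states and the minimum of their clearance levels. This sketch uses invented attributes and numeric clearance levels to illustrate the idea; it is not how Immuta computes equalization internally:

```python
# Illustrative project equalization: collaborators see only the
# intersection of what every member may see, so neither can leak
# data the other isn't entitled to.
members = {
    "matt":  {"states": {"MN", "IA", "OH", "WI"}, "clearance": 2},
    "sally": {"states": {"FL", "OH", "NC", "SC"}, "clearance": 1},
}

def equalize(project_members):
    """Most restrictive view shared by every project member."""
    states = set.intersection(*(m["states"] for m in project_members.values()))
    clearance = min(m["clearance"] for m in project_members.values())
    return {"states": states, "clearance": clearance}

print(equalize(members))  # {'states': {'OH'}, 'clearance': 1}
```

That matches the demo: within the project, both users are limited to Ohio rows and to Sally's lower clearance, so names stay redacted for Matt too.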
Likewise, if Sally reruns this query, she's not allowed to show or share information with me outside of Ohio, because that's all of my view that overlaps with hers. So I can't see Florida or North Carolina, and she's not supposed to see Minnesota or Wisconsin data. Equalizing these projects allows Immuta to do a couple of things: one, guarantee the prevention of data leakage, and two, tie access to this purpose. One thing Sean talked about was auditability. Immuta can see who is looking at what data, with a timestamp; as Sean mentioned, you can go back in time and see not only who queried what data, but at what time, under what project or purpose, and what policies were enforced on them when they ran that query. That's how we get full end-to-end auditability. That's about all the time we have for the demonstration. I'm going to stop sharing; please feel free to ask your questions in the chat, and we will answer them as they come. Thanks so much.
Matt has over 18 years of experience in architecture and engineering of large-scale enterprise data center infrastructure. Matt came to Immuta from Hewlett Packard Enterprise, where he was a Chief Technol...
Sean has more than 25 years of experience in Information Technology. He has held various roles over his career, including Programmer, Solution Architect, Application Architect, Data Architect, Enterpr...