Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics introduces new innovation on the “house” to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads that require distinct considerations for data access. On the lake side, tables with sensitive data require fine-grained access control that are enforced across the raw data and derivative data products via feature engineering or transformations. Whereas on the house side, tables can require fine-grained data access such as row level segmentation for data sharing, and additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI or Looker.
The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms to show how to set up data access for common scenarios for Databricks teams new to SQL Analytics.
Zachary Friedma…: Hello, welcome to my talk. I am very excited to be here at Data + AI Summit 2021. My name is Zachary Friedman and I’m a Product Manager at Immuta. This talk is called Considerations for Data Access in Lakehouse.
We have a lot of ground to cover, which includes a tiny history lesson, some primitive material to set up the final part of this talk, and finally, we’re going to build an actual enterprise-grade authorization framework in Databricks SQL Analytics. More accurately, we’re going to start to build one, but I’ll be sure to leave you with plenty of future areas of enhancement for you to go back to your teams with and think about implementing it yourself. Like I said, we have a lot of ground to cover and not much time with Q&A at the end, so without further redo, I’d like to speed through some of the basic setup material for this talk starting now. I’ll also call out that Spark Workspaces have a different access control model with respect to what I’ll be speaking about today. This talk is ultimately driving to building an authorization framework in SQL Analytics.
A lot of us were here today because of this new architecture called the Lakehouse. The Lakehouse architecture has a ton of exciting applications and implications, but it doesn’t obviate our need to continue to ensure that our data is fit for purpose, and that means appropriate data competence. What is a Lakehouse? The Lakehouse is not a 2006 American romantic drama film starring Keanu Reeves and Sandra Bullock, but in all seriousness, it’s helpful to look back quickly just to set some context. We have to go back to the late 1980’s to the inception of the data warehouse. Data warehouses arose from the need to support decisions and business intelligence applications, and they continue to evolve their architecture to handle larger and more complex data requirements. The major limitation of the warehouse is actually also a strong point, which is that it was purpose-built to handle structured data.
Now, while it was always the case in some form or another, in the early 2010’s especially, modern enterprises began having to deal with a great deal unstructured data or semi-structured data. This data was highly heterogeneous, was coming in faster, and there was a greater amount of it. This made it difficult for the warehouse to scale for many of these use cases, especially in the most cost effective manner. We began to collect larger amounts of different data, and as we did this, architects began envisioning the single system to store data for many different analytics products and workloads. We began to store repositories of data in various formats and since all data professionals seem to love water-based metaphors, we’ve dubbed it the Data Lake.
Now, with cheap and highly reliable storage and the form objects it stores, it also became much more cost efficient to do this. But, of course, it turns out that the data warehouse and databases in general actually got a lot of things right. For one thing, it turns out that transactions as of properties and data quality enforcement is actually still pretty critical, and the requirement of easily storing different formats of raw data doesn’t abate our needs for these benefits. We arrived back at our current modern times. Today, a new system is emerging to address the limitations of the Data Lake, combining it with the best elements of the data warehouse Lakehouse are effectively what you would get if you had to read the zoning data warehouses in the modern data center.
Let’s talk about some key features of the Lakehouse. We can break these features down into basic key activities which in include transaction support, schema enforcement and governance, BI support, separations storage from compute, and support for diverse workloads. But, enterprise-grade systems require additional pooling and features. Security and access controls are basic enterprise requirements. There’s also the need for additional governance capabilities such as auditing and lineage as well as tools that enable data discovery such as a modern data catalogs. A consistent theme on the Lakehouse architecture is that enterprise features such as these theoretically only need to be implemented for a single system. Today we’re going to drill down deeper into the first of these enterprise requirements and talk about how to build an access control management framework for the modern data enterprise in the Lakehouse.
Diving deeper, let’s discuss key concepts in authorization in the Lakehouse, we’ll start with a brief overview of Role-Based Access Control design components. Then, we’ll discuss Attribute-Based Access Controls, which rely on rich attributes associated with each principal to make authorization decisions. This is a model often used when RBAC is not expressive, as is commonly the case in the enterprise. Finally, we’ll talk about where this authorization layer gets applied in the context of the Lakehouse, specifically in Databricks SQL Analytics. We’ll begin with an overview of the basic primitives Databricks gives us to work with, and use those to drive towards the beginning of establishing a framework for enterprise-grade data governance in SQL Analytics.
Let’s have a discussion about Role-Based Access Control. Role-Based Access Control or RBAC basically means that to manage access to resources, we group commissions into roles, and then we assign those roles out to our users. We’re talking about forming user-role relationships. For instance, Zach has the finance role as well as role permission relationships. For example, finance has the ability to query the accounting database. Simple enough, right?
Now, in terms of defining user-role relationships in Databricks SQL Analytics, SQL Analytics gives us the nice APIs to work with, to define and maintain the user-role relationships. We can use the Admin Console, the Groups API, or the SCIM API. The Groups API is helpful in terms of keeping groups, and user memberships in these groups, in sync with a system external to SQL Analytics, such as a first party source of truth, which could be an enterprise application. In a lot of our cases, that source of truth is actually another external third-party system, and it’s called Active Directory. Databricks provides a SCIM API, which you can use to keep this system.
The basic management or maintenance tasks here is keeping accurately reflected what the groups are and which users are part of which groups. Now, in terms of defining the role permission relationship, this is all about defining the access that a role grades to a user. is_member() is one of two dynamic new functions provided by Databricks. It allows you to determine at query term if the current user is a member of a specific data, which we’re going to come back to this function when building our frame in just a minute.
Now, let’s talk about Attribute-Based Access Control. Attribute-Based Access Control is all about representing fine-grained or dynamic permissions based on who the user is and their relationship to the resource they want to access. In its most simplest form, this user relationship to the resource can be expressed as a JOIN on user attributes and values of a resource column. Now, technically RBAC is a simple variant of ABAC. RBAC allows you to express logic based on user.role, whereas ABAC allows you to express logic based on user.attribute, where attribute can be any attribute of the user.
Now let’s talk about the three different types of access control dimensions in SQL Analytics. We have table level access control, row level security, and column level security. An example of table level access control would be that a user can access sales data, but not financial data. An example of row level access control would be that a user can access a particular sales opportunity or a sales opportunity matching certain conditions. Finally, an example of column level access control would be that a user can access only certain fields of a record, and we can mask the values of a column depending on who the user trying to access this data is. That brings us to a quote from a very smart person and that is, “You’re going to need a framework to manage all of these access controls across your entire Enterprise.” Of course, that was said by me just now.
You’re going to need a way to consistently think about and scale policy grants to your entire organization, unless you want to be saddled with manually creating and dropping roles, manually adding roles to users and table access to users. You would probably find it extraordinarily difficult to get much other work done if this was your daily job. Now, with this context and primitive objects set, let’s talk about the requirements for an enterprise-grade access controls framework.
In terms of table level authorization requirements, we can really break this down in terms of a few key capabilities. We want to be able to easily grant access to tables and views by virtue of a user being a member of a group, which is our typical role-based access control example. In Databricks, it’s just easy enough to mend with existing data object privileges on Databricks groups, provided we do the work to keep the Databricks groups in sync with our corporate directory or other system of record. We also want to be able to express authorization grants to tables and views based on whether or not a user possesses an attribute, which can be thought of as a simple key value pair, which combines with the Databricks user to create a Triple that we can use to enforce control at the table.
Our simplistic framework will just discuss these two methods of table level access controls. Most enterprises will want to also start to think about other means for governing access at this level. A few examples that we won’t directly touch today include implementing a request, approval workflow where a user can request access to a table and be granted access when they are approved by an admin. Most of you are probably today handling this with email and Slack, but there is actually a way that we can implement this native to the data warehouse or Lakehouse. We also have tables in our accounts, which are safe to be viewed by all users at all times. These won’t be discussed because the solution here is a straightforward, simply grant access on these data sets to all users and be done with them. Similarly, certain data sets we want to express access to in terms of simply granting individual users access to tables. This is also a straightforward implementation of “GRANT SELECT ON TABLE name TO [email protected];”
Certain cases also arise where we want to be able to grant access to specific users, groups, or users possessing specific attributes, but only for a specified period of time. This sort of leased access to data sets is a common enterprise requirement. Don’t we won’t have time to discuss its implementation in this talk. Finally, going somewhat hand-in-hand with time-based access grants, a situation might also arise where we want to explore adding purpose-based access controls to our set of table-based subscription controls. This could be a purpose that a human user would want to carry out such as marketing analytics, which require access to user specific data for cohort analysis, and it could also be a system level task, like an ML model, for instance, for fraud detention. This takes care of table level policies for the most part.
In terms of row-level security, we have another set of requirements that we want to include in our framework. We’d like to be able to filter which rows are returned to a specific user based either on their role or attribute. Additionally, similar to the time-based table access controls, we also like to be able to filter rows, that don’t match the certain recency requirement that we’re able to specify.
Now, perhaps the deepest access control dimension we can discuss here is column-level security. We could honestly take up an entire 60-minute session just on advanced Privacy-Enhancing Technologies or PETs and differential privacy. Unfortunately, for this general authorization talk, I’ll just have to lead this as a topic for research back with your teams, but suffice it to say there is a lot of fancy policies we could apply to data at this level.
For our basic framework, we’ll implement the ability to use roles or attributes to do some relatively simple column asking. This should represent a Pareto optimality for a decent portion of your teams’ enterprise. Let’s talk about implementing the first part of our framework for managing table-level access control. Our first situation, which can be solved with RBAC, is that we want to state that users who are part of the Active Directory group called Finance are allowed to read profit loss data. Now, provided we’ve kept the groups in sync between our corporate directory and Databricks using either the Admin Console, Groups API, or SCIM API, then we can solve this requirements simply with “GRANT SELECT ON TABLE accounting.profit_loss_statement TO finance;” but soon we have other emergent requirements such as users read the attribute executive are allowed to read sales data. This one is a bit more complex. First, we’ll need to store a (user, name, value) triple in some sort of attributes table. Next, we’ll need to create a secure view on top of the original table, since we can’t pass a WHERE clause as a principal, only user or group.
The first thing that we need to do to solve for ABAC in our framework is create a database. I like to call it “security,” but you can call it whatever you want. We need to create a users table, which will store each user in our system, and for each attribute name and value that a user has, create a row in that table. The grain of this table is user, name. One thing you might want to be conscious about is that you might not want a grain public access to this table, unless it’s okay for any user in Databricks to be able to see all the user attributes and values for each user who has access to SQL Analytics.
We can create a VIEW which can basically be thought of as a rich object that adds active views to the existing data bricks dynamic view function called current user. You can grant public access to this view and the individual user will only be able to view their own attributes. This would be like calling a dot method on CURRENT_USER, but we can’t do that, but we can select attributes from this field. Now, putting it all together, users with the attribute, “executive” are allowed to read sales down, so assuming that we have a database called “analytics” and a table called “fct.sales” that has a “sale_id,” “amount,” and “territory” column, we can create a VIEW called “analytics.sec_fct_sales,” selecting all the columns backstage and inner joining it with “select * FROM current_user WHERE name = ‘role’ AND value = ‘executive’.” This will filter out all the rows of the sales table unless the user has the right attribute name and value.
Now, let’s consider our next access control dimension, which is row-level security. This is about satisfying your requirement that says something like, “A user can access a particular sales opportunity, or a sales opportunity matching certain conditions.” Let’s consider a sales dataset that has a territory column, and we only want users with the attribute territory to be able to see rows with the corresponding value in the territory column. To make this more concrete, let’s look at the first seven rows of our “fct_sales” table. We have “sale_id,” we have “amount,” we have “territory.” We want to make it such that that user with an attribute: name, territory, and value of x can only see the rows that are matching x in the territory column. We’ll insert a new row into our users table with a user, name of territory and in my case, US-EAST. I now belong to the US-EAST territory FROM name attributes, for example.
We’ll then add on to our secure VIEW a WHERE clause that says, “WHERE territory IN select * FROM current_user WHERE name = ‘territory’.” In my case, this will resolve to where territory in US-EAST. Now, looking at our sec_fct VIEW, we can see that sail_ID’s 1, 2, and 7 are visible to me and sale_ID’s 3 through 6 are “NO” because of the matching on attributes value in the territory column.
Finally, we have column-level masking. In this case, we don’t work at a very transparent company and only executives can see the amount of a sale. In this case, instead of simply selecting amount from the fact table, we can add a key statement. We can CASE on SELECT * from current_user WHERE name = ‘role’. When it’s executive, we can pass through the amount and when it’s not executive, we can just return “NULL.” In my case, since I was an executive… The row-level policy is still being enforced. I can only see 1, 2, and 7, but I can also see amounts for any rows that I can see. For a user without the executive attribute, all of these amounts would then be masked “NULL.”
That’s it for our basic enterprise-grade authorization framework. Thank you so much for coming to my talk. My name is Zachary Friedman. I’m a Product Manager at Immuta, which provides an Enterprise-grade access controls platform to Data teams just like this. Please ask me anything and please don’t forget to rate and review your session. I look forward to hopefully seeing all of you at Data + AI Summit 2022 in person. Thank you.
Zachary is a PM at Immuta. His team builds a common security layer that governs all data access directly in the warehouse and as a proxy, intercepting all ODBC and JDBC requests and applying a securit...