Scaling Privacy in a Spark Ecosystem

May 27, 2021 03:50 PM (PT)


Privacy has become one of the most critical topics in data today. It is about more than how we ingest and consume data; it is also about how you protect your customers’ rights while balancing the business need. In our session, we will bring Don Bosco Durai, CTO of Privacera, together with Northwestern Mutual to detail an important use case in privacy and then show how to scale privacy with a focus on the business needs. We will make the ability to scale effortless.

In this session watch:
Aaron Colcord, Director, Northwestern Mutual
Don Bosco Durai, Corporate (CIO, CTO, Chief Data Officer), Privacera



Aaron Colcord: Hi, welcome to our session on Scaling Privacy with Apache Spark. My name’s Aaron Colcord. I’m the Senior Director of Engineering at Northwestern Mutual and I’d like to introduce you to Don Bosco, the CTO and Co-Founder of Privacera. Let’s just jump in, take a look at our agenda, and what we’re going to cover today.
We’re first going to cover a little bit of our background, our respective companies and our backgrounds, and why we’re coming here together to speak with you today. We’re going to talk, “Why privacy, security, and compliance?” Why is that important? We’ll talk about our respective approaches. We’ll begin to pace out, “What is an ideal problem solve for those complex problems?” Then we’re going to try to bring real life and actually try to bring it all together, tie it together, with something that we can use.
So my background, I work for Northwestern Mutual. We’re engaged in building an enterprise scale, unified framework for bringing data together in a democratized way. Our company has a very long, respected history of about 160 years. Compliance, privacy, and security, they’re extremely important to us, as almost cornerstones. We are now at the intersection of, “How do you make agile data play along with compliant data?”

Don Bosco Durai: I am Don Bosco Durai. I am the CTO and Co-Founder of Privacera. At Privacera, we are doing security and governance in the cloud for the open data ecosystem. We have built our solution on top of Apache Ranger, which provides centralized access management for all the services in the cloud, which includes traditional Databricks workspaces, Databricks SQL Analytics, S3, ADLS, Redshift, Snowflake, and others. We also do automated discovery of sensitive information in the cloud. We can then use this classification for access control and encryption. I’m also a PMC member and committer on the Apache Ranger project. Aaron, back to you.

Aaron Colcord: Suddenly, why do we care so much about privacy now? Well, the way I like to frame it is we have always cared about privacy and we’ve always relied very deeply on it, but the technology was usually, rightly aligned with how privacy worked and wasn’t actually moving far ahead. But over a decade ago, or 15 years ago, technology suddenly started outpacing our ability to catch up with privacy. Suddenly, we had mobile phones, we had digital businesses, and our ability to actually start capturing all that data started outpacing our ability to catch up. Because privacy is, at its core, an actual policy and an understanding of how you’re going to actually run your business. So we have always really cared about privacy, but it’s really now, as we start looking at these technologies and how to advance and bring data democratization together, where all of a sudden, we have to start looking a little bit deeper at these concerns.
Another reason why we have to care about privacy is our governments have started really caring about privacy. Up until a couple of years ago, privacy regulations really didn’t exist because it was just assumed that privacy was built into your process. We now have new regulations. GDPR, introduced a couple of years ago by the European Union, has helped us start recognizing how we actually have to implement privacy in our technical systems. And CCPA, implemented by California, also starts bringing us regulation on how to do it. More importantly, if you’re in any sort of regulated business, where you actually have audits or you have some responsibility, you also care about privacy because it’s a risk when you’re collecting this type of information about your customer.
So a good way to paint this is, if you’ve ever gone to any sort of website now, you usually see some sort of banner that pops up at the bottom that says, “Accept cookies.” That’s because browsers now are also looking at how they protect your privacy. What do they collect? How do they make you aware of it? Most companies have always published a privacy policy for you to find on their website, but very few people have actually gone in search of it, to really understand how their data is being used. That’s actually a very important aspect for us as consumers: understanding, when we interact with any type of entity, how they are going to use our information. Do they plan to monetize it? Sell it? Trade it? Or do they just plan to use it to help improve our customer experience and allow the company to build their products better, based off of the information that they’re collecting?
So when we round up, privacy is really a policy and a legal obligation for any organization to recognize. And the reality is, as our computer systems are getting better, our ability to automate, our ability to understand, and our implementation of anything artificial intelligence, anything machine learning, are only going to pick up speed. That’s going one way while, the other way, more regulations are arriving to regulate more around privacy. So our ability to actually execute our data programs, our ability to learn, and our ability to improve our own projects and products, is highly dependent upon our ability to respect our users’ rights. If anything, just the ability for us to mature our own programs means that we do have to implement a governance program to understand what’s going on inside our organization.
Now another aspect, technology like Apache Spark has really opened up our capability to democratize our data. The ability to process data, to sift through it, look through it, and create new use cases has been greatly enabled by Apache Spark. Another aspect is that almost every company now is looking at how to actually make their data and their data programs more accessible to their business users. That’s through making marketplaces that can enrich and share the data. And then that starts raising the fundamental questions of privacy. Who in the company can actually view it? Do they actually have the controls to protect the information from who’s actually looking at it? Can we verify that the information is actually being used for the right purpose? Those are important questions that we have to care about, in order for us to actually implement the privacy program.
And so, we’ll put the emphasis on the rightmost concern because, what is, “Privacy?” It is fundamentally different than the other two, leftmost titles of, “Security,” and, “Compliance.” Privacy, at its fundamental core, is having the authorized and legitimate usage of that data. That is fundamentally different than security, because security is just protecting the authorized usage of data and keeping out those who shouldn’t be using the data. But privacy is interesting because we’re now talking about a scenario where a business user may actually have the secured right to look at the data, but that doesn’t actually ensure that there’s privacy in the actual usage of the data. And privacy goes deeper when it starts talking about the actual owner of the data. What are your consent rights? Did you actually authorize this? Did you actually want the data shared? And this is a slippery space because privacy is ultimately a policy implementation of your organization, based off of how you understand the legal obligations in how you do business, those intersections. It’s different because compliance, ultimately, is how you’re balancing security and privacy as your main concerns.
Let’s examine a couple of strategies to scale agile data with privacy. So this has now begun the [inaudible] off, a little bit about how do we get to our ideal system? Could we actually build our ideal system by building a metadata layer that defines PII in the schema? Well, it’s flexible. We can certainly change it extremely fast. Users and developers can and will change where the PII is stored. That means we can actually update it fairly quickly, if we discover a new field needs to have privacy or security on it. But there are fundamental problems with that, meaning, we don’t really know if that field is correct. And we can basically chase our users around and chase people around to find out, “Is this actually correct?”
Or, we could also take another approach where we say, “Look, let’s go build some views with the permissions and we’ll just limit those to certain users as we start understanding them.” But that’s not very scalable. It takes a little bit of work to actually build a view, takes a little bit of time to actually go deploy it, and then, we always have to show who’s accessed it and why. We actually have to understand that information. And really, if you think about it, these scenarios are really security scenarios, not really privacy scenarios, because we still didn’t actually talk about, “Was it the legitimate usage of the data?” And, “How did we track it?”
So let’s talk a little bit about further in those challenges. When we talk about the metadata being flexible, could we actually take another approach and start thinking in terms of policies? We define policies. Can we actually look at this attribute and say that this is classified in a certain way, and actually represents a certain value. And because of its certain value and what it represents to us, then we should actually secure it, based off of who’s actually accessing it and understanding that data. That seems a little bit more flexible and a little bit more to what we’re trying to do with privacy. We’re trying to secure it against the usage of the data.
Could we also keep up with our view strategy? Well, the first time we do our project, we’ll discover maybe we use 15,000 fields and we have inventoried them all properly, but then suddenly, we are asked to add one more thing and we discover another 10,000 fields, and then maybe another 5,000, another 2,000. Quickly, you’re outpaced, your framework isn’t scaling, and you’re falling behind.
So thinking in terms of how you implement the same type of access control, but making it dynamic, would bring us to an ideal state. And remember, security is fundamentally different than privacy. Security is not only a different domain, but a different set of skills. It is about whether or not the door is locked or the door is open. And we really are talking about how we’re protecting the usage of our customers’ data, the people that contributed our data, and then, how are we actually making sure that it’s properly used and leveraged?
So with this, I’ll pass this off to Don Bosco, to talk about how we can solve our problems.

Don Bosco Durai: Let’s see, in real life, what happens, right? Our ultimate goal is to ensure that we can protect our customers’ personal and sensitive information, that we store in our environment. And also, we can use them as per the consent that they give to us. There are a lot of compliance regulations out there to ensure that we follow that.
To go through a few use cases, if GDPR or CCPA is applicable to you, then your customers can ask you not to use their personal data for marketing purposes. Then, you are required to expunge their data from the datasets that are used for marketing purposes. In the healthcare industry, a patient might ask you to provide all their personal data that has been used, so you’re obligated to provide that information to them. In other use cases, under almost all compliance policies, the constraints on the original source data need to be carried forward to wherever the data has been copied.
If you feel it is already difficult, in the real world it’s a lot more complicated. If you look at this data diagram, the bottom layer is [present] storage. Data could be stored in object stores, like S3 and ADLS, in different file formats, like Parquet and [inaudible]. You also have data in databases, like Redshift, Snowflake, and Synapse. The layer above that is a SQL engine. You have tools like data analytics to access data from object stores, and these tools provide a structured presentation of that data, similar to a traditional database. On the left-hand side, you have the different personas. You have data scientists and architects primarily accessing the raw data at the lower level or the database level. Then on top of it, you have the query evaluation tools, like Dremio and Trino, which provide both data abstraction and performance benefits. You have a different set of users and roles accessing these. Data analysts may use Dremio or Trino for doing their analysis. You have business users who may use the dashboards, or they may use BI tools like Power BI and Tableau.
Then, in an open data ecosystem, data can move from one system to another system. Or, different tools might access the same underlying data source for different purposes. So regardless of which tool is used, or who is accessing the data, you have to have the same policies consistently applied across the board.
To make it more difficult for enterprises, some of the tools used to enforce security and privacy policies could be the same. The privacy team might have their own set of requirements, which will be based on state, country, and other industry regulations. Meanwhile the security team, which is responsible for preventing unauthorized access to data, data leakage, and encryption, might end up using a similar tool, or you may build one tool on top of the other. And then, there are data owners who are more concerned about how their data has been used and for what purpose. All of them need some level of monitoring. They may not need the same reports, but slightly different types of reports or audit records. So we have to make sure that when we’re implementing policies, we’re not overriding each other’s policies, otherwise it would be very difficult to enforce anything. You also need to manage them much more holistically.
The best way to solve this problem is to use the right set of tools. First, we need to classify sensitive data and personal data in all systems. Then, use this classification to manage access policies, encrypt PII fields, and do data [cleansing]. Finally, centralize the audits from all systems so that we can do checks and balances, generate compliance reports and security reports, or do attestations of entitlement. We’ll go through each one of them in depth.
So let’s start with data discovery. First, as Aaron mentioned, it’s pretty much impossible at today’s scale to manually classify everything, so you have to automate everything. Then, when you’re doing classification, try to be as granular as possible. For example, if you have a Parquet file, it’s better to classify at the field level. Rather than just saying that a file contains SSNs, if you can identify which field is the SSN, then it’s easier for you to encrypt or mask that field in the future. Also, instead of classifying data as confidential or non-confidential, it’s better to classify in as much detail as possible. If you have phone numbers, SSNs, and emails, then classify them as phone numbers, SSNs, and emails.
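As an illustration only (this is not from the talk), field-level classification can be sketched in plain Python. The patterns, the `classify_fields` helper, the 80% threshold, and the sample rows are all hypothetical; a real scanner would use far richer rules and would typically run inside a Spark job over sampled data.

```python
import re

# Hypothetical, deliberately simple patterns for three fine-grained tags.
PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PHONE": re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$"),
}

def classify_fields(rows, threshold=0.8):
    """Return {field_name: tag} for fields whose sampled values mostly match a pattern."""
    tags = {}
    if not rows:
        return tags
    for field in rows[0]:
        values = [str(r[field]) for r in rows if r.get(field) is not None]
        if not values:
            continue
        for tag, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                tags[field] = tag  # tag the individual field, not the whole file
                break
    return tags

# Hypothetical sample of rows read from one file.
sample = [
    {"id": "1", "contact": "ann@example.com", "tax_id": "123-45-6789", "mobile": "555-867-5309"},
    {"id": "2", "contact": "bob@example.org", "tax_id": "987-65-4321", "mobile": "(555) 123-4567"},
]
field_tags = classify_fields(sample)
```

Because the result names the specific field and the specific kind of data, a later step can mask or encrypt just `tax_id` rather than treating the whole file as sensitive.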
Then, you can build a hierarchy on top of it. You can say, phone number is confidential, yes or no. Or, SSN is confidential, yes or no. The reason is, policy changes very often. Today, email address might not be considered confidential data, but tomorrow that might change. You don’t want to rescan and reclassify everything. If you have already scanned and classified email addresses, and if you have built a hierarchy, you can just mark email address as confidential data. That way, your policies can be much more efficient.
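A minimal sketch of that hierarchy idea, assuming fields have already been given fine-grained tags. The `SENSITIVITY` mapping, the level names, and the helper are hypothetical and only illustrate why a policy change does not require a rescan.

```python
# Hypothetical hierarchy: fine-grained tag -> sensitivity level.
SENSITIVITY = {"SSN": "CONFIDENTIAL", "PHONE": "CONFIDENTIAL", "EMAIL": "PUBLIC"}

# Hypothetical output of an earlier scan: field -> fine-grained tag.
FIELD_TAGS = {"contact": "EMAIL", "tax_id": "SSN"}

def confidential_fields(field_tags, hierarchy):
    """List fields whose tag maps to CONFIDENTIAL; unknown tags fail closed."""
    return sorted(
        field
        for field, tag in field_tags.items()
        if hierarchy.get(tag, "CONFIDENTIAL") == "CONFIDENTIAL"
    )

before = confidential_fields(FIELD_TAGS, SENSITIVITY)

# Policy change: email becomes confidential. Only the hierarchy is edited;
# the scan results in FIELD_TAGS are reused untouched.
updated = dict(SENSITIVITY, EMAIL="CONFIDENTIAL")
after = confidential_fields(FIELD_TAGS, updated)
```

Here `before` contains only `tax_id`, while `after` also includes `contact`, without any data being rescanned.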
And also, where possible, try to carry forward the source of the data as the data moves from one place to another. For example, if you are collecting patient information from a hospital visit, then due to HIPAA regulations, you can’t use that data for marketing. But the same or similar data may be available for the same user from another channel. Let’s assume the consumer subscribed to a newsletter and gave you their email address, and they may be okay with you using that data for marketing. So if you have the source data of [inaudible], then you can apply different policies for different purposes or uses of the data.
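To make the source-aware idea concrete, here is a small hypothetical sketch in which each record carries its source and a policy table says which purposes each source permits. The source names and the `ALLOWED_PURPOSES` table are illustrative assumptions, not a statement of what any regulation actually allows.

```python
# Hypothetical policy: which purposes are permitted for data from each source.
ALLOWED_PURPOSES = {
    "hospital_visit": {"treatment"},
    "newsletter_signup": {"marketing", "support"},
}

def usable_for(record, purpose):
    """True if the record's source permits this purpose; unknown sources permit nothing."""
    return purpose in ALLOWED_PURPOSES.get(record["source"], set())

# The same email address arrives through two channels, each tagged with its source.
records = [
    {"email": "ann@example.com", "source": "hospital_visit"},
    {"email": "ann@example.com", "source": "newsletter_signup"},
]
marketing_sources = [r["source"] for r in records if usable_for(r, "marketing")]
```

Only the newsletter copy is usable for marketing, even though the value itself is identical in both records.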
Now, the main challenge that comes up is, how am I going to tag it and when am I going to tag it? Unfortunately, there’s no one place or one way to tag. You have to start tagging at the time of ingest because at that time, you know the source of the data. You need to be using tools like Apache Spark, which can help you scan and classify data. In the open data ecosystem, data moves from one place to another. So even though you might’ve tagged at the source, at the time of entry, the data may move around, so it’s very important that you propagate the tags to wherever the data is going.
And this is unfortunately very difficult. With all the tools that we have, it’s very difficult to keep track of the tags and where the data is going. One way to do it is to instrument your transformation jobs: as a job runs, you can add additional code to it. Or you can use tools like Spline for Spark, which can do it for you automatically. It can capture the [inaudible] and you can use that [inaudible] to do stitching and propagate the tags.
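As a rough sketch of tag propagation, assume some tool has already produced column-level edges of the form (source column, derived column). The edge list, the column names, and the `propagate_tags` helper below are hypothetical and do not reflect the output format of any particular tool.

```python
from collections import defaultdict, deque

def propagate_tags(source_tags, edges):
    """Push tags from tagged columns to every column derived from them."""
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)

    tags = {col: set(t) for col, t in source_tags.items()}
    queue = deque(tags)
    while queue:
        col = queue.popleft()
        for dst in downstream[col]:
            before = len(tags.setdefault(dst, set()))
            tags[dst] |= tags[col]
            if len(tags[dst]) > before:  # revisit only when something new arrived
                queue.append(dst)
    return tags

# Hypothetical tags assigned at ingest, and hypothetical edges between columns.
source_tags = {"raw.patients.ssn": {"SSN"}, "raw.patients.email": {"EMAIL"}}
edges = [
    ("raw.patients.ssn", "staging.patients.ssn_hash"),
    ("raw.patients.email", "staging.contacts.email"),
    ("staging.contacts.email", "mart.campaign.recipient"),
]
all_tags = propagate_tags(source_tags, edges)
```

This sketch is conservative: a derived column inherits every tag of its inputs, so a hashed SSN column stays tagged as SSN until someone explicitly decides otherwise.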
And there’s one more thing. You may get data into your system through a formal ETL process, but there are also users in your company who may upload their own data. They may get data from open sources, they may download it from somewhere, and they may upload it into your system. And some of that data might have sensitive information, so you have to keep a watch for that. As soon as someone uploads any data, you can automatically scan and classify it. And if you feel they’re not supposed to bring that data in, you should [inaudible] another target and remove [if possible].
The next is access control. Traditionally, we have been using resource-based policies. What that means is you’ve been setting policies at the table or the file level. But this is not a scalable model. If you look at today’s systems, you have millions and millions of files or objects stored in S3 or ADLS. You may have thousands of tables and columns. It’s very difficult to keep the policies consistent across different systems. If you’re using data classification, once you classify the data, it becomes very easy to manage policies based on tags. You can set who can see what tags and, again, make sure these policies are enforced consistently across all the different systems. You don’t have to worry about whether it’s a table or a file; if the data is tagged, you can allow or deny it.
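A minimal sketch of a tag-based check, to contrast with per-table or per-file rules. The role names, resource identifiers, and both lookup tables are hypothetical; this is not how Apache Ranger or any other product represents policies.

```python
# Hypothetical policy: tag -> roles allowed to read data carrying that tag.
TAG_POLICIES = {
    "SSN": {"compliance"},
    "EMAIL": {"compliance", "marketing"},
}

# Hypothetical catalog: resources of any kind (file field, table column) -> tags.
RESOURCE_TAGS = {
    "s3://bucket/patients.parquet#tax_id": {"SSN"},
    "warehouse.crm.contacts.email": {"EMAIL"},
}

def can_access(user_roles, resource):
    """Allow only if the user holds an allowed role for every tag on the resource.

    In this sketch, untagged resources are not restricted by tag policies,
    and a tag without a policy entry is denied to everyone.
    """
    for tag in RESOURCE_TAGS.get(resource, set()):
        if not (user_roles & TAG_POLICIES.get(tag, set())):
            return False
    return True

marketing_sees_email = can_access({"marketing"}, "warehouse.crm.contacts.email")
marketing_sees_ssn = can_access({"marketing"}, "s3://bucket/patients.parquet#tax_id")
```

The policy table has one entry per tag regardless of how many files or tables carry that tag, which is what keeps it manageable as the number of resources grows.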
The other thing that you may want to do is see if you can use dynamic row filters. This is a better alternative to views because views [inaudible] has been properly used. If you have a lot of use cases, it may lead to an explosion of views, which is very difficult to manage. Another downside of views is that the user who’s running the query needs to know the name of the view, and it is very difficult to keep track of all the different views that are possible and what they support.
With a dynamic row filter, one good thing is that when users run a query on the same table, a different set of data is returned based on the user and the purpose of the query. So if you have a data analyst who is running a query for marketing purposes, then the dynamic row filter can filter out the data for the users who have not given consent. This way, you don’t have to duplicate data. You can provide access to the same table to two different users, for two different purposes. And you can keep changing the policy, depending upon the privacy constraints.
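The row-filter behavior can be sketched without any particular engine. In the hypothetical example below, the table, the `marketing_consent` column, the purpose names, and the `query` helper are all assumptions; a real system would inject the predicate into the query plan rather than filter a Python list.

```python
# Hypothetical customer table; marketing_consent records each customer's choice.
CUSTOMERS = [
    {"id": 1, "email": "ann@example.com", "marketing_consent": True},
    {"id": 2, "email": "bob@example.org", "marketing_consent": False},
    {"id": 3, "email": "cy@example.net", "marketing_consent": True},
]

# Hypothetical policy: one row predicate per declared purpose of the query.
ROW_FILTERS = {
    "marketing": lambda row: row["marketing_consent"],
    "billing": lambda row: True,
}

def query(table, purpose):
    """Return the rows of `table` visible for `purpose`; unknown purposes see nothing."""
    predicate = ROW_FILTERS.get(purpose, lambda row: False)
    return [row for row in table if predicate(row)]

marketing_ids = [r["id"] for r in query(CUSTOMERS, "marketing")]
billing_ids = [r["id"] for r in query(CUSTOMERS, "billing")]
```

Both callers query the same table by the same name; only the policy lookup differs, so changing a privacy constraint means editing `ROW_FILTERS` rather than creating or renaming a view.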
You can apply the same thing for dynamic masking, and for decryption as well. With dynamic masking, if the user does not have permission to see the data, let’s say, email address, then it will be automatically masked. And the query will be running the same, whether they have [inaudible]. Similarly, for decryption, let’s assume you’re encrypting the data [inaudible] and there’s a certain user who has the permission to see the data in clear text. Then you can dynamically decrypt it [at your own time]. In this way, you don’t have a duplicate dataset.
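Dynamic masking can be sketched the same way: the query stays the same, and values in tagged columns are rewritten on the way out when the caller is not allowed to see that tag. The masking functions, column tags, and `select` helper below are hypothetical. Decryption is left out because it would depend on a specific key management setup.

```python
# Hypothetical masking functions, one per fine-grained tag.
MASKERS = {
    "EMAIL": lambda v: v[:1] + "***",
    "SSN": lambda v: "***-**-" + v[-4:],
}

# Hypothetical column classification for one table.
COLUMN_TAGS = {"email": "EMAIL", "tax_id": "SSN"}

def select(rows, column_tags, allowed_tags):
    """Return copies of rows, masking columns whose tag the caller may not see."""
    result = []
    for row in rows:
        out = {}
        for col, val in row.items():
            tag = column_tags.get(col)
            if tag is not None and tag not in allowed_tags:
                out[col] = MASKERS.get(tag, lambda v: "***")(val)
            else:
                out[col] = val
        result.append(out)
    return result

rows = [{"id": 1, "email": "ann@example.com", "tax_id": "123-45-6789"}]
analyst_view = select(rows, COLUMN_TAGS, allowed_tags={"EMAIL"})
auditor_view = select(rows, COLUMN_TAGS, allowed_tags={"EMAIL", "SSN"})
```

The stored rows are never modified, so one copy of the dataset serves both callers.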
You should also plan to integrate your access control with approval workflows. In this way, you can track who’s requesting and approving access, and you can put additional constraints, like time of use, duration of access, or anything else that may be required from the privacy point of view. It’s also a tool for coming up with better reports.
So let’s go to the next one. The next one is auditing and reporting. Auditing is another challenging part. There are so many different systems and logging mechanisms. It’s very difficult to keep track of who is accessing data for what purpose. You should try as much as possible to centralize the audits in one location. There are a lot of advantages to this. First of all, you can see all the audit logs in one place. Second, you can generate reports to meet different requirements. Your compliance team might require one set of reports, while your data owners may want to know who’s accessing the data and for what purpose. The security team may need audits for some other reason. With everything in one place, it’s much easier for you to consolidate all the accesses and come up with customized reports as you need them. And if you have great classification, then you can also do more [inaudible], like who is accessing private data, from which system, and for what purpose.
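A small sketch of what centralizing audits can mean in practice: events from different systems are normalized into one record shape, and reports are then simple queries over the combined log. The two source systems, their field names, and both helpers are hypothetical.

```python
from collections import Counter

def normalize(system, event):
    """Map a system-specific audit event onto one common record shape."""
    if system == "warehouse":
        return {"system": system, "user": event["user_name"],
                "resource": event["table"], "tags": set(event.get("tags", []))}
    if system == "object_store":
        return {"system": system, "user": event["principal"],
                "resource": event["key"], "tags": set(event.get("labels", []))}
    raise ValueError(f"unknown audit source: {system}")

def sensitive_access_report(audit_log, tag):
    """Count accesses to data carrying `tag`, grouped by (user, system)."""
    return Counter((e["user"], e["system"]) for e in audit_log if tag in e["tags"])

# Hypothetical raw events, each in its own system's format.
raw_events = [
    ("warehouse", {"user_name": "ann", "table": "crm.contacts", "tags": ["EMAIL"]}),
    ("object_store", {"principal": "bob", "key": "patients.parquet", "labels": ["SSN", "EMAIL"]}),
    ("warehouse", {"user_name": "ann", "table": "crm.orders"}),
]
audit_log = [normalize(system, event) for system, event in raw_events]
email_report = sensitive_access_report(audit_log, "EMAIL")
ssn_report = sensitive_access_report(audit_log, "SSN")
```

Because every record carries the classification tags, the same log can answer the compliance team’s question and the data owner’s question with different filters.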
So to wrap it up, for scaling privacy in today’s open data ecosystem, you need to implement automated data discovery, you need to have centralized access control, and you need a centralized audit function.

Aaron Colcord: We’d like to thank you very much for attending our session and learning a little bit about how to scale privacy with Apache Spark. For any feedback and questions, let us know and we’d love to answer any questions that you have.

Aaron Colcord

Aaron is a Sr. Director of Data Engineering who has built multiple data lakes and Data Warehouses for major companies across the financial space. He has spent the last couple of years working passiona...

Don Bosco Durai

Don Bosco Durai (Bosco) is a serial entrepreneur and thought leader in enterprise security. His earlier startup Bharosa built one of the first real-time user and entity behavioral-based fraud detectio...