Brokering Data: Accelerating Data Evaluation with Databricks White Label

May 28, 2021 11:40 AM (PT)


As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge – demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. In smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets reach the terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high quality dataset” can change daily or even hourly.


This breakout session will describe Veraset’s partnership with Databricks, and how we have white labeled Databricks to showcase and accelerate the value of our data. We’ll discuss the challenges that data brokers have faced to date and some of the primitives of our business that have guided our direction thus far. We will also actively demo our white label instance and notebook to show how we’ve been able to provide key insights to our customers and reduce the TTFB of data onboarding.

In this session watch:
Vinoo Ganesh, Chief Technology Officer, Veraset
Nick Chmura, Director of Data Engineering, Veraset



Vinoo Ganesh: Hi everyone. My name is Vinoo Ganesh and I’m the Chief Technology Officer at Veraset. With me, I have Nick Chmura, the Director of Data Engineering at Veraset. Welcome to Brokering Data: Accelerating Data Evaluation with Databricks White Label. A little bit about what we’re going to cover today. First, we’ll run through some background about both Nick and myself, as well as why this is particularly interesting to our company, Veraset. We’ll talk about the goals and what we hope you take out of this session before diving into the data ecosystem and discussing what we call data primitives, some of the primitives that govern data businesses. We’ll talk a little bit about the brokerage ecosystem specifically and the challenges that we have as data brokers, before introducing what I call The Broker’s Dilemma and eventually showing how we at Veraset have solved this problem with a demo.
A little bit of background. Again, my name is Vinoo. I’m the CTO of Veraset. Nick here is the Director of Data Engineering at Veraset. Veraset is a data-as-a-service startup focused on anonymized geospatial data. We do heavy model training at scale. Our data was used during COVID-19 investigation and analysis to measure the effect of non-pharmaceutical interventions, as well as social distancing as a whole. Each year, we process, analyze, and deliver over 2 petabytes of data. Data itself is our product. So, we don’t build any analytical tools or fancy visualizations. Our entire product is predicated on optimizing data storage, retrieval, and processing. In that sense, we are just data, and that makes our challenges as a data brokerage firm unique in a number of ways. Diving into some of the goals of our session, we first and foremost want to explain the data brokerage ecosystem, as well as some of the challenges of brokering data that many may not actually think about. As always, concerns of data sensitivity, privacy, and security are top of mind.
We’ll dive into the conceptual notion of agnostic brokerage, that is, technology-agnostic, system-agnostic data brokerage, as well as how brokers differentiate data, and show you this in practice in Databricks White Label. Let’s talk about the data ecosystem as a whole. The world has been inundated with data, both clean and unclean. Datanami says data scientists spend about 45% of their time on data preparation tasks. Depending on who you ask, I’ve heard this number anywhere from 45%, actually the lowest, to about 80%. When we say data preparation, we mean everything from cleaning to analyzing, just the basics, like getting data into a form that’s ready for either model training or standardization of any kind. This has resulted in a few things. Just like software APIs have uptime SLAs and guarantees, there are now uptime SLAs around data. We expect data to arrive at certain intervals and be accessible, maybe even at a five-nines level of uptime.
Along the same lines, the scale of data itself has been increasing. In the past, it was very easy to do analytics on a single machine. Now we’re spinning up hundreds of machines simply to run basic queries or just look at our dataset in some way. That has created a co-dependence between the analytics firms and the data brokers. One can’t demonstrate value without the other. Even companies like Databricks rely heavily on the existence of data that’s clean, easy to use, and high in volume, just like brokerage firms like ourselves require firms like Databricks to actually analyze the data, dive into it, extract insights from it, and overall just bring meaning to our data. Underlying this entire ecosystem are questions around data sensitivity, privacy, and security, and really everything about how to keep the data secure, usable, and safe for model training or any kind of analysis.
Now, the existence of this data ecosystem and the trends that we’ve seen thus far have led us to develop internally what we call data primitives. These data primitives are how we think about datasets as a whole. First and foremost, data is our API, meaning as a data brokerage firm, schema changes and format evolutions are all major version breaks. We can’t make a major change like this without informing all of our customers, actually managing staged releases of our data, and, of course, moving forward with constant, frequent customer communication. As such, data also has an uptime: if we change individual pieces of our data, including something as simple as the cardinality of a column, that can actually break our customers’ downstream pipelines. So the notion of data as an API actually governs our business.
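To make the “data is our API” idea concrete, here is a small, purely illustrative sketch (the function and schemas are invented, not Veraset’s real tooling) of treating a dataset’s schema like a versioned API surface, where any structural edit counts as a breaking change:

```python
# Hypothetical sketch: a dataset schema treated as a versioned API surface.
# Any column added, removed, or retyped is a breaking ("major") change that
# customers must be told about before release. All names are illustrative.

def classify_schema_change(old_schema: dict, new_schema: dict) -> str:
    """Return 'major' for breaking changes, 'patch' otherwise.

    Schemas are simple {column_name: type_string} mappings.
    """
    removed = set(old_schema) - set(new_schema)
    added = set(new_schema) - set(old_schema)
    retyped = {c for c in set(old_schema) & set(new_schema)
               if old_schema[c] != new_schema[c]}
    if removed or added or retyped:
        # A broker can't silently ship any of these: downstream pipelines
        # key off column names and types, so structural edits are major.
        return "major"
    return "patch"

v1 = {"device_id": "string", "latitude": "double", "longitude": "double"}
v2 = {"device_id": "string", "lat": "double", "lon": "double"}  # renamed cols
print(classify_schema_change(v1, v2))  # -> major
```

The point of the sketch is the policy, not the code: renaming `latitude` to `lat` looks cosmetic, but to a customer’s pipeline it is as breaking as deleting the column.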
Somewhat similarly, data itself is inherently opinionated. I don’t just mean what’s inside the data or the data itself, I mean the delivery mechanism of the data. Everything from open or proprietary data formats to technology-specific formats dictates how your users are expected to use the data. Something as simple as, say, delivering data in CSV versus Parquet changes the ecosystem that users can actually use to analyze their data. Additionally, we have formats focused on individual workflows; partitioning on a per-row or per-column basis says a lot about how you as the provider expect the data to be used. In other words, the presentation of data, as a whole, says a lot about your company and your brokerage firm. So as a primitive, the way we present data is inherently opinionated. Finally, data is only as useful as it is easy to use, meaning used expensive data is always significantly better than unused cheap data.
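One common way this opinion shows up in practice is the directory layout itself. A minimal sketch, assuming Hive-style `key=value` partition paths of the kind Spark and Parquet readers conventionally discover (the bucket and field names below are made up):

```python
# Illustrative sketch: the on-disk layout a broker chooses is itself an
# opinion. Hive-style key=value partition paths tell consumers "filter by
# these columns first". The bucket, fields, and filename are invented.

import posixpath

def partition_path(base: str, partitions: dict, filename: str) -> str:
    """Build a Hive-style partitioned object key, e.g.
    .../date=2021-03-01/country=US/part-0000.snappy.parquet
    """
    parts = [f"{k}={v}" for k, v in partitions.items()]
    return posixpath.join(base, *parts, filename)

key = partition_path("s3://bucket/visits",
                     {"date": "2021-03-01", "country": "US"},
                     "part-0000.snappy.parquet")
print(key)
```

Partitioning by `date` first signals that the provider expects time-bounded queries; a consumer who mostly filters by `country` would pay a full scan, which is exactly the kind of opinion a broker bakes into a delivery.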
There is such a thing as a data graveyard; there are a number of firms that actually procure and buy data that ends up never being used, simply because the data as a whole isn’t easy to use or it’s just not really necessary. So making data as easy to use as possible is a major focus of an analytics firm and a data brokerage firm. Given these data primitives, just like in the software world, there are people that want to impose these data primitives on individual datasets in the ecosystem. And those folks are called the data brokers. Data brokers are to data what software vendors are to software. We make the data easy to use. We source, clean, package, and distribute data by imposing the SLAs that I just described on the data. Meaning, we want to make it as easy as possible to work with data and make it as flexible as possible for our customers to realize value from this data.
In that sense, our goal is really to remove the complexities associated with operationalizing data, in addition to maintaining the data and securing the data. What’s interesting about this field as a whole is really just this statistic here: this is actually a $200 billion industry. Meaning, folks who are working in the data brokerage space have a bunch of money, time, energy, and resources backing them. Additionally, looking at individual segments of data, something as simple as in the bottom left-hand corner, the average value of an email address over time is actually $89. It’s even higher for travel, or a little bit lower for retail at $84. So the information as a whole is immensely valuable, but it can only really be used if the data itself has the SLAs that we described on it. Today, brokerages are effectively the catalyst allowing folks to use data and reduce that number, the 45 to 80% of time that they spend cleaning data. Now, this all sounds great in theory, but working in the data brokerage ecosystem presents a number of very complex and very involved challenges that we’ll talk through.
First and foremost, making data easy to use, as a whole, isn’t easy. Everything from the technical specifications, meaning the file format as I discussed with Parquet versus CSV, to the level of partitioning, to even the data distribution, as a whole, says a lot about you as a data broker. Many firms can’t actually use formats like CSV or Parquet. Many firms rely on certain kinds of partitioning, or they’ve locked in their on-prem compute environments such that only so much of the data can live in memory at once.
The second challenge here is variable compute. When you are a firm trying to evaluate procuring data, knowing the cost of the evaluation process is pivotal. Oftentimes, though, you actually have no idea how much it’s going to cost or how many EC2 resources you have to spin up to do a sane data evaluation. That makes the variable compute challenge incredibly difficult. Now, as a brokerage firm, in order for you to evaluate my data, I have to not only convince you of the value of my data, but somehow scope the cost of the evaluation as a whole. When in reality, you may not know how much it’s going to cost, or, depending on the size of the data, it could be wildly inefficient to query it the wrong way.
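A back-of-envelope sketch of why this is so hard to scope. Every number below is an assumed placeholder (not a real AWS or Databricks price), but it shows how quickly a naive full scan of a petabyte-scale dataset becomes expensive:

```python
# Back-of-envelope sketch of the "variable compute" problem: before an
# evaluation starts, neither side knows the bill. Throughput and hourly
# rate below are assumed placeholders, not real AWS/Databricks prices.

def estimate_scan_cost(dataset_tb: float,
                       gb_per_node_hour: float = 200.0,   # assumed throughput
                       nodes: int = 10,
                       usd_per_node_hour: float = 1.50):  # assumed rate
    """Rough (cost_usd, wall_clock_hours) for one full scan of the data."""
    total_gb = dataset_tb * 1024
    node_hours = total_gb / gb_per_node_hour
    wall_hours = node_hours / nodes
    return round(node_hours * usd_per_node_hour, 2), round(wall_hours, 2)

# ~2 PB, roughly the yearly volume mentioned earlier in the talk
cost, hours = estimate_scan_cost(2048)
print(cost, hours)
```

Under these assumptions a single naive scan runs to five figures in dollars and six weeks of wall-clock time on ten nodes, which is exactly why an unscoped evaluation is a non-starter for both sides.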
Third, getting the right tools, the right environment, the right libraries, the right, what I’m going to call, opinions in place is extraordinarily difficult. As a geospatial analytics firm, we tend to prefer technologies like Sedona, but actually getting that library installed in a customer’s environment is complicated and painful. Fourth is what I call TTFB, “time to first byte”. This is usually a networking term, but for us, we want to minimize the time to first byte for people interested in this type of data. This means that in order for someone to buy the data right now, we have to have them go through a generally synchronous process of signing an NDA, onboarding, sending data; it becomes very complicated. But if you want to evaluate data and we can share that with you seamlessly, that puts us in a much more powerful position. We also want to make sure we’re securing the IP. As a brokerage firm, data is our intellectual property. If we can’t secure it and we just send it out, we immediately lose our value.
Finally, let’s talk about product differentiation. In order for me as a broker to differentiate my product, I can only really give you statistics, meaning a row count or even some level of distribution. You have no way of actually understanding and seeing the data without getting your hands on it, which again, risks securing the IP, getting the variable compute in place, and everything else we’ve spoken about.
Now, you’d think I’d be done, but we have even more. Live data is significantly better than static data, and live compute is significantly better than static compute. Meaning, as a broker, if I just send you a dataset and I have no idea what you’re actually doing with it, or at worst I’m sending you outdated data, that’s not going to be great for your evaluation process. Data changes over time; getting you the most up-to-date data that you can play with is the best way to differentiate myself from other competitors.
Next, auditing query access. Once data is just sent out, it’s really hard for me as a brokerage firm to understand how you’re using that data. Meaning, I would love to see what kind of queries you’re running or what columns you care about so I can improve our quality metrics. But once the data leaves my enclave, it becomes very difficult to understand anything on that front. Third, we define clear quality metrics and create a narrative in a sales presentation, but when we’re actually doing this in practice, it’s really hard to show how we want you to evaluate our data. So if only we had a way to actually preload a set of information, or a set of steps that we would want you to follow to evaluate our data, that would be immensely powerful. Fourth, with security and privacy, and even fifth with [inaudible] reliable permissions, we want to ensure that the data that we give you is scoped to exactly your use case. You don’t have access to any data you shouldn’t have access to, and most importantly, nothing is being leaked along the way.
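The audit-trail idea can be sketched in a few lines. This is a purely illustrative simulation (the users, paths, and log shape are invented, and real systems like Ranger emit far richer events), but it captures the information a broker wants per access attempt:

```python
# Minimal sketch of the audit trail described above: every read attempt is
# recorded with user, resource, outcome, and timestamp, so the broker can
# see what a prospect actually touches. Purely illustrative, not Ranger.

import datetime

audit_log = []

def record_access(user: str, resource: str, allowed: bool) -> None:
    """Append one structured event per access attempt."""
    audit_log.append({
        "user": user,
        "resource": resource,
        "allowed": allowed,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

record_access("nick", "s3://demo/visits/date=2021-03-01/", True)
record_access("vinoo", "s3://demo/visits/date=2021-04-01/", False)

denied = [e for e in audit_log if not e["allowed"]]
print(len(audit_log), len(denied))  # 2 events, 1 denial
```

Even this toy log answers the two questions raised above: which objects a user cares about (group by `resource`) and how often they come back (count by `user` over time).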
So let’s introduce what I call “The Broker’s Dilemma”. The Broker’s Dilemma, and you don’t have to read this entire paragraph, is really: how can brokers demonstrate value, protect our IP, and differentiate our product, everything I spoke about, all while making the data easy to use? We thought about this for so long, and we realized that rather than trying to build something in-house, why don’t we leverage best-in-class tools for allowing people to handle data securely, do data analytics in a workflow-driven and preloaded way, and protect our own IP. So our solution was a partnership between Veraset, our company, Databricks, and Privacera as a whole. We’ll talk a little bit about these firms and we’ll show you these firms live. But we use Veraset for the data, Databricks for the analytical component, and Privacera, with a tight integration with Databricks, for the security and privacy of our data.
Just briefly signposting: our solution is going to show how we branded a white-labeled Databricks instance with preloaded notebooks, as well as preloaded data. We used fine-grained access control read/write permissions to actually control who could read our data and how they could read it, along with an audit trail of exactly who was doing what and when. We also scoped in row-level security. This allows us to accelerate getting data into the hands of the user in a monitored and audited way. But rather than talking about it, I’ll hand it over to Nick, who will actually demo our White Label Databricks instance.

Nick Chmura: Hi, I’m Nick Chmura. I’m the Director of Data Engineering at Veraset and today I’m going to give you a tour of the Veraset Databricks White Label instance. I’m going to show you how it can allow potential customers to evaluate our data without compromising our data security. I’m also going to be showing you Privacera and how that works with Apache Ranger to also increase our security and at the same time accelerate our potential customers access to our data.
The first thing I want to show you is that this is a custom Databricks workspace that we set up for customer evaluation. It has a custom-built cluster for potential customers. It has an instance profile that only has access to the AWS and S3 resources that it needs. This is all completely separate from our production workflow and our production environment. So, this is locked down and specifically built for this purpose. And of course, you can also see that it’s branded, so it provides a nice customer experience and works well for Veraset.
So, as Vinoo mentioned, Veraset produces terabytes of data daily, and our customers, before they sign with us, want to evaluate that data, look at it, and run some simple analytics to see its value. And for us, that’s a problem, because if we give them access to our full data set, then we’re giving away our IP. But if we produce subsets of the data and copy those around, that also is a huge security risk. And as Vinoo mentioned, we have no visibility into what they do with those subsets of the data. It can also be quite expensive and cumbersome to produce all those different data samples. So, Databricks and Privacera allow us to give customers controlled access to our live data set in S3.
All right, so the first thing I wanted to show you is how Privacera works. You can see here a standard spark.read.parquet command; we are accessing live S3 data here. In reality, this would be our actual production bucket, but in this case it’s demo data. And what Privacera allows us to do is give access to subsets of the S3 bucket by user. So in this case, I’ve denied access to the April data, and you can see that it’s denied, and if you expand that, you’ll see that Apache Ranger is denying the access. But I’ve allowed access to the March data, and so you can start to see the data that I have here.
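From the outside, the behavior just described looks like per-user allow/deny rules on S3 prefixes, with deny taking precedence. Here is a hedged, pure-Python simulation of that policy check; the rules, bucket, and matching logic are illustrative and are not the actual Privacera or Ranger API:

```python
# Hedged sketch of what a Ranger-style path policy check looks like from
# the outside: per-user allow/deny rules on S3 prefixes, deny wins, and
# anything unmatched is denied. A simulation, not the Privacera API.

POLICIES = [  # (user, s3_prefix, effect) — illustrative rules from the demo
    ("customer", "s3://demo-bucket/visits/month=2021-03/", "allow"),
    ("customer", "s3://demo-bucket/visits/month=2021-04/", "deny"),
]

def check_access(user: str, path: str) -> bool:
    """Deny rules take precedence; unmatched paths are denied by default."""
    matched = [effect for u, prefix, effect in POLICIES
               if u == user and path.startswith(prefix)]
    if "deny" in matched:
        return False
    return "allow" in matched

print(check_access("customer",
                   "s3://demo-bucket/visits/month=2021-03/part-0.parquet"))  # True
print(check_access("customer",
                   "s3://demo-bucket/visits/month=2021-04/part-0.parquet"))  # False
```

The appeal for a broker is that the rule set is just data: adding a new evaluator means appending a rule in a UI, not editing IAM policies or copying files.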
And again, this is a pre-built notebook that we would give to our customers so they can really start to use the data right away. This is how it looks in Privacera. So again, this is by user. We put in the bucket name here and then the object path, and this can be adjusted on the fly in the UI. So it’s very easy for us to give a new user at a potential customer’s firm access to the data, or a subset of the data, and it really helps us get going quickly without messing with complicated IAM policies or complicated permission schemes. We can do all of this right in the Privacera UI. And then it also gives us auditing, and I’m going to show you more of that later. All right. The next part of this demo is about Databricks.
So, even once a customer gets access to our data, a terabyte size data set, it takes some time to start working with it. You need to load it into your environment. You need to bring it up in the analytical tools that you’re using and you need to get used to the schema and how to work with it. So, by giving customers a pre-built Databricks Notebook, kind of pre-built cluster with everything already configured, we can have them run cells and start to work with our data in a matter of minutes.
In this case, imagine you’re at Chipotle. Just by running the cell below, we can take our full data set, which is visits to different locations in the United States, and summarize it by state. Nothing too fancy here, but it just gets the potential customer using our data right away without having to write any code at all. Down below, I go a little bit further. In this case, I’m calculating the average number of visits per Chipotle location per day in March, and then I’m putting it on a map. Again, it just shows how, with a few lines of code, a customer could go from our raw data to a visualization and start to calculate analytics.
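The notebook cells described above would run on Spark against the live S3 data; as a rough, pure-Python stand-in (with invented sample rows, not Veraset’s real schema), the two aggregations look like this:

```python
# Rough pure-Python stand-in for the notebook cells described above:
# summarize visit records by state, then compute average visits per
# location per day. Records and field names are invented sample data.

from collections import defaultdict

visits = [  # (state, location_id, date)
    ("CA", "loc-1", "2021-03-01"), ("CA", "loc-1", "2021-03-01"),
    ("CA", "loc-2", "2021-03-01"), ("TX", "loc-3", "2021-03-01"),
    ("TX", "loc-3", "2021-03-02"),
]

# first cell: total visits per state
by_state = defaultdict(int)
for state, _, _ in visits:
    by_state[state] += 1

# second cell: average visits per location per day, per state
counts, locs, days = defaultdict(int), defaultdict(set), defaultdict(set)
for state, loc, day in visits:
    counts[state] += 1
    locs[state].add(loc)
    days[state].add(day)
avg = {s: counts[s] / (len(locs[s]) * len(days[s])) for s in counts}

print(dict(by_state))  # {'CA': 3, 'TX': 2}
print(avg)             # {'CA': 1.5, 'TX': 1.0}
```

In the real notebook this would be a Spark groupBy over the full dataset with the result dropped straight onto Databricks’ built-in map and chart visualizations; the logic, though, is exactly this simple.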
And for us, it allows us to build a narrative around our data. It allows us to start to show someone how our data can tell a story and how it can add value to your organization. And again, there’s no code being written yet, right? We’re giving a customer this notebook and they can just run the cells. I go a little bit further, and I can filter by state and put it on a line graph. Of course, the customer can start to look through Databricks and see all the different visualization options. Again, the great thing about Databricks is that you can then add a cell and start to write your own code. You can use our code above, open up these cells and look at what we did, and then start to add your own analytics and your own visualizations. The idea here is that we’re telling a story about our data and accelerating your access to and your usage of our data.
The next thing I wanted to talk about was auditability and I know Vinoo mentioned this already. So, Privacera tells us who accessed our data, which user, how often, when, and what specific objects they were looking at. And so here, I show you how this looks in the Privacera UI, me accessing this March data. Below we put a denial on Vinoo and when he tried to access it, you get this denied result and it tells you what’s happening.
So this is powerful for a couple of reasons. First, of course, security. If we want to know who accessed a certain data set on a certain day, we have that in our audit logs. But also, if you produce a subset of data and you copy that to someone, you don’t know what happens with it. You don’t know if they look at it once and never look at it again, or whether they have a team of 10 people looking at it every single day. But by using a solution like this, we get visibility into what’s happening. So we can give someone access to our data, and then we can see how often they’re using it. And that can be quite helpful for our sales team.
The next thing I wanted to show was cluster control. As I mentioned at the beginning, this is a custom Databricks workspace that we built. The big thing for us is that data is never leaving our S3 bucket. It’s not being copied at all. That itself is a big security win. We also control IO in and out of the bucket, in and out of the environment.
So, normally in Databricks, you can go up here to a table and there’s a download button. We disabled that. If you try to write data, like a write.parquet command, you will get an exception, and you can see here that it’s Apache Ranger, that’s Privacera under the hood, preventing that, right? So we’ve locked down the environment. The cluster is also pre-built and sized for our customers, so we control how many resources they are using. And as I mentioned before, that cluster has an instance profile that limits access to only the AWS resources it needs. So this really helps us lock down the environment and give our customers space to explore the data without compromising our own security, which is quite powerful.
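Conceptually, the write denial behaves like a guard in front of every write, with an empty allow-list for the evaluation workspace. A minimal illustrative sketch (this is not the actual Ranger enforcement code, and the function and prefixes are invented):

```python
# Illustrative sketch of the "writes are blocked" behavior described above:
# a guard that refuses any write outside an allow-list, surfacing a denial
# as an exception, the way Ranger does. Not the actual enforcement code.

ALLOWED_WRITE_PREFIXES = []  # evaluation workspace: no writes permitted

def guarded_write(path: str, payload: bytes) -> None:
    """Raise unless the target path matches an allowed write prefix."""
    if not any(path.startswith(p) for p in ALLOWED_WRITE_PREFIXES):
        raise PermissionError(f"write to {path} denied by policy")
    # ... the actual object write would happen here ...

try:
    guarded_write("s3://demo-bucket/exports/dump.parquet", b"...")
except PermissionError as e:
    print(e)
```

Combined with the disabled download button, the effect is that data can be queried in place but never exfiltrated, which is the whole point of the locked-down workspace.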
As I mentioned and as shown here, the combination of Databricks and Privacera can help a data company like Veraset provide potential customers access to our data and let them run some analytics, all without compromising our data security.

Vinoo Ganesh: Thank you so much for attending our presentation. We hope this showed you a little bit about the data brokerage industry, as well as how Veraset as a data broker has solved some of these problems. Thank you for your time. Nick and I are available to answer any questions you have right now. Also, please fill out the feedback form; it is really important for us. Again, we appreciate it and we’re happy to hear any questions you have.

Vinoo Ganesh

Vinoo is CTO at Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Vinoo led the compute team at Palantir Technologies, tasked with managing Spark a...

Nick Chmura

Nick is the Director of Data Engineering at Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Nick has worked in data engineering and analytics rol...