Managing a Multi-Cloud Data and Analytics Platform

May 27, 2021 03:15 PM (PT)

Multicloud adoption is gaining momentum. Gartner predicts that by 2022, 75% of enterprise customers using cloud infrastructure as a service (IaaS) will adopt a deliberate multicloud strategy. Enterprises adopt multicloud strategies for various reasons, such as preventing vendor lock-in, enabling access to best-of-breed cloud services, regional requirements, and so on. However, there are challenges with this model. Every cloud has its way of doing things, and they don’t play nice together. What if you had a data and analytics platform that extended across the major cloud providers? How would you manage it? In this session, you’ll learn how simple it is to manage the Databricks Lakehouse Platform — one platform for data, analytics, and AI across AWS, Microsoft Azure and Google Cloud.

In this session watch:
Sumit Mehrotra, Director of Product Management, Databricks
Carleton Miyamoto, Sr. Staff Engineer, Databricks



Sumit Mehrotra: Welcome everyone to the session on multicloud data and analytics platform. My name is Sumit Mehrotra. I’m the Director of product at Databricks. My co-presenter is Carleton Miyamoto, who is the technical lead for enterprise platform at Databricks. Today, we hope to discuss with you the importance of a multi-cloud data and analytics platform for your organization and leave you with some best practices as you think about implementing a multicloud data platform for your own organizations.
We will start with the meaning of multicloud data and why it is important. We will discuss the attributes of such a platform. We will then look at the architectural pillars and best practices for designing such a platform. We will close out with a demo of Databricks’ Lakehouse multicloud platform illustrating some of the best practices we are putting in place in our own platform.
Let’s dive right in just to put a little bit of historical context on our conversation. You all have been familiar with the cycle of technology as applications and services over the last 50 years have been on on-prem environments and slowly in the last 10 or so years have moved over from hybrid to a cloud paradigm, and now even in the last three to five years, the trend has accelerated even further where organizations are looking at not one cloud, but multiple clouds where their applications and services are running. This trend is only accelerating and behooves all of us to start thinking about such a paradigm where multicloud is a way of life in your organizations, and how you architect your data platforms to serve the needs of your organization.
Now, we asked our own customers a little while back about multicloud and their point of view on multicloud, and these are some of the things that bubbled up. Multicloud is a thing for a lot of our customers, and I’m sure for you in the audience as well, and why that is important is along these lines: data is naturally in multicloud. I’m sure in your organizations as well, there are applications and services that are running in different clouds for whatever reason, and if applications and services are running in different clouds, naturally they’re producing data in those respective clouds. So your data layer naturally exists in multiple clouds, and you have to harness that. Of course, one of the big reasons for you to use different clouds might be the fact that each cloud has its own value proposition and services that it offers you, and you might have applications and services that run best in a given cloud, and that is true from what we’ve heard from our customers as well.
Naturally, when you have a lot of applications and services running in these clouds, you’re spending money with these clouds. So making sure that you’re getting the best bang for your buck, making sure that your spend is optimized, making sure that you have the right position of leverage with the native clouds when you are running a portfolio on these clouds becomes important, and it is important from what we have heard from customers. There are a number of other reasons, but these are the ones that we wanted to highlight and these are the ones that we have heard as top of mind for our customers as well.
Now, all that is easy. Saying multicloud is easy, but implementing it and executing on it is terribly hard, and our customers also agree with us and they are facing some of these challenges. So we wanted to highlight some of the challenges that we have heard. And I’m sure you might have also heard in your organizations as you’re looking at your cloud strategies. Number one reason that we heard was common interfaces, the lack of common interfaces. Common interfaces, and especially open and standard-based interfaces help you build for the future. You’re not tied to specific interfaces for a specific cloud and hence kind of bound by the innovation speeds of a particular cloud. You need common interfaces for that speed of execution, and you also need it for your own organization and velocity of development in the organization where skillsets are not dependent on the native clouds.
You can have tools and services that can span across your organization where data scientists and data practitioners across your organization can depend on those interfaces and tools, regardless of which cloud they’re working on. Similarly, when you’re thinking about your portfolio of applications and services across the organization, running in different clouds, you still want to have common ways of access control, policies, governing data. You want control over who’s doing what, especially on the data side of things, and also on the resources running in the clouds. You want to have a common platform, which makes it easy for you to manage such things as opposed to building bespoke solutions for each cloud that you operate on. Coming to usage, as we said, cost was a big concern for customers: having common visibility into what is happening, how dollars are flowing, how resources are being used, how money is being spent in an optimized fashion.
You don’t want to build pillars, which are isolated from each other, in order to see which cloud is incurring which cost for you. You want a comprehensive common solution that helps you have visibility across your organization on your spend. And lastly, as you are looking at workloads that you’re running over time, your workloads change, the access patterns change and even clouds change, but you want to retain the flexibility of which workload runs on which cloud in the most performant manner, so that you have control over the innovation that you’re making. And you want a platform that makes it easy for you to execute on those technological decisions rather than having your hands tied together.
So these are some of the challenges that have bubbled up in our customer base. And we wanted to present to you so that you have these in front of you, as you are looking and thinking about multicloud strategies in your own organizations, but in order to illustrate these further, we wanted to paint a picture for you. We wanted to paint a picture for your own businesses. These could be your businesses in future, as you are growing, as your organization is growing, you may be operating in one cloud in one geography, but the day isn’t far where your business has grown, you’re looking at that level of scale. You’re operating in multiple geographies, maybe even operating on multiple clouds. Now we wanted to paint a picture of such a future and illustrate some of the things that you might have to think about as you get onto a path for that future.
So I’ll present two such pictures of the future. This is the first one; we refer to it as the multi-geo business. So imagine a business where you have multiple business units. They may be running independent lines of business producing value for their organizations, and they make independent choices on the technological platform that they’re going to run the applications and services on. And hence, as we discussed, the data is also being produced based on the technical choice that [inaudible] cloud that they have made a bet on. Now that is great. That gives them the independence, but there is a central data platform team in the organization whose charter is to support all of these business units from a platform perspective, working on the common elements that you need from the platform. As a business, these business units might be running in their own silos and there might be limited data sharing between these business units in this vision of the future, but there is a common platform that they all depend on.
And that leads us to some of the attributes of such a platform for such a business. For example, even though you’re running in different business units, on different clouds, you still have data scientists, business analysts, data engineers, machine learning engineers in every organization, and the skills and the tools that they are used to, and that make them productive, are the same regardless of the business unit that they work in. So you are looking at a platform that should support those kinds of open interfaces and open standards that enable them to work on any platform, and a common portfolio of tools for each such data practitioner. From a data platform team’s perspective, you’re looking at having common access control, security and governance across all of these business units. You wouldn’t want your policies to be different for different units, and hence you want to put in place technological constructs in your platform which allow you to implement those policies across your organization in a consistent manner.
And similarly, since you have a centralized team… you may have a centralized organization who’s paying the bills at the end of the day for cloud platforms. You want common visibility into the usage, costs and telemetry that is emanating from the platform that is running for these business units in different locations in different clouds. So this is one such business in the future, and you can imagine yourself in that position. Now I want to highlight another vision of the future, where we have a similar business operating in a slightly different manner. In this case, this is an integrated multi-faceted business. Now this is multi-faceted because this business has a value chain that has various parts, and various parts of your organization might be focused on different parts of the value chain, producing business value for the organization in each part. In this case, we see there’s a part of the value chain in this organization where there are business systems and point-of-sale solution applications running on a cloud.
In this case, Azure. Of course, these applications are producing their own data about customer transactions that are happening when this business’s customers are buying products that are sold by this business in various locations. Now move forward in that value chain and at some point you reach the other part of the value chain, where you as a business are producing advertising. You’re marketing to your users. You’re running promotions and offers for your users and hence you need an ad platform. And in this case, the business has chosen to run the ad platform on Google Cloud, and of course, as you’re running on Google Cloud and you’re running these massive advertising campaigns and promotions and offers, you’re producing data which is related to your advertising, and you’re getting insights from the advertising campaigns that you’re running against your users.
Now, obviously when you’re running advertising campaigns against your users, you want the effect of it. You want it to result, in the end, in purchases that are happening on the business side of your organization. Now this should lead you to the final point of this slide, which is: as an entire organization, it’s great to work on different parts of the value chain, extract value and do the best for that particular business, but as an organization there’s an untapped value where you can integrate parts of your value chain and extract more value and create more value for your business.
For instance, in this case, wouldn’t it be exciting if you could take the advertising insights that you’re getting and hand them over to the other side of the business, which is on the transaction side, and take the transaction data, where customers are actually buying things from you, and feed that back into advertising?
So you can make your advertising more personalized, more targeted towards your user, so that it results in transactions that are happening for your individual users in a more effective manner. So you can see that virtuous cycle that starts up in the organization when you’re in a position to extract data and insights from different parts of your value chain, which may be running for many different reasons and many good reasons in different clouds.
Now, in this vision of the future, we get another set of requirements for a data platform to support such a business. In this case, you need secure data access across clouds because you’re accessing data from different parts of your value chain from different clouds. Now, the data sets might be huge, and transferring them from one cloud to another cloud has physics. It requires time, and it requires money. There are egress costs that every cloud imposes on data being egressed from it.
So you want to build a platform that optimizes for size of data movement across clouds. You don’t want to be in the business of moving massive amounts of data to the cloud in order to extract these insights. In fact, you want to take it a step further: you want the platform to be smart enough to figure out where to run a given query or a given model so that it is closest to the data that it is operating on, and smart enough to collate data and bring it back together to give you the insights that you need across those clouds.
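To make the idea of running a query closest to its data concrete, here is a minimal sketch, not anything shown in the session. It assumes you already know roughly how many bytes of a query's input live in each cloud; the function name `best_cloud` and the byte counts are hypothetical.

```python
def best_cloud(bytes_by_cloud):
    """Pick the cloud that already holds the most input data.

    bytes_by_cloud maps a cloud name to the bytes of query input stored
    there (hypothetical numbers supplied by the caller). Returns the
    chosen cloud and the bytes that would still have to cross clouds.
    """
    if not bytes_by_cloud:
        raise ValueError("no input locations given")
    total = sum(bytes_by_cloud.values())
    # Sorting the names first makes ties resolve deterministically.
    cloud = max(sorted(bytes_by_cloud), key=bytes_by_cloud.get)
    return cloud, total - bytes_by_cloud[cloud]


if __name__ == "__main__":
    placement = {"aws": 500, "azure": 2000, "gcp": 100}
    print(best_cloud(placement))
```

A real planner would also weigh per-provider egress pricing, compute cost and compliance rules; this only captures the "move the least data" part of the argument.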
So I’m hoping that with this picture of two similar but very different businesses operating at scale in the future, you start asking yourself the same questions as to where your business is going and what are the things that you need in place today in order to do that [inaudible] from your data platform. To summarize these requirements: there might be a hundred such requirements that come out, but I wanted to highlight three really big ones, so that if there’s nothing else you take away from these visions of the future, you have these three to center around.
One, you need open and standard interfaces in the multi-cloud data platform that you want to leverage. That enables your organization to work seamlessly and easily across the board. You want common tools, access control, security and governance. These are bedrocks of providing a seamless experience for your users: your data scientists, business analysts, data engineers and machine learning engineers. To make all of them productive, you need common tools and a common platform.
And lastly, you need to make it easy for these various users to have access to the data assets across your organization, across the roles that they might have in your organization and across the clouds that your organization might be operating in. You need to make access to the data assets really simple and easy. With this, we’ll turn it over and look at some of the architectural underpinnings and the best practices that emanate in order to implement these requirements in your data platform.

Carleton Miyamo…: Hello, I’m Carleton, I’m the Tech Lead for the Enterprise Platform team at Databricks. Today, I want to extend part of what was introduced about multicloud. Let’s look more deeply at what multicloud actually means for your data architecture. Multicloud requires unique design considerations. If you’re working in a hybrid on-prem/cloud environment, or are fully on the cloud supporting multiple regions, you’ll recognize some of these issues. However, there are additional factors and considerations to build upon when you introduce additional cloud vendors.
To better understand what multicloud means, let’s look at multicloud requirements in terms of three W’s: who, what and where. First is who. Who am I to my application? How is this identity used in a multicloud environment? After all, identity determines the services, the data and the cloud resources I have access to. What? What data assets do I have: ingested data, derived data, streaming data, third-party data? What models are being run, and what data do they produce and consume? In addition to data assets, what cloud resources do I use? And am I using them efficiently?
Finally, where? Where are my data assets? Are they located where they can be used without excess copying? When I do move data, am I following compliance rules? These are the questions to ask when designing a multicloud project. So first let’s take a look at the role that identity plays. All of you have daily dealings with identity. Identities are handled by an identity provider, or IDP for short. The IDP manages users within your organization and lets you log in to your company services like wikis or messaging.
It maintains groups, roles, and permissions for access to datasets. Non-user identities, like headless users or service principals for automating workflows, are also part of it. Now, there are many IDPs out there, and companies often choose a cloud-based one. This includes providers like Okta and OneLogin. Google has GCP Identity Platform and Microsoft has Azure Active Directory, or AAD for short. When an employee wants to access some company service, they will log in and the IDP generates access tokens, which then get passed in to authenticate the request with the service. The IDPs of some cloud providers have an additional responsibility besides access to your company services.
They also control what cloud resources you have access to. For example, to get access to Azure ADLS storage, AAD will issue a special access token to enable you to call a storage API. However, there’s a problem. There is no standard to access cloud resources from an IDP. This means that IDPs are not interchangeable. Let’s take ADLS as an example again. Say you want to access an ADLS storage in Azure from your Spark application. In order to access your data, you need to present an AAD access token to ADLS. GCP and Okta access tokens will not work, even though all the access tokens follow the OAuth2 standard. The token must have come from AAD. In GCP there’s a similar problem. Google identity tokens need to be used to gain access to GCS; AAD and Okta tokens are not accepted.
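The point that tokens are tied to their issuer can be sketched as a small pre-flight check. This is an illustration only, assuming the `abfss://` and `gs://` URI schemes identify ADLS and GCS paths; the mapping table, issuer labels and function names are hypothetical and nothing here calls a real identity provider.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: which issuer's tokens each storage service
# accepts, following the two examples given in the talk.
ISSUER_BY_SCHEME = {
    "abfss": "AAD",    # Azure ADLS accepts only AAD-issued tokens
    "gs": "Google",    # GCS accepts only Google-issued tokens
}


def required_issuer(storage_uri):
    """Return the issuer whose token is needed for this storage URI."""
    scheme = urlparse(storage_uri).scheme
    if scheme not in ISSUER_BY_SCHEME:
        raise ValueError(f"no known issuer for scheme {scheme!r}")
    return ISSUER_BY_SCHEME[scheme]


def token_is_usable(storage_uri, token_issuer):
    """True only if the token came from the IDP this storage expects."""
    return required_issuer(storage_uri) == token_issuer


if __name__ == "__main__":
    uri = "abfss://data@account.dfs.core.windows.net/raw"
    print(token_is_usable(uri, "AAD"))   # matching issuer
    print(token_is_usable(uri, "Okta"))  # valid OAuth2 token, wrong issuer
```

A check like this fails fast in a pipeline that spans clouds, before a request is rejected by the storage service itself.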
Now, if you look at non-user identities like service principals, you actually run into a similar problem. AAD has service principals and managed identities. AWS has IAM ARNs. GCP has service accounts. They’re also not interchangeable, and they’re managed only by their cloud IDPs. Since a common use case for non-user identities is in API calls, automating your data pipelines in multiple clouds will require you to manage many types of non-user identities.
So if multicloud is in your future, at least for now, you need to plan on integrating with each cloud provider’s IDP. There is some support from the cloud providers. GCP offers an identity federation feature to wrap your IDP and get a GCP access token. Similarly, some third-party IDPs like Okta have support for integrating with AAD. This is an area where you need to work closely with your IT departments on what would work for your situation. And finally, as always, stick to the open standards that are universally accepted: SCIM, SAML, OIDC, OAuth2. It will simplify any integration with any IDP.
Now we move to the what. Many organizations do not know their datasets as well as they should. Do you know all of the datasets you own? Do you have schemas on all your datasets? What if I asked the same questions about all your files that contain derived datasets? If you need to modify a dataset or remove old files, do you know who owns it? Do you know who to reach out to? And if you did make changes, how sure are you that something won’t break or some model would silently produce bad results? Next, do you know all the cloud resources you use? Storage, networks, VMs. Each cloud has its own management APIs. As we saw with identity, management means a separate integration with each cloud provider. Again, it all adds to the complexity of working with multiple clouds.
If you think about it, much of what I said here is not really multicloud specific. You should be doing all of this, whether you are using multiple clouds or not. However, in a multi-cloud world, it becomes more important, and we’ll get back to why this is the case later in the presentation. For now, let’s look at how we can address the problem. There are many ways to gain a better understanding of your datasets. One option to consider is to make sure you have an up-to-date and correct inventory of your assets. This process should be periodic and automated, as this will help to ensure consistency and correctness over time. As part of the process, make sure to have owners for each dataset, and a schema definition as well. This is where metastores can really help you to track everything, and they will also give you programmatic access to the inventory.
This is why they are so important to any solution for this problem. For your models and jobs, as with datasets, find owners and also track dataset consumers and producers. This will help you to rationalize what happens when datasets change and prevent breakages. For cloud resources, again, assign ownership and track metadata. As you can see, there’s a common theme here. An additional benefit to doing this is that you can start to generate automated tooling around resource management. For example, you can write tools to integrate your inventory and metadata with AWS CloudFormation and Azure ARM templates to automate your deployments. There are many ways to address this problem. These are just a few that will help you.
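The inventory practice described above (owners, schemas, and tracked producers and consumers) can be sketched with plain data classes. This is an assumption-laden illustration rather than a metastore API: `DatasetEntry`, `audit` and `impacted_by_change` are hypothetical names, and the sample entries are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One inventory record; all field names are illustrative."""
    name: str
    cloud: str
    owner: str | None = None
    schema: dict[str, str] | None = None
    producers: list[str] = field(default_factory=list)
    consumers: list[str] = field(default_factory=list)


def audit(inventory):
    """List entries that are missing an owner or a schema definition."""
    problems = []
    for entry in inventory:
        if not entry.owner:
            problems.append(f"{entry.name}: no owner")
        if not entry.schema:
            problems.append(f"{entry.name}: no schema")
    return problems


def impacted_by_change(inventory, dataset_name):
    """Return the jobs and models that read a dataset, i.e. what might break."""
    for entry in inventory:
        if entry.name == dataset_name:
            return sorted(entry.consumers)
    raise KeyError(dataset_name)


if __name__ == "__main__":
    inventory = [
        DatasetEntry("ratings", "azure", owner="ratings-team",
                     schema={"title": "string", "rating": "double"},
                     consumers=["title_rollup", "sales_forecast"]),
        DatasetEntry("ad_clicks", "aws"),
    ]
    print(audit(inventory))
    print(impacted_by_change(inventory, "ratings"))
```

Running `audit` on a schedule is one way to keep the inventory periodic and automated, and `impacted_by_change` answers the "who do I reach out to" question before a dataset is modified.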
Finally, let’s discuss where. Where data resides will have a significant impact on your costs, especially if you’re accessing data inefficiently. Cloud providers will charge you for moving data between regions, and a multiple of that for data egress out of their clouds. Also, something that’s not always considered: moving large files takes a long time and can actually result in data loss. There are many times where I’ve seen a file copy across regions fail or perform a partial copy. If you’re not very careful about error checking, these can easily go unnoticed. So it makes sense to carefully plan where data should reside.
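One way to keep failed or partial copies from going unnoticed is to compare checksums of source and destination after every copy. The sketch below uses local files as a stand-in for a cross-region or cross-cloud transfer; the helper names are hypothetical, and a real pipeline would use the storage service's own checksum or integrity metadata where available.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst and raise if the destination content differs."""
    src, dst = Path(src), Path(dst)
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        # Remove the bad copy so nothing downstream reads partial data.
        dst.unlink(missing_ok=True)
        raise IOError(f"copy of {src} to {dst} failed verification")
    return actual
```

The returned digest can be stored next to the asset in the inventory, so later audits can confirm that the copy has not changed.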
In addition to technical issues, there are also compliance concerns, both from company policies and governments. Care needs to be taken that data is not allowed to move where it should not reside. Also, datasets and any derived datasets need to be found and cleaned up. GDPR and CCPA are examples of government policies you should be aware of and how they impact your data. You will need to ensure that you can meet these requirements and be able to show reasonable proof if audited. So how do we take into account location awareness when designing our systems? Getting back to why understanding your data assets is important (remember, we said we would get back to that a few slides ago): the first step is understanding your data assets and how they are produced and consumed. This lets you plan around optimizing your data placement for efficient use. Next, compliance concerns.
As you’re putting together your schemas, add compliance annotations. Annotations for PII, highly confidential data, or data that should be geo-fenced will let you review and, more importantly, automate your policy checks going forward. Finally, look at your models. Look at your real-time queries. See if you can optimize the algorithms to reduce data egress. You may decide to use a different model for generating derived datasets. This decision is a trade-off between increased compute to build the derived datasets versus the amount of data that needs to be sent across clouds. Consider your specific use case to see what trade-off works best. Also look into areas like federated learning and see if you can apply those concepts to your models.
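A minimal sketch of what compliance annotations on a schema might look like, and how a policy check could be automated from them. The tag names, the `allowed_regions` field and the two rules are assumptions made for illustration; they are not a Databricks feature or a statement of what GDPR or CCPA require.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    tags: frozenset[str] = frozenset()  # e.g. {"pii"}, {"highly_confidential"}


@dataclass(frozen=True)
class Schema:
    dataset: str
    columns: tuple[Column, ...]
    allowed_regions: frozenset[str] | None = None  # None means no geo-fence


def move_violations(schema, target_region):
    """Return policy findings for moving this dataset to target_region."""
    findings = []
    fenced = schema.allowed_regions is not None
    if fenced and target_region not in schema.allowed_regions:
        findings.append(
            f"{schema.dataset}: geo-fenced, may not move to {target_region}")
    for column in schema.columns:
        if "pii" in column.tags:
            findings.append(
                f"{schema.dataset}.{column.name}: PII, review before moving")
    return findings


if __name__ == "__main__":
    clicks = Schema(
        dataset="ad_clicks",
        columns=(Column("user_email", "string", frozenset({"pii"})),
                 Column("ad_id", "string")),
        allowed_regions=frozenset({"eu-west-1"}),
    )
    for finding in move_violations(clicks, "us-east-1"):
        print(finding)
```

Because the annotations live with the schema, the same check can run in every cloud's pipeline before any copy job is scheduled.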
So let’s take a closer look into how optimizing for location can affect your models. In this example, data ingestion is occurring in all clouds. Think about ad impressions or click events streaming in, or it may be an ETL job from a database. It could be regional sales and marketing data being pulled together for forecasting or business development. A straightforward way to add multicloud support is to simply copy or stream all the data to one cloud, run your model to create the derived dataset, then serve it. This requires all the data to egress into one cloud. And as we talked about earlier, this may be something you want to avoid.
Instead, consider if the model can be split into a local model and a federated model. The local model computes a derived dataset based on the cloud-local data; the federated model then combines the data. The assumption is that the derived datasets are much smaller than the total ingested data, and that the cost of the additional compute to create the derived datasets is less than the data egress cost out of the cloud. The same considerations also apply when running ad-hoc queries, such as generating reports or performing some data analysis. Evaluate if the query can be split. Also look into pre-creating derived datasets, if the queries are run often enough to justify the cost. So if there’s one takeaway from this presentation: when implementing multicloud, remember to ask yourself the three W’s, who, what and where.
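The local/federated split can be illustrated with a simple aggregation. In this sketch, which assumes click events shaped as dictionaries with an `ad_id` key (a made-up format), each cloud reduces its raw events to per-ad counts, and only those small counts cross clouds to be combined.

```python
from collections import Counter


def local_model(events):
    """Runs inside one cloud: reduce raw click events to per-ad counts."""
    return Counter(event["ad_id"] for event in events)


def federated_model(local_results):
    """Runs in one place: combine the small per-cloud derived datasets."""
    total = Counter()
    for partial in local_results:
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    raw_by_cloud = {
        "aws": [{"ad_id": "a"}, {"ad_id": "a"}, {"ad_id": "b"}],
        "gcp": [{"ad_id": "b"}, {"ad_id": "c"}],
    }
    partials = [local_model(events) for events in raw_by_cloud.values()]
    rows_moved = sum(len(partial) for partial in partials)
    raw_rows = sum(len(events) for events in raw_by_cloud.values())
    print(federated_model(partials), rows_moved, raw_rows)
```

Counts combine exactly because addition is associative; models that do not decompose this cleanly need a different split, which is where the compute-versus-egress trade-off from the talk comes in.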

Sumit Mehrotra: So I’m hoping our discussion on architecture gave you a few best practices to start thinking about as you think about your multicloud data platforms. Now I want to talk about how, in reality, we are implementing such a platform at Databricks. We call it the Databricks Lakehouse Platform. I wanted to give you a little bit of a primer on what we are doing, aligned with the way we have heard our customers talk about the multicloud future and the way we see multicloud data platforms evolving. And to start off, it’s not just me, but our entire organization is thinking about it, including Ali Ghodsi, who’s our CEO, and he strongly believes in the statement. In order to give you an overview of the Lakehouse platform, I want to touch upon just four high-level things, which should help you align the way we are thinking about Databricks itself and how multicloud data platforms are evolving.
This is Databricks in a nutshell. We are a Lakehouse platform, which means this is a platform on which you can bring together data from all different sources, regardless of whether they’re structured, semi-structured, unstructured or streaming data, and this is a platform on which you can have all your users, regardless of whether they are data scientists, data engineers, business analysts or machine learning engineers. All of them can work on the same platform on the entirety of the data that is available in your organization in one place, and we call it the open data lake and the lakehouse paradigm. There are a couple of things that are really important to keep in mind as you start to understand what the Databricks Lakehouse Platform is.
First and foremost, it’s available on all major clouds. That’s where the basics of a multicloud data platform start. You have to be available on all the major clouds. Otherwise, you won’t be a multicloud data platform. Databricks is available on AWS, Microsoft Azure and Google Cloud. We have built Databricks with the core of visibility, control, security and governance built right into the platform, regardless of whichever cloud you’re using. So the constructs we were talking about, where in your organization you need common mechanisms for control and governance across your organization, regardless of where your applications and services and data are: we have built that right into the platform. So it’s an underpinning of our platform.
Databricks is built on open standards and interfaces. You may have heard of Spark, Delta, MLflow. These are platform constructs, these are platform components that are central to Databricks’ Lakehouse platform, and these are all built on open standard interfaces. So regardless of whichever cloud you intend to build your data applications on, you get the benefits of the same interface and the same platform. And lastly, in terms of experience, Databricks provides the experience to all personas, all data practitioners in your organization, in a very similar fashion, regardless of which cloud you’re operating Databricks on.
This has immense advantages as you think about scaling the skill sets that you need in your organization to really leverage the power of data. If your users can learn the skills and apply the same skills, regardless of whichever part of the organization they are in, whichever application they are working on, and whichever part of the data they are working on, it gives a huge boost to your productivity, considering they don’t have to relearn the skills as they operate at scale. So these are four important tenets of the Databricks Lakehouse Platform, which is a multicloud data platform. And now we’ll look at Databricks in action to crystallize some of these things.

Carleton Miyamo…: Let’s imagine that I’m a tech lead for a project in my company. The project will use multiple sources of data from various teams across our organization. The plan is to combine features from movie rating data owned by our team, European soccer data owned by the sports division and ad click data owned by our ad agency to improve sales forecasting for one of the business divisions. Now the ratings team is in the Northwest office and they use Azure Databricks. They have collected datasets from partner sites, and these have been uploaded into the Databricks DBFS file system. So let’s take a look at them.
Here you see a bunch of ratings data that have been uploaded, and schemas for that data. Now, if you go back here, you’ll also see a bunch of derived datasets around titles. Let’s take a look at some of the workspaces and notebooks that create those derived datasets. We go to users and look at one of the notebooks. Here you see a bunch of Scala code and SQL code that creates those datasets. In addition to that, there are also jobs that run daily that will process data that was received every day around the ratings, and here you see the jobs. There are two of them running, and they will create additional tables as data comes in. So that’s what’s happening within our team. Now, the ad agency that we work with is in New York and they use AWS Databricks. Let’s see what they have.
So as you can see, they have their own set of notebooks and their own set of data, and these are all residing in another cloud. Now, finally, the third part of the equation is the sports team, and they are using GCP Databricks. So let’s look at what they have, and here again, you see their notebooks and their datasets. So now, as a project lead, I want to do a technical review to see how all the pieces fit together. Having an inventory of all assets from all the workspaces will help to do this. So I wrote this script. Let’s take a look at it. It uses the Databricks API to fetch asset information to my local machine, so I can conduct my review. We go through the script: it starts off with some utility functions, and then there are a bunch of functions that grab data using the Databricks API. Here, it’s grabbing a list of DBFS file assets, jobs, notebooks, clusters, and instance pools. Finally, if we scroll down to the end, it’s building a complete inventory and writing the output to a file so that I can reference it later.
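The script itself is not reproduced in the transcript, so the following is only a sketch of how such an inventory tool could be written, not the presenter's code. It assumes the Databricks REST API 2.0 list endpoints for clusters, jobs, instance pools, DBFS and the workspace, with a personal access token sent as a bearer token; the starting paths (`/` and `/Users`), the non-recursive listing and the absence of pagination are simplifications, and the endpoint paths should be checked against the current API reference.

```python
import json
import urllib.parse
import urllib.request

# name -> (endpoint path, query parameters, key holding the list in the reply)
ENDPOINTS = {
    "clusters": ("/api/2.0/clusters/list", {}, "clusters"),
    "jobs": ("/api/2.0/jobs/list", {}, "jobs"),
    "instance_pools": ("/api/2.0/instance-pools/list", {}, "instance_pools"),
    "dbfs_files": ("/api/2.0/dbfs/list", {"path": "/"}, "files"),
    "workspace_objects": ("/api/2.0/workspace/list", {"path": "/Users"}, "objects"),
}


def http_get(host, token, path, params):
    """Issue one authenticated GET against a workspace and decode the JSON."""
    query = "?" + urllib.parse.urlencode(params) if params else ""
    request = urllib.request.Request(
        f"{host}{path}{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def build_inventory(host, token, fetch=http_get):
    """Collect one top-level listing per asset type for a single workspace.

    fetch is injectable so the function can be exercised without a network.
    """
    inventory = {}
    for name, (path, params, key) in ENDPOINTS.items():
        payload = fetch(host, token, path, params)
        inventory[name] = payload.get(key, [])
    return inventory


def write_inventory(inventory, filename):
    """Persist the inventory as JSON so it can be reviewed later."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(inventory, handle, indent=2)
```

Because the workspace URL and token are the only inputs, the same function can be pointed at an Azure, AWS or GCP workspace in turn, which is the property the demo relies on.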
Let’s go back to the command line and let’s actually run this. So as you can see here, this is running against the Azure workspace, and it uses an API token that I had pre-created so that I have access to the Databricks API. So in the summary, you see that it found a bunch of files, jobs and notebooks, and the notebook data contains much of the source code. So let’s take a look at what the output was. And here you see a bunch of data about the assets: the instance pools that I have, the notebooks along with their source code, as well as a list of all DBFS files that I’ve loaded into the workspace, and the jobs along with when the jobs run (you see the cron expression), and the list of clusters and VMs that I’m using.
So this is great, but this is only Azure. What about the other clouds? So let’s run this script against them as well. So remember that this is the same script, same APIs that are being used to grab the data. And once all of these run, then I should have a complete set of assets from all of the clouds. So here AWS has finished. You can see it’s found a bunch of files and notebooks, and now GCP has finished. Taking a quick look at the files, you can see that, yes, they’re all there, they all have data, and now I have the data I need to do my technical review. So that’s a small glimpse of what the Databricks platform can do as you consider your multicloud options. I hope the demo got you thinking about how you manage your assets. Thank you for attending, and I hope you enjoy the rest of the summit.

Sumit Mehrotra: Well, I hope you enjoyed this presentation on multicloud data platforms and how we at Databricks are thinking about it. We are super excited about the future of multicloud data platforms and we live and breathe it day in, day out. And we would love to hear from you, get your feedback and questions and love to partner with you as you get along this journey and see how we can help you.

Sumit Mehrotra


Sumit is the Director of Product Management at Databricks, responsible for product strategy, roadmap and execution for multi-cloud Administration, Devops and Growth areas of Databrick’s Lakehouse pl...

Carleton Miyamoto
