Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks

Download Slides

The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software. I will explain some of these challenges based on my experience at Databricks, a startup that provides a data analytics platform as a service on AWS and Azure. Databricks manages millions of VMs per day to run data engineering and machine learning workloads using Apache Spark, TensorFlow, Python and other software for thousands of customers.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– I spark and summit and AI attendees. I’m glad you could join me for my virtual talk. Today I’ll be you talking about some of the lessons we learned from building Databricks, large scale multi-cloud data platform. First, an introduction, I’m Jeff Pang and I’m a Principal Engineer at Databricks in our Platform Engineering team. Databricks helps data teams solve some of the world’s toughest problems. And the Databricks platform team’s mission is to provide a world class multi-cloud platform to enable us to expand fast and iterate quickly so that we can continually improve our product for data teams. If you’re interested, you can find more information about our platform team at databricks.com/careers.

As you may know, Databricks was founded a few years ago by the original creators of Apache Spark. We provide a data and AI platform that now serves more than 5,000 customers in a variety of industries. And we’re still startup but we’ve grown to over a thousand employees, over 200 engineers and now have over 200 million annual revenue and we’re still growing.

Our product

Our data platform provides unified analytics and AI to data scientists, data engineers, and business users. It includes a data science workspace that integrates tools like notebooks, emo flow and tensorflow. It also has a unified data service built on the best in class deployments of Apache Spark and data Lake. And it’s built on top of an enterprise cloud service. that’s simple, scalable, and secure because we manage all the operations for our customers.

So what’s in this talk, first I’m gonna take you inside our a unified analytics platform and show you a bit about how we architected our multicloud data platform. Then I’ll describe three challenges and lessons that we learned while building. First, how we grew and evolve our data platform over time. Secondly, how to operate successfully in multiple clouds, and finally how we use data in AI to accelerate the data platform itself.

But first let’s look at the architecture of the Databricks data platform.

Simple data engineering architecture

So in a simple data engineering architecture you might use a single spark cluster to process your data from your data Lake which could be data stored in S3 or HDFS for example. You might take this through bronze, silver, and gold stages of a data processing pipeline to refine the data for use. And finally you’d have analytics and reporting applications on top of this data to actually provide that value for your users.

Modern data engineering architecture DELTA LAKE

Now, most modern data platforms actually involve many different data pipelines not just a single one. They would involve streaming data in addition to your data Lake, it might involve scheduling more complicated workflows where the output of one job is processed by the input of another. You probably use a modern data format like Delta and your applications would involve streaming analytics, notebooks, machine learning, AI, and probably more. And to do all these things, you probably need many different spark clusters for scale and to manage the spark clusters you’d probably need some cluster management system like mesos or Kubernetes. This is the type of architecture that main organizations are using or trying to build in order to get more and more value out of their data.

Multiply by thousands of customers…

The Databricks data platform provides this for thousands of our customers. We manage a control plane which handles the analytics, the collaboration, AI workflows, cluster management, reporting, business insights, security, and more so that our customers don’t have to manage that themselves. And we manage many spark clusters in our customer’s preferred networks to process the data from all of their existing streaming and data lakes.

many regions…

Furthermore, our data platform is deployed in many regions throughout the world because we have customers deployed everywhere and it’s critical for the data platform to be close to the data.

be on multiple clouds..

And because our customers have their data in more than one cloud, we replicate these deployments in several clouds and integrate with the best features of each of these clouds such as AWS and Microsoft Azure. The Databricks data platform base is a global scale multi-cloud platform that manages the data platform for thousands of customers around the world.

> millions of VMs managed per day al

As a consequence, the Databricks data platform manages millions of virtual machines on spark clusters every day. As you can see from this graph of VMs managed per day we anticipate that the scale of our data platform will continue to grow pretty rapidly.

That’s the Databricks control plane

That’s the Databricks control plane data plan, the control plane of our data platform in a nutshell. And that’s what will be the subject of this talk. Our control plane has hundreds of thousands of users, hundreds of thousands of spark clusters every day, millions of VMs per day, and processes exabytes of data each day. Our data platform supports everyone from university students just trying out spark for the first time in our free community edition to fortune 500 companies with thousands of users and many complex workloads. In the remainder of this talk I’ll discuss some of the big lessons that we learned from building this large scale multi-cloud data platform.

The first challenge I’ll discuss is how we grew our software as a service data platform over time.

Evolution of the Databricks control plane

As a startup, we obviously didn’t start out with a global scale multi-cloud data platform. In fact, we started off with something pretty small. The challenge was how to grow a data platform that provided a lot of value for one customer to thousands all within a span of a few years.

Since we manage the data platform for our customers and we wanted to keep expanding the capabilities of what our customers could do with it. We learned that the data platform itself isn’t actually the most important thing actually, it’s the factory that builds and evolves the data platform its more important than the data platform itself. Because that allows us to rapidly provide more value and support more use cases in our data platform over time. And this has happens all in a way that’s transparent for our users.

So what were some of the keys to success in building a great data platform factory. First, we needed to quickly deliver value to the market. For example, we provided a version of one of our data platform which included some analytics tools like notebooks. But very quickly had to provide a version two which included more features like job scheduling and management. The challenging quickly iterating on our data platform is that our users are using and relying on it at the same time that we’re trying to change it. So we need to be able to make sure we don’t break things while we’re continuously changing this data platform. The keys to success in establishing this virtual cycle, are our modern continuous integration infrastructure, fast developer tools, and a lot of testing. For example, at Databricks we make heavy use of Scala, Bazel and React JS for our developer platform. We spent a lot of time optimizing our Scala builds to get up to 500 X speed-ups compare it to default tooling for example. And we run tens of millions of tests every day, to make sure that our changes don’t break things. Our developers also create hundreds of Databricks in a box full control plane environments to the every day so that they can develop test and demo new features. This allows us to keep adding new features or data platform very rapidly.

Expand the total addressable market

Secondly, we need to very quickly expand the total addressable market by replicating our control planes in many environments. The challenge is that each environment is slightly different. There are different clouds, different regions and different data governance zones that require different configurations. So this quickly becomes very complex if we want to support all these different environments and iterate quickly to update them with the latest features. The keys to success to expand quickly for us was to focus heavily on declarative infrastructure where we rely solely on templates that describe our different control planes and use a modern continuous deployment infrastructure to deploy them.

At Databricks we use a templating language called jsonnet and Terraform and Spinnaker to form our deployment pipelines. This allows us to express over 10 million lines of configuration for example and many fewer lines of code. And it allows us to deliver new features in our data platform globally several times each month.

Land and expand workloads

Finally, we wanted to expand the workloads that our customers could run on Databricks after they’ve adopted our data platform. This meant that we needed to expand the scale at which our data platform could operate and simultaneously scale the rate at which new features could be added to the data platform over time. As more and more of our engineers developed features for a data platform we needed to make sure that they weren’t reinventing the wheel and building everything that everyone else had already built. The first key to success in scaling quickly within customers who had already adopted our data platform was to have a service framework that did much of the heavy lifting for all data platform features. Things like container management and replica management, APIs, RPCs, rate limits, metrics, logging, secret management et cetera. These are things that all features and all teams that Databricks need in order to build their features from notebooks to SQL to ML flow. These are all pieces of our data platform but they aren’t really core to their functionality. The second key to success was to decompose our monolithic services into microservices. For example, when we first started our cluster manager service which manages all our spark clusters was a single machine and it just managed a few hundred DMS for our customer. But to support the millions of yams that we manage today we needed to break it apart into different core functions so hat it could scale. A well rounded service framework was the key to doing this rapidly so that we could spin up new services with different functionalities very quickly.

The Databricks data platform factory

So in summary the Databricks data platform factory allowed us to iterate on our data platform very quickly, rapidly replicated in many regions and many clouds and expand the scale and breadth of workloads that it could support. One of the great things about having a solid platform factory is that we often use it to improve the factory itself as well. You can see from this diagram that we use a lot of cloud native open source technologies that Databricks everything from Envoy and GraphQL for RPCs and APIs to Kubernetes for container management and Terraform and Spinnaker for deployments. But we didn’t actually start out with all these technologies back when Databricks first started many of these open source projects didn’t even exist yet, but by taking a factory approach to our development process, we continuously retool our factory incorporate and extend the best tools out there over time. Essentially in using our factory to improve and refine itself in addition to our data platform.

Next, let’s talk about how you’re on Databricks on multiple clouds.

Databricks runs on multiple clouds including AWS and Azure. Why do we do this? Well, because the data platform needs to be aware where the data is. Putting the data platform in the same data center as the data impacts performance latency and data data transfer costs. It allows us to integrate with the best features of each cloud and allows us to adhere to the data governance policies that our customers demand from our desired cloud. The biggest challenge in supporting multiple clouds is to do so without sacrificing the velocity of feature development or the data platform itself. As I just mentioned iterate and quickly with the key to extending and expanding the data platform. And we didn’t wanna sacrifice that. What we found is that a cloud agnostic platform layer is the key to maintain developer velocity, but this layer also needs to integrate with the standards of each cloud and manage their quirks. So the next slides I’ll discuss what I mean.

Challenge: dev velocity on multiple clouds

The developer experience on multiple clouds is pretty divergent. So it’s pretty hard to build the same thing on each cloud. For example, many cloud services have no direct equivalents. For example in AWS, AWS has a great scalable key value document store called DynamoDB. But the same interface really doesn’t exist in Azure. Cloud APIs also don’t look anything like each other, even the authentication and access control on each cloud is very different. Operational tools for each cloud is also very different and you can’t even consume logs in the same format. This makes it pretty challenging to support multiple clouds without doing a lot of extra work.

Approach: cloud agnostic dev platform

To overcome this we built a cloud agnostic developer platform to support our data platform. Between these blue lines here, our services that make up our data platform, things like SQL, notebooks, job scheduling, emo flow, cluster management, et cetera. To make sure that these services can operate seamlessly in multiple clouds. We developed a service framework API that’s common across the different clouds that we support. This API builds on a few lowest common denominator cloud services that are pretty similar on each cloud. Things like virtual machines, networks, databases, and load balancers. When it Databricks engineer works on a data platform service like emo flow for example, they would use our service framework API to manage APIs, RBCs users, permissions, billing, testing, and deployment for example, They would write the code once and they want you to worry about the differences in each cloud they will just work.

Challenge: not everything can be cloud agnostic

Now we learned that not everything can be cloud agnostic. No matter how much we try to abstract away the details. First of all customers actually want to integrate with some of what is best about each cloud. And we want to support the standards of each cloud. Second, even lowest common denominator cloud services like BMS have implementation quirks that reveal themselves once you start building on them. And therefore they’re not exactly identical.

Approach abstraction layer for key integrations

To handle integration with the standards of each cloud our service framework also acts as an abstraction layer for the differences in each cloud that we have to integrate with. For example, the ways that each cloud do authentication authorization, key management, billing, and storage are pretty different. Yet most data platform features need to deeply integrate with all of these properties. So to harmonize the differences in each cloud, our service framework API abstracts away the differences as much as possible so that a service and a put data platform feature like emo flow just needs to know about our service framework notion of authentication and key management rather than that of each cloud. So for example, it wouldn’t need to know about like KMS for Azure key vault. You’re just know about our own notion of bring your own key for encryption.

For example, consider storage. Storage is a very special example because both S3 and Azure data Lake are the most common and cost effective ways to store a large data objects in your cloud. And typically the storage layers typically form the basis form data lakes that most of our customers use. However, S3 is eventually consistent, whereas Azure data Lake is strongly consistent. So they have pretty different semantics. That is after writing a file to S3 the next read of the same file isn’t guaranteed to see the same data whereas it is an Azure data Lake. Obviously having each application deal with this semantic difference would be a pretty big headache so we abstracted away this difference with an S3 commit service which basically makes interaction with S3 as strongly consistent as Azure data Lake there by harmonizing the data access layer for our data platform factory services.

Approach; harmonize “equivalent” cloud service quirks

The abstraction layer to handle cloud standards is necessary but it’s not sufficient to make our service platform clouding agnostic. In reality, even the so called common cloud services have quirks that are a pretty big nightmare to deal with. If you just try to lift something that’s working in one cloud and try to make it work in another. For example, consider virtual machines with tools like packer it’s pretty easy to actually create the exact same image in each cloud. So you can actually launch the exact same binary and its running the exact same code, exact same operating system, et cetera. However, when using BMS as elastic compute the creation time and deletion time of those VMs matters a lot to the user experience. For example, when you’re trying to spin up or tear down spark clusters. And if you’re not careful about adopting to cloud limits in APIs, you might end up creating and deleting VMs faster than the cloud can return the quota to you. And you might end up in a deadlock state where you can’t actually create more VMs, even though you think you have more quota. And this is one of the examples of the experience we ran into that we had basically had to work around in order to make sure that we can create VMs and elastic spark clusters successfully in both clouds.

Similarly, we found that different clouds handle the humble TCP connections very differently. There are a lot of invisible middle boxes that will time out and close your connections without warning. There are small differences in data sender network hardware reliability that can cause pretty catastrophic application failures if you’re not aware of them. For example, we found that in 2019, there was a major TCP bug in the Linux kernel. And basically what this did is that it caused some TCP connections to hang forever. But we really only experienced this bug in one of our clouds because of very small differences in reliability.

Finally, even though you can run things like MySQL and other databases and all clouds. It isn’t exactly the same. Things we’ve thought we found we thought would be trivial like differences in case sensitivity or insensitivity of the underlying operating system. For example, whether you’re like files can be case sensitive the named or not can reach into the database and bite you if you aren’t careful.

So to make a truly clouding agnostic platform layer. We are continuously sussing out these quirks and masking them from our everyday developers so they don’t have to worry about them.

The final lesson I’ll talk to you about today is how important it is to use data and AI to accelerate the data platform itself.

nception: mproving a data platform with data & Al

It actually is really hard to build and operate a data platform without data and AI. You actually need a lot of data and AI to actually build a data platform itself. Thus Databricks is actually one of its biggest customers. And we use Databricks very heavily. A data platform needs data from many things, it this data to track usage, maintain security, and use data to observe users of the data platform and approve it for them. And it needs data to keep itself up and running. And we found that having a simple and powerful data platform like Databricks has been essential to building the data platform itself, thus data and AI accelerates the data platform.

We use Databricks for many things and this includes key platform features like usage and billing reports where we want to deliver actual reports about customers usage to them. And also our audit logs, where we need to develop deliver a security logs for our customers so they can figure out what their users are doing. We use Databricks for analytics and for future usage to look at trends and to model growth, to look at churn and other forecasts. And these are some of the same features that our customers are also using Databricks for. And obviously we’re also a data customer because we use a lot of data’s and analyze our own product. And finally a data platform itself is a managed service and a managed service needs realtime data to operate effectively. And this is for things like mission critical devops, and monitoring observability so that we can gather and analyze everything from KPIs to APIs and other logs like Spark debug logs, which are necessary both to help our customers to debug their workloads and also to the debug problems with the platform when there are issues.

Out distributed data pipelines

So to do this all we’ve built several globally distributed data pipelines with the Databricks data platform at its foundation. Like our service deployments we use declarative templates to deploy our data pipelines so that we can rapidly iterate and replicate them. We have a templating tool called the stack CLI, which you can find in our documentation that you can use yourself deploy these data pipelines at Databricks or in Databricks. Each of our globally distributed Databricks deployments streams basically stream data from ETL data into our data processing engine which is built on top of Databricks and Delta. But it also tightly integrates with other big data tools like Prometheus’s elastic search and Jaeger for distributed tracing. This allows us to see analyze and model the usage and help of the Databricks data platform in real time and historically. Right now our data pipeline processes hundreds of terabytes of logs per day, and it analyzes millions of time series in real time. This data pipeline has really been essential to the continued evolution of the data platform that is Databricksn that it’s actually built on and thus Databricks actually accelerates itself.

So in summary, the Databricks architecture manages millions of BMS around the world in multiple clouds. We’ve learned several important things while building this massive data platform. First, the factory that builds and evolves the data platform is more important than the data platform itself. The data platform of the future will not look like the one today. And the factory is what will build that data platform of the future.

Second, a cloud agnostic platform that integrates with each cloud standards and quirks was the key to building a multi-cloud data platform. And it’s harder to do than it first appears because of the performance scaling integration needs of big data.

Finally, data and AI can accelerate the data platform, features product analytics and devops. You can’t really build a large scale data platform without actually analyzing a lot of data.

And that brings me to the end of my talk and thank you for listening. Oh, and PS if you’re excited by anything in might talk today, Databricks engineering is hiring and I hope some of you out there will come to join us.

Watch more Spark + AI sessions here
Try Databricks for free
« back
About Jeff Pang


Jeff Pang is a Principal Engineer at Databricks on the Platform team. He has lead the development of several parts of the Databricks Unified Analytics Platform, including Databricks Notebooks, Databricks Community Edition, and Azure Databricks. Prior to Databricks, he worked on large-scale data analytics at AT&T Labs - Research. He obtained his PhD from Carnegie Mellon University and a BA from U.C. Berkeley.