Building a Cloud Data Lake with Databricks and AWS

How are customers building enterprise data lakes on AWS with Databricks? Learn how Databricks complements the AWS data lake strategy and how Databricks integrates with numerous AWS Data Analytics services such as Amazon Athena and AWS Glue.

Speakers: Denis Dubeau and Igor Alekseev


– Hi, everybody. Thanks for joining us today for our session, Building a Cloud Data Lake with Databricks and AWS. I’m Bryan Durkin, and I want to introduce our speakers today. We have Denis Dubeau of Databricks, who is our manager of AWS Partner Solutions Architects, and Igor Alekseev with AWS, who is a Partner SA for Data and Analytics. Okay, with that, I will turn it over to Igor.

– [Igor] Thank you, Bryan. Hi, my name is Igor Alekseev, and I’m a Partner Solution Architect at AWS assigned to Databricks. Today we’re going to talk about the role of cloud data lakes in modern analytics. But before we start, let’s look at the critical drivers for modern architecture.

First, the increase in data volume: companies capture more and more data, and data growth is exponential at this point. There’s also increased variety of data. Companies are capturing logs that could be in JSON format or CSV format; the data could be structured, unstructured or semi-structured. Enterprises are increasingly using a variety of data. There is also an increase in the complexity of data use. There are machine learning workloads that use the data, there is complex analytics, and there is a variety of data access patterns: mobile devices, streaming analytics, dashboards, and traditional ad hoc querying and reporting. At the same time, users no longer tolerate data sitting in silos; data needs to be available in one common location.

Customers want to get more value from their data. As I mentioned, data is growing exponentially, it comes from many new sources, and it is increasingly diverse. It is used by people in many different roles: you have data engineers, data scientists, analysts, and users who are only familiar with user-interface-driven data analysis. And data is analyzed by many applications. You can, for example, have different BI tools connected to the same data set; different users may have their own tools that they use for data analysis.

So what is a data lake? A data lake is an architectural approach that allows customers to store massive amounts of data in a central repository. You can expect that data to be accessed by machine learning and by different types of analytics. You should expect data to be loaded into the data lake via real-time movement, or to have data coming in from on-premises.
The key is that the data needs to be readily available to be categorized, processed, analyzed and consumed by diverse groups within an organization. Today’s data lakes enable data science. But what can you expect from today’s data lakes? You should expect structured, semi-structured and unstructured data to be stored in the data lake. You should be able to run analytics on the same data without movement. This is important in order to be able to scale your analytics: you cannot expect to move around petabytes of data. It puts stress on your network, potentially disrupting neighboring applications. You should expect your data lake to scale storage and compute independently. This is important for asymmetrical situations: for example, you may actively use only a recent subset of the data for machine learning, but keep a longer tail of data for historical purposes. Schema is now frequently defined during analysis, so-called schema-on-read. There is also an expectation of durability and availability at scale. And you should expect security, compliance and audit capabilities from your data lake.

For data lakes and analytics, AWS offers open and comprehensive solutions. You should expect your data lakes and analytics on AWS to be secure, scalable and cost-effective. You should expect to be able to load data from an on-premises environment, or to move real-time data in from streaming sources. You should expect data in your data lake on AWS to be available for machine learning and analytics. There’s one service I would like to highlight in particular that underpins all of the infrastructure of data lakes on AWS: Amazon Simple Storage Service (S3). It’s a secure, highly scalable and durable object store with millisecond latency for data access. You can store data in multiple formats, such as CSV, Avro, Parquet or JSON. It’s built for 11 nines of durability, and it supports several different forms of encryption.
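To illustrate the schema-on-read idea mentioned above, here is a minimal pure-Python sketch (illustrative only, not Databricks or AWS code): records land in the lake as raw text with no schema enforced at write time, and a schema, field names and types chosen here for the example, is applied only at analysis time.

```python
import json

# Raw records land in the lake as-is; nothing validates them at write time.
raw_lines = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "7", "country": "DE"}',  # clicks arrives as a string
]

# Schema-on-read: field names and types are applied during analysis,
# not when the data was stored.
schema = {"user": str, "clicks": int}

def read_with_schema(lines, schema):
    for line in lines:
        rec = json.loads(line)
        # Project and cast only the fields the analysis cares about.
        yield {field: cast(rec[field]) for field, cast in schema.items()}

rows = list(read_with_schema(raw_lines, schema))
```

With schema-on-write, by contrast, the second record would have been rejected or coerced at ingestion time rather than at read time.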
You can run analytics or machine learning workloads on the data lake without any data movement. You can classify, report and visualize data usage trends. The variety of data includes images, videos, sensor data, weblogs and documents.

AWS offers the most ways to transfer your data into the data lake. From on-premises, you have the ability to establish dedicated network connections using services such as AWS Direct Connect. There are also secure offline appliances that you can use to transfer your data, such as AWS Snowmobile and AWS Snowball, as well as Snowball Edge. These are ruggedized appliances (in Snowmobile’s case, a shipping container) that you attach to your network; you load the data and ship them to Amazon data centers, which load the data onto S3. There are also the Database Migration Service and AWS Storage Gateway, which lets your applications write directly to the cloud. Data movement can also be done in real time: you can connect devices to AWS using AWS IoT Core, and you can push data through real-time data streams or video streams with Kinesis Data Firehose or Kinesis Video Streams. If you are using Kafka, you can take advantage of Amazon Managed Streaming for Apache Kafka.

Amazon offers you the most comprehensive and open portfolio of machine learning and data analytics services. You can think about this portfolio as four different layers. At the top, you have data visualization dashboards, Digital User Engagement services, and predictive analytics. For predictive analytics you can use SageMaker; for dashboards, for example, Amazon QuickSight. For data movement, we’ve already talked about the Database Migration Service. For analytics, you can use services such as Redshift for data warehousing, serverless data processing with Lambda, interactive queries with Athena, operational analytics with Elasticsearch, and real-time analytics with Kinesis. There are also infrastructure and management services associated with that layer.
You have S3, you have Lake Formation for security and management, and you have Glue for the catalog and ETL. And of course there are the migration and streaming services. If you think about these services, I’d like to think about Databricks as a cross-cutting service that goes across all four layers. It gives you the ability to move data into the data lake, do ETL, do analytics, train machine learning models, and publish and produce data to power your dashboards. And with this, I’d like to hand it back to Bryan.

– Okay, thank you Igor, and now I will turn it over to Denis.

– Hi everyone. My name is Denis Dubeau and I’m the AWS Partner Solution Architect Manager covering the U.S. at Databricks. I’d like to thank Igor for sharing his insights on the role of the cloud data lake. Now I’d like to switch gears a bit and describe this new lakehouse data management paradigm that many of you may have heard about lately. I first have to touch on the history of the data warehouse industry to set the background.

Since its inception in the late 1980s, data warehouse technology has continued to evolve, and these large MPP (massively parallel processing) architectures have gotten to where they can handle large data sizes, really dedicated to decision support and business intelligence applications. Now, while warehouses were great for structured data, a lot of modern enterprises have to deal with unstructured data, semi-structured data, and data with higher variety, velocity and volume as well. Data warehouses are not well-suited for these use cases, and they are certainly not the most cost-efficient.

Then, about a decade ago, companies began building data lakes: large repositories of raw data in a variety of formats. Many companies invested in Hadoop systems, as you might know. While data lakes are very suitable for storing large amounts of data in a cost-effective manner, they lack some critical features, like support for transaction boundaries. They do not inherently enforce data quality, and the absence of transaction consistency and isolation makes it almost impossible to mix appends and reads, or batch and streaming jobs.

So what is a lakehouse, really? I would summarize it as a new system design: one that implements data structures and data management features similar to those in a data warehouse, while leveraging all the aspects and features of the data lake. Let’s talk about the role of the cloud data lake.
So let’s dive a little deeper and really look into the promises of the cloud data lake. Cloud data lakes are great for storage, since the underlying store can hold a wide variety of data types, as we just discussed. The data could be in any format: transactional, image, video, speech, weblogs, IoT data, or any other type of streaming data that can quickly be ingested and stored. They provide an open storage format, which means you can choose and take advantage of the most commonly adopted file formats, like Parquet, JSON, CSV and TXT, and access that data via multiple applications or services. You’re also able to separate your storage from your compute, so that you can store all of your data in a single location and only provision the necessary compute as needed to process it. I really can’t stress this enough: store once, in a single repository, for all applications, with just enough just-in-time compute resources. It also allows you to store the many different types of structured and unstructured data sets I’ve referred to, at the lowest cost options available as well.

Of course, many organizations will want to operationalize their data. To do so, they often need sophisticated features that you would usually expect from a relational database management system. You’ll want features like ACID transactions, so that you either fail or succeed at the transaction level; the ability to take point-in-time snapshots; and the ability to create optimized indexes for very fast query access. So you get all the benefits of a data lake, like the flexibility of schema-on-read, or even enforcing schema-on-write, combined with the simplified, unified and reliable capability of streaming and batch processing, while keeping all of this in an open format with no vendor lock-in. So let’s talk about the cloud data lake blueprint.
And we found that there are seven best practices that are universally well aligned with AWS. We covered this topic extensively at re:Invent last year, and many of these topics will be covered at re:Invent this year as well, so stay tuned for all of the advantages and new features coming up in those events. But first, let’s go through this set of topics.

First, you want to be able to reliably process streaming and batch data, as we’ve just mentioned. The second blueprint item is to make sure you can transform your data using an open data format such as Parquet, which is platform-independent, machine-readable, and whose specification is published to the community: very open, no lock-in. Number three is to prevent partial writes: having the ability to roll back on failures, to ensure the integrity of your data, so that you don’t end up with partially written data sets lying around in your data lake. Number four, you want to curate your data and refine it in a series of steps, following a standard, commonly adopted framework in the industry. Number five, you want to use an enterprise data catalog, where you can catalog all of your assets in a centralized, scalable and secure repository, so that all of your services and tools can access this metadata. Number six is to run your business in the most secure cloud computing environment, with the proper governance and audit procedures. Finally, you want to be able to distribute and publish your results to your downstream systems and business applications, to ensure the proper actions are taken.

So let’s introduce Delta Lake. First, we’re really proud to share that Delta Lake is an open source project hosted by the Linux Foundation, so all of the details are available publicly as well.
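Blueprint item number three, preventing partial writes, can be sketched in plain Python with the classic stage-then-rename pattern (a hedged illustration of the idea only; Delta Lake achieves the same all-or-nothing effect with its transaction log rather than file renames):

```python
import os
import tempfile

def write_atomically(path, records):
    # Stage output in a temp file in the same directory, then atomically
    # rename it into place only if every record was written successfully.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "w") as f:
            for rec in records:
                f.write(rec + "\n")
        os.replace(tmp, path)  # atomic on POSIX: readers see all or nothing
    except Exception:
        os.remove(tmp)  # roll back: discard the partial file
        raise

out = os.path.join(tempfile.mkdtemp(), "batch.csv")
write_atomically(out, ["a,1", "b,2"])
```

A failed job leaves no half-written data set behind, which is exactly the property the blueprint asks for.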
Delta basically uses, as we’ve mentioned, an open source file format, Parquet, and we also provide a transaction log, which allows us to provide reliability and performance on the data lake while being compatible with Apache Spark APIs. Let’s take a few minutes to cover what I just said: there are two main parts to Delta. First, we version the Parquet files in your S3 bucket so that we can perform all the modifications on your data. Secondly, we keep a transaction log of all these operations, and we store that transaction log in your S3 buckets as well. So it’s all leveraging S3 and Parquet, together with the transaction log that provides the reliability you would expect from your data lake.

So why Delta? Basically, we found a number of different challenges with data lakes. Number one, Amazon S3 is a highly available, scalable and durable object store, but it does not support object locking. As you are writing file content to a specific location (one you have permission to write to, obviously), you cannot enforce constraints on the data attributes when you’re writing to S3 buckets. So there’s really no sense of schema enforcement. Secondly, you’ll definitely encounter data engineering pipeline failures, and as you run into these, you end up in an unreliable state where you might have written partial content caused by those failures. Now you have to go through a manual process to remove the data that was written and then rerun these jobs or pipelines. And the last challenge is that there is no awareness of transaction boundaries, which is what you’d find in an RDBMS; there is no consistency enforced by the file system natively.
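To make those two parts concrete, here is a toy pure-Python model of the idea (an illustration of the concept, not Delta Lake’s actual on-disk format): a table directory holds Parquet data files, and an append-only `_delta_log` of numbered JSON commits records which files are currently part of the table.

```python
import json
import os
import tempfile

table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

def commit(version, actions):
    # Each commit is an append-only JSON file named by its version number.
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

# Version 0: two Parquet data files are added to the table.
commit(0, [{"add": "part-0000.parquet"}, {"add": "part-0001.parquet"}])
# Version 1: a rewrite removes one file and adds its replacement.
commit(1, [{"remove": "part-0001.parquet"}, {"add": "part-0002.parquet"}])

def live_files(upto_version):
    # Replaying the log yields the exact set of files that make up the table.
    files = set()
    for v in range(upto_version + 1):
        with open(os.path.join(log_dir, f"{v:020d}.json")) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                if "remove" in action:
                    files.discard(action["remove"])
    return files
```

Because every version of the log is retained, replaying it only up to an older version (`live_files(0)`) reconstructs the table as it was then, which is the intuition behind the time travel and snapshot features discussed below.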
Consequently, it’s very difficult, almost impossible, to mix readers and writers and ensure consistency at the object level without building some very complex pipelines, and investing a fair amount of resources to do so. So that’s the reliability front. On the performance characteristics of a data lake: you’ll encounter very small and very big files. When you start streaming data, you’re going to have a large number of very small files, which really affects your performance. Partitioning is really a poor man’s indexing, as you know: once you decide on the partitioning, there’s really no way to build strong, efficient indexes. And there’s really no caching available on file storage either.

So let’s talk about how we ensure reliability using Delta Lake. I touched a little bit on that already. We offer a number of features, and one of them is the transaction log that is written to your S3 environment; that transaction log sits on top of the Parquet files, and really that’s what Delta Lake is all about. The transaction log allows us to manage batch and streaming data together, because we have control over the transactions coming through. There’s a set of features that we provide with open source Delta Lake: support for ACID transactions, the schema enforcement we just talked about, the ability to unify batch and streaming, as well as time travel and data snapshots, thanks to the transaction log. Those are all open source capabilities. Now, if you leverage Databricks, the commercial features of Databricks provide, on top of these open source features, the ability to create indexes, auto compaction, and the ability to do data skipping and caching as well.
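The schema enforcement feature can be sketched in a few lines of plain Python (a hypothetical table and schema for illustration; Delta Lake itself validates incoming writes against the table schema recorded in the transaction log):

```python
# Declared table schema: column name -> required Python type.
table_schema = {"id": int, "amount": float}
table_rows = []

def enforced_append(rows):
    # Validate every incoming row BEFORE touching the table, so a bad
    # batch is rejected wholesale (all-or-nothing, like a transaction).
    for row in rows:
        if set(row) != set(table_schema) or not all(
            isinstance(row[col], typ) for col, typ in table_schema.items()
        ):
            raise ValueError(f"schema mismatch: {row}")
    table_rows.extend(rows)  # only reached if every row passed

enforced_append([{"id": 1, "amount": 9.99}])        # conforming batch: accepted
try:
    enforced_append([{"id": 2, "amount": "free"}])  # wrong type: rejected
except ValueError:
    pass
```

The bad batch leaves the table untouched, which is exactly what the combination of schema enforcement and transactional writes buys you over raw S3 files.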
Now that we understand the reliability and performance features of Delta Lake, it’s very simple to implement this commonly adopted industry framework, where you ingest data from your raw sources into what we call the bronze layer. There’s just one image here, but you could have multiple bronze ingestion tables. From there, you apply your filtering and cleansing criteria to produce your silver layer. Again, those relationships are many-to-many: many bronze tables could feed into your silver layer, and vice versa. What we found is that most organizations then typically grant access at the silver layer, so that you can start building ML pipelines, providing data assets with a minimal curation process to your data scientists. Then, as you move data from silver to gold, you really aggregate it into what we call the serving layer of the data: that gold version that you’ll send down to your streaming analytics systems and your reporting and BI. So via the refinement process, you incrementally improve the quality of your data until it’s ready for consumption by the serving endpoints, and you can evaluate schema enforcement and requirements at each level of the framework. The quality improves from left to right, essentially.

All right, so there are a number of integration points in the landscape, and I want to touch on a few of them. One of them is Glue. Let’s talk about the Databricks integration with Glue. As an enterprise metastore, it’s really easy to leverage Glue today with Delta. As you build your pipelines, you can store all your metadata there, so that you can discover the data and share catalog assets across these different pipelines.
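The bronze, silver, gold refinement described above can be sketched in plain Python (hypothetical sensor data invented for the example; on Databricks each layer would typically be a Delta table written by a Spark job):

```python
# Bronze: raw ingested events, warts and all.
bronze = [
    {"device": "d1", "temp": "21.5"},
    {"device": "d2", "temp": None},    # bad record arrives alongside good ones
    {"device": "d1", "temp": "22.5"},
]

# Silver: filter and cleanse (drop bad records, fix types).
silver = [
    {"device": r["device"], "temp": float(r["temp"])}
    for r in bronze
    if r["temp"] is not None
]

# Gold: aggregate into a serving-ready shape (average temp per device),
# the layer you would hand to BI dashboards or streaming analytics.
by_device = {}
for r in silver:
    by_device.setdefault(r["device"], []).append(r["temp"])
gold = {device: sum(temps) / len(temps) for device, temps in by_device.items()}
```

Each step only refines the previous layer, so quality improves from left to right exactly as described.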
Now, once you have this data cataloged in your enterprise Glue metastore, it is really easy to start consuming your data from Athena as well. So it’s a full integration with Glue and Athena via the Databricks Unified Data Analytics Platform. The next integration I’d like to talk about is the Redshift integration. Redshift Spectrum allows you to read Delta Lake tables directly, and we can also take the gold data set from S3 and push it directly into your Redshift instance, if you prefer to serve it that way for a dashboard or a data warehousing workload. So it’s really providing the ability to serve data to your downstream consumers so that they can make business decisions on the curated data. Again, if we go back to the blueprint, that’s the last item we talked about: distributing our results to the lines of business. So whether you’re using Athena or Redshift or another RDS-type service, we have those tool integrations, from a Databricks standpoint, with the first-party services on AWS.

We also do this as a very cloud-native enterprise solution. We centralize all of our assets, leveraging identity and access management as well as the other AWS services to secure this content, whether it’s data in transit or data at rest, as well as controlling access to these different environments, following all the best practices in security and consistent governance. So in a nutshell, if we review what we’ve just talked about, we basically hit on all seven best practices. Reviewing the seven items, all of them are addressed by using Delta Lake on top of your data lake on S3, integrating your catalog at the enterprise level leveraging Glue, and distributing your new data to your lines of business.
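How can an external engine like Redshift Spectrum read a Delta table without understanding the transaction log? Delta Lake can generate a symlink-format manifest: essentially a text file listing the exact data files that make up the current table version. Here is a toy pure-Python sketch of that idea (the table path and file names are illustrative, not actual Delta code):

```python
import os
import tempfile

# Illustrative table location and the files the current version consists of.
table = "s3://my-bucket/gold/sales"
current_files = ["part-0000.parquet", "part-0002.parquet"]

# The manifest is just a newline-separated list of full data-file paths,
# so an external engine reads exactly these files and no stale ones.
manifest_dir = tempfile.mkdtemp()
manifest_path = os.path.join(manifest_dir, "manifest")
with open(manifest_path, "w") as f:
    for name in current_files:
        f.write(f"{table}/{name}\n")

lines = open(manifest_path).read().splitlines()
```

In actual Delta Lake, a `GENERATE symlink_format_manifest FOR TABLE ...` command produces this file under the table directory, and the external table in Spectrum is pointed at the manifest location.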
Whether it’s through Redshift or through Athena, all of this is done in a secure, governed and audited process: a valuable, integrated solution. So let me wrap this up and give you an overview of Databricks on the AWS landscape. If you look at those four pillars, from ingesting data, to the data engineering layer, to serving the data, to the analytics to be consumed, we integrate with a number of first-party services. Whether you’re looking at Kinesis for streaming ingestion or the managed Kafka solution, as well as obviously ingesting data directly from S3; this could also be done by other services, like AppFlow for instance, or third-party solutions as well. All of this data can be ingested within the boundaries of the customer’s VPC: the Databricks cluster instances are spun up in the customer’s VPC, interacting with your S3 data lake directly. And Delta Lake sits on top of your S3 object storage: as I said earlier, S3 Parquet files with a transaction log that provides the reliability and performance Delta gives you by interacting with your data locally. We also support a number of Spot Instances and instance types that are available today on AWS, providing best-of-breed separation of compute and storage. And all these integrations have been optimized. Whether you integrate with Glue, push data to Redshift, have Athena consume data using the Glue catalog, or integrate MLflow with SageMaker, those are all optimized, full integrations to leverage, using best-of-breed services to deploy the end-to-end use cases necessary to address the customer’s pain points. There are a number of additional services that are larger in scope, as you would expect: identity and access management, CloudFormation, CloudTrail, AWS Config, SSO.
For instance, for full single sign-on. So there are a number of additional services that are fully integrated with the Databricks Unified Data Analytics Platform. And we recently launched the ability to deploy Databricks workspaces using AWS Quick Start. I hope you all enjoy the summit, and make sure to ping me if you have any questions.

– Great, thank you both so much. Two things for follow-up: you can get free training on Apache Spark and Delta Lake using the Bitly link here, and you can also find case studies and integration details at the link shown. Thanks, everyone, for listening.

About Denis Dubeau


Denis Dubeau is a Partner Solution Architect providing guidance and enablement on modernizing data lake strategies using Databricks on AWS. Denis is a seasoned professional with significant industry experience in Data Engineering and Data Warehousing with previous stops at Greenplum, Hortonworks, IBM and AtScale.

About Igor Alekseev


Igor Alekseev is a Partner Solution Architect at AWS in the Data and Analytics domain. In his role, Igor works with strategic partners, helping them build complex, AWS-optimized architectures. Prior to joining AWS, he implemented many projects in the Big Data domain as a Data/Solution Architect, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation. Igor's projects have spanned a variety of industries, including communications, finance, public safety, manufacturing, and healthcare. Earlier, Igor worked as a full stack engineer/tech lead.