Creating a Lakehouse on AWS

May 26, 2021 03:15 PM (PT)

Ever wanted to get the low cost of a data lake combined with the performance of a data warehouse? Welcome to the Lakehouse. In this session, learn how you can build a Lakehouse on the AWS cloud platform using Amazon S3 and Delta Lake. Integrate with AWS Glue and make the content available to all of your AWS services like Athena and Redshift. Learn how other companies have created an affordable, high-performance Lakehouse to drive all their analytics efforts.

In this session watch:
Denis Dubeau, AWS Partner Solution Architect, Databricks
Igor Alekseev, Partner Solution Architect, AWS



Denis Dubeau: Hi everyone, and welcome to our “Creating a Lakehouse on AWS” session. My name is Denis Dubeau and I lead the team of Partner Solution Architects at Databricks dedicated entirely to the AWS cloud platform. Today I’m joined by Igor Alekseev from AWS, and together we will show you how to build a lake house on the AWS cloud platform using a modern data architecture, functioning under one common data foundation to unify, scale, and simplify all the workloads across your enterprise. So on that note, let me pass the control to Igor for a quick introduction and get us started.

Igor Alekseev: Great. Thank you, Denis. Hi, my name is Igor Alekseev and I’m a Partner Solution Architect at AWS supporting Databricks. Before we dive into lake house architectures, I first want to talk about the trends that are driving them.
Customers are telling us that there are more and more new sources, with data coming from sensors and from relational databases, and this drives the complexity of existing data architectures. There is also a variety of personas: different types of people want to use data lake houses and data architectures. For example, you might have a SQL-only analyst who wants to use your lake house architecture, or you might have a data scientist who prefers to use notebooks, or you might have data engineers writing data jobs.
There is also growth in open data formats, such as Delta Lake. Streaming data is important as well: customers want to analyze data before it’s loaded, to reduce time to insights. Customers also want a single source of truth. They want unified analytics underpinned by the same data, and unified security is important because it simplifies compliance when all your security controls are in one place. On the next slide, I’m going to talk about the actual architecture and what it looks like on AWS.
So what is a lake house on AWS, and how does AWS view the lake house architecture? At the center of it is Amazon Simple Storage Service, S3, and we see three motions in the lake house architecture, with services around the perimeter. The first motion is from the outside in: outside services write data to S3. You can think of this as a migration of data to S3, or as continuous ingestion of data into S3, with services like Snowball, Snowmobile, Kinesis Data Firehose, and Kinesis Data Streams.
Another motion is from the inside out. Services such as Redshift or Amazon Kinesis can also read data from S3. And the third motion is around the perimeter, where services integrate with each other to get a view of the data storage. For example, with Amazon Elasticsearch Service, you can use Amazon QuickSight to get the data from Elasticsearch.
So what are the benefits of the lake house approach? You can store any type of data, and the data always stays on S3. The lake house scales to terabytes and exabytes of data, and because the data is stored on S3, the cost is low. You get security, compliance, and audit capabilities across your data lake, you empower all kinds of personas because of the different services around the perimeter, and you use the best-fit service for your analytics needs.
This also allows you to democratize machine learning: you can now do it from SQL services, for example Aurora now supports ML model inference, and in many cases you will not need ETL. It also allows you to unify analytics across operational data, data warehouses, and data lakes.
On the next slide, we’re going to dive into a common security use case, an example of what you can achieve with a lake house. In this use case, you store your data on S3 and use IAM roles to manage access to it. Your data is encrypted at rest using the KMS service, and you use your Lake Formation Data Catalog to give different services access to your data, the services on the right, such as Athena, Elasticsearch, and Redshift.
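To make that pattern concrete, a minimal IAM policy for the role the analytics services assume might look like the sketch below. The bucket name, region, account ID, and KMS key ID are hypothetical placeholders, and a real deployment would layer Lake Formation table- and column-level permissions on top:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadLakehouseData",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-lakehouse-bucket",
        "arn:aws:s3:::my-lakehouse-bucket/*"
      ]
    },
    {
      "Sid": "DecryptLakehouseData",
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
    }
  ]
}
```

Keeping grants like this in one place, rather than per analytics engine, is what makes the consolidated audit story possible.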
This is an example of how you can consolidate all your security needs in one place and make sure that you’re not missing anything. On the next slide, I’m going to talk about support for open-source formats such as Delta Lake. Delta Lake is a format used by Databricks, and Denis is going to be talking more about this.
Currently, Amazon Redshift and Athena both support the Delta Lake format. In this particular example, data is stored on S3. A manifest is generated, and Amazon Redshift Spectrum and Athena can read this manifest. They get the manifest location from the Glue Data Catalog, which is where the metadata is stored. This allows you to handle cases such as change data capture, or GDPR requirements where you need to delete data from your datasets. It integrates with Redshift and Athena: you can do your ETL on Databricks and access the data in Athena, and the data stays in Parquet format on S3.
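The manifest itself is conceptually simple: a plain-text file listing the paths of the Parquet files that make up the current table snapshot, so engines like Redshift Spectrum and Athena read the list rather than scanning the directory. Delta Lake generates and maintains this for you; the sketch below just illustrates the idea, with hypothetical file paths and a local temp directory standing in for S3:

```python
import os
import tempfile

def write_symlink_manifest(data_files, manifest_dir):
    """Write a symlink-style manifest: one data-file path per line.

    A query engine reading the manifest learns exactly which Parquet
    files form the current snapshot of the table.
    """
    os.makedirs(manifest_dir, exist_ok=True)
    manifest_path = os.path.join(manifest_dir, "manifest")
    with open(manifest_path, "w") as f:
        for path in data_files:
            f.write(path + "\n")
    return manifest_path

# Hypothetical Parquet files making up the current table version.
files = [
    "s3://my-lake/sales/part-00000.snappy.parquet",
    "s3://my-lake/sales/part-00001.snappy.parquet",
]
manifest = write_symlink_manifest(files, tempfile.mkdtemp())
print(open(manifest).read().splitlines())
```

Because deletes and updates produce a new set of data files, regenerating the manifest after each change is what lets downstream engines see GDPR deletions or change-data-capture results.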
So I’ve talked about, next slide, yeah. I’ve talked about how the lake house approach helps you build various architectures, but I want to talk about how Databricks fits into this architecture. You see the services around the perimeter; imagine that Databricks and other partners can be one of these services. They can participate in the outside-in motion, for example, Databricks can write to S3. They can participate in the inside-out motion, Databricks can read data from S3 as well. And they can participate around the perimeter, where Databricks, for example, can integrate with QuickSight for visualizations. And with that, I’d like to hand over to Denis.

Denis Dubeau: Thanks, Igor, for summarizing the benefits of the lake house approach, and let me add that Databricks, as you mentioned, integrates with most of the AWS data and AI services today. So please make sure to contact Igor or me if you’d like to explore or further discuss any of these integrations.
First off, let me start by saying that Databricks built the first lake house platform in the cloud, which offers one simple platform to unify all your data analytics and AI workloads. We’re trusted by over 5,000 customers across the globe, and we foster a unique culture of innovation, expressed through a variety of very popular open source projects such as Apache Spark, Delta Lake, and MLflow.
In today’s economy, every company is feeling the pull to become a data company, and at Databricks our entire focus is on helping customers apply data to their toughest business problems. For instance, when large amounts of data are applied to even simple models, the improvement on use cases is exponential. Take Nationwide: the explosive growth of available data and increasing market competition are challenging all insurance providers to offer better pricing to their customers. With hundreds of millions of insurance records to analyze for downstream ML, Nationwide realized that their legacy system was too slow and inaccurate, providing limited insight to predict the frequency and severity of claims. With Databricks, they were able to employ deep learning models at scale to provide more accurate pricing predictions, resulting in more revenue from claims.
Another example is Riot Games, a $2.5 billion gaming company that doesn’t even sell a game. What they sell is microtransactions in a free-to-play game. For example, a microtransaction could be upselling you on a new skin or a new weapon, and what they need to do, in real time, is provide personalization within the game so that it makes the right offer at the right time for the right player. They do that using machine learning models built on Databricks.
Stories like this exist in every industry, and this is only a very small sample of our thousands of customers from various industries and verticals, and by the way, half of them are Fortune 500 companies. Unquestionably, the key takeaway from this slide is that every one of these organizations, big or small and worldwide, is making significant investments to become a data-driven company. And on top of that, Gartner has recently estimated that AI projects will generate almost four trillion dollars in business value by next year. That’s four trillion dollars.
Now, unfortunately, it’s true that most enterprises are still struggling with data silos all over the place these days, depending on what they’re actually trying to do. If you take a step back, you can see that most customers are challenged with four distinct stacks: a stack for data warehousing, one for data engineering and heavy ETL, a separate stack for streaming and rapid ingestion, and of course, ultimately, a data science and machine learning stack. These are very different technologies that don’t necessarily integrate or play well together.
Now, this is a very complex diagram with a ton of technologies that have to be stitched together, and you’ve got to hire resources with all the proper skill sets. These silos become very complex to manage and open you up to governance and security issues. Moreover, they slow down the innovation process and make it very difficult to iterate and collaborate across the personas of your organization.
In the long run, you have all of these disconnected systems, data formats, and handoffs between layers, and as you might imagine, this is very expensive and resource intensive, making it incredibly difficult for your data teams and users to actually work together. This is a problem because most lines of business don’t have a data science use case or a data engineering use case. They instead have a fraud detection, Customer 360, or demand forecasting use case. Most organizations are like, “Well, there’s got to be something better than this,” and fortunately there is, and it’s called a lake house.
We’ve now seen the industry gravitate around this concept of a lake house pattern, which incorporates the notion of the data lake, supporting all of your data in an open format that, as Igor mentioned earlier, is easy to access, but brings in the data management capabilities of the data warehouse to provide the performance, reliability, fidelity, and integrity to support all your workloads. Every cloud provider and system integrator in the industry is clearly promoting the lake house architecture pattern. On top of that, most customers have a similar desire to adopt the pattern, to simplify the architecture we just saw, very complicated, very challenging to drive business value, but in reality they haven’t, because it’s hard. Truthfully, this is why Databricks has invested so much in the lake house: to provide one platform to unify all of your data analytics and AI.
So now let’s explain the core technology that enables the lake house platform. Igor touched a bit on this, and I’ll expand on it a little. Delta Lake is an open-source project hosted by the Linux Foundation, which is also known for supporting quite a few distinguished open source community projects like Linux, Jenkins, and Kubernetes, just to name a few. Delta brings a structured transactional layer on top of your data lake environment, giving you the best capabilities of the data warehouse and the best features of the data lake, to provide the reliability and performance you would expect and need for all your use cases. This is the foundation of the data lake house platform, which is available on AWS and on all major clouds, and deeply integrates with each vendor’s cloud capabilities and offerings.
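To give a feel for what that transactional layer does, here is a toy sketch of Delta’s core idea: each commit is a numbered JSON file of add/remove actions in a `_delta_log` directory, and readers replay the log in order to reconstruct the current table snapshot. This is a heavily simplified illustration, not the actual Delta Lake implementation (real commits also carry protocol, metadata, and statistics):

```python
import json
import os
import tempfile

def commit(table_dir, version, actions):
    """Append one commit as a numbered JSON file under _delta_log/."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def snapshot(table_dir):
    """Replay the log in order: 'add' actions bring a file into the
    snapshot, 'remove' actions take it back out."""
    log_dir = os.path.join(table_dir, "_delta_log")
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return sorted(live)

table = tempfile.mkdtemp()
commit(table, 0, [{"add": {"path": "part-0.parquet"}}])
commit(table, 1, [{"add": {"path": "part-1.parquet"}},
                  {"remove": {"path": "part-0.parquet"}}])
# A reader at the latest version sees only the files still live.
print(snapshot(table))
```

Because writers only ever append new commit files and readers replay the log atomically, concurrent readers never see a half-finished write, which is exactly the warehouse-style reliability being layered onto plain files in S3.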
So the Databricks platform is unique in three ways. It’s simple: data only needs to exist once to support all your data workloads on a common S3 platform. It’s open: it’s based on open source and open standards to make it easy to work with your existing and future tools. And it’s collaborative: all your personas, from data engineers to business analysts to data scientists, can work together much more easily using our collaborative environment and multi-language support. This is really why Databricks is the data and AI company; no one else can say their platform is simple, open, and collaborative within a single solution.
Let’s explore this a little bit more. With a lake house, you basically eliminate the data silos, since much of your data remains in your data lake rather than being redundantly copied into other systems. It’s easy to maintain everything in S3; you don’t have to copy this data redundantly into other areas for other SaaS providers or SaaS capabilities. Everything gets consumed directly from your S3 environment. So you no longer need separate silos of data architectures to support your existing workloads. Instead, you’re able to achieve one common data foundation, and Databricks provides the capabilities, the workspaces, and the horsepower to handle all of these workloads within the one platform, giving you one common way to do things across your entire organization and enterprise.
Databricks is built on innovation and we’ve built some of the most successful open source projects in the world, which really underpins everything we do. All of these projects are born from our expertise in the space and our commitment to open source is why we believe your data should always be in your control, and finally, with Delta Lake there’s no vendor lock in on your data.
Because it’s built on open technology, the Databricks lake house also unifies the entire data ecosystem. As we look at our ecosystem, this list of amazing partners, from ingestion, ETL, ML, BI, and governance, plus the many system integrators that can easily work with this platform, should give you an additional level of comfort about our lake house platform and its adoption across the partner community.
Ultimately, your unified data architecture and ecosystem allow your data teams to work together more effectively. Now, we’ve mentioned collaboration a few times, but I really want to stress that Databricks is an incredibly collaborative platform of models, dashboards, notebooks, and datasets that all personas in your organization can really use.
Think of it this way: your data engineer runs a pipeline to prep the data, your data scientist creates a model, and the resulting artifacts are consumed, perhaps, by the data analyst. It’s basically a one-stop shop for all your collaboration to happen while staying within the single boundary of a unified platform.
The advantages of the lake house have been impacting many familiar enterprises, and I’ll just touch on one of these. Let’s look at Shell, for example. They have about 70-plus use cases across the business, from internal reporting and decision-making to tapping into new energy sources like wind and solar. They want to optimize their supply chains, they have loyalty programs for planning and recommendation, and they’re looking at lubricant analysis. The challenge they had, or have, is primarily the complexity and exponential growth of their databases and the lack of the data engineering and data science skill sets needed to build a data-powered solution.
They basically created a COE, and part of the solution was Databricks, which has been able to provide Shell with a unified platform that empowers hundreds of engineers, scientists, and analysts to innovate together rapidly through the democratization of data analytics and AI. The lake house has been extremely powerful and has allowed them to federate all these use cases on a single platform, enabling their users and customers to rapidly deliver on business insights. Regeneron, Showtime, there are many stories like that as well, but at the end of the day, what we’re trying to do is make our customers really successful in their journey.
In addition, the technology is one thing, but we also bring a strong, collaborative partnership to every engagement. We have expert services and trusted advisors who can help you enable your users and your IT teams to adopt the platform. We have significant experience from partnering with thousands of customers, which puts us in a unique position to provide a proven approach across your data, your people, and your processes, putting it all together with a structured set of workstreams and activities that can help you quickly onboard thousands of users while providing the central capabilities to run a single platform, all of this on AWS, leveraging S3 as your common foundational data layer.
I want to leave you with a call to action here; there’s much more to be learned about the lake house. We have wonderful sessions throughout the summit this week. Also, feel free to look up a lot more information regarding our integrations on the AWS platform. There’s also a URL where you can get free training, so please look it up and sign up for all the great free training we have on Databricks on AWS. And we have an upcoming Dev Day on June 17th, which is specifically around the lake house on AWS and how we can accelerate your journey on the AWS platform using the lake house architecture.
On that note, I think that’s what we had for you today. We hope you enjoyed the content. Also, definitely stop by the booth and come see our office hours. We have a number of good topics, whether it’s how to deploy Databricks using the Quick Start, further office-hours conversations around Delta Lake, or our new PrivateLink rollout feature on AWS. So we’re pretty excited to share this information with you folks.

Denis Dubeau

Denis Dubeau is a Partner Solution Architect providing guidance and enablement on modernizing data lake strategies using Databricks on AWS. Denis is a seasoned professional with significant industr...

Igor Alekseev

Igor Alekseev is a Partner Solution Architect at AWS in Data and Analytics domain. In his role Igor is working with strategic partners helping them build complex, AWS-optimized architectures. Prior...