Powering Interactive BI Analytics with Presto and Delta Lake

Download Slides

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores. Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in Object Storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines. In particular, Starburst, developed a native integration for Presto that leverages Delta-specific performance optimizations. In this talk we show how a combination of Presto, Spark Streaming, and Delta Lake into one architecture supports highly concurrent and interactive BI analytics. Furthermore Presto enables query-time correlations between S3-based IoT data, customer data in a legacy Oracle database, and web log data in Elasticsearch.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– All right hello everyone this is Kamil here co-founder of Starburst and I will be talking about powering interactive AI analytics with Presto and Delta Lake today.

In this presentation I first want to introduce Presto for those who don’t know what it is and talk a little bit about Starburst and what we do here to help enterprise adoption of Presto.

The main topic of my presentation is the Delta Lake integration that we’ve done for Presto and then sort of show how we can combine Presto and DataBreak, Spark, and Delta together in one data platform architecture and how to efficiently use the best, get the beauties of both technologies. And then finally show real use cases where that combination actually delivers the best results for your team. So with that let’s get going.

So Presto and Starburst, Presto itself is open source community driven project.

What is Presto?

At it’s core it’s a high performance MPP SQL engine.

It was designed specifically to be geared towards interactive analytics and perform on SQL analytics over large scale data now ranging from gigabytes on the low end to petabytes on the high end and the design goal for Presto was also to help with high concurrency queries and fast performance overall.

One of the unique capabilities of Presto is that it provides full separation of computational and storage layers so that allows you to scale those two independently, add more storage without adding compute nodes and then boosting compute if you need to perform lots of analytics and then scaling back when you don’t need them at that capacity. So allows you to manage cost and performance elastically and cost effectively.

Also Presto itself is just a compute layer, it accepts a SQL statement, does all the parallel execution of the query and then it’s done, right? There’s nothing in Presto about storing the data. So you have to kind of bring your own storage and storage is most efficient effective for your use case and most appropriate. So in a sense, Presto allows you to run SQL on anything. And with the powerful extenstion, if you’re with us you can actually connect to a variety of different sources and also write your own connectors to sources that are not yet available in the open source or commercially. The side effect of this is that you can also run federated queries meaning that in a single SQL statement, you can actually refer to tables coming from different sources. And that’s allow you to undo correlations of data coming from different sources and perform some interesting analytical queries that would otherwise be waiting for you to bring all your data into one place which obviously is a very long process and you may hit lots of barriers especially in the enterprise settings. Finally you can deploy Presto basically anywhere. We’ve seen successful deployments of Presto initially on Baro Metal on premises systems, then virtualized environments and cloud, now Kubernetes is really the way to go to deploy Presto in essentially any of those environments so gives you a lot of flexibility and you don’t have to be tied to any specific platform and you can move Presto analytical system from one place to the other and go where your center of gravity for that day is.

Presto Users

So Presto’s been around for some time, it’s been about seven years since it was originally developed by the team at Facebook and got initial adoption in large Silicon Valley internet companies and then sort of progressively expanding more and more companies globally different verticals and use cases so you can see some logos here on the screen that just gives you a sense of Presto applies pretty much anywhere for SQL analytical purposes. And it’s been the fastest growing SQL engine anywhere in the open source for sure. It can run at a massive scale so some of the largest deployments especially at Facebook, Uber, LinkedIn, Lyft, we’re talking about hundreds of machines doing petabytes of data and running thousands, hundreds of thousands of queries a day so it’s really really impressive what those companies are doing and pushing Presto to the limits on scale and performance.

Starburst Enterprise Presto

So now Starburst, we are the enterprise Presto company, we are the commercial arm behind the open source project much like Databricks is for Spark.

The value we are bringing to Enterprise customers specifically is around simplifying the usage of Presto provide you lots of security enhancements and integrations, data encryption masking, permissions, integration with LDAP, Okta, Apache Ranger, and other security related protocols.

In the enterprises you typically have lots of different additional connectivity needs beyond just open source open formats. You’ll be talking about querying Oracle, Teradata, DB2s, Snowflake, et cetera, et cetera and those are all packaged in the enterprise distribution of Presto that Starburst offers. On the management side as I mentioned ease of use is super important in the enterprise settings so we manage configuration, autoscaling, HA, monitoring, and just deployment in all those different environments and simplified orchestrate that effectively.

And all this is obviously wrapped into us providing support for Presto and of course we have the largest team of experts in the world, we have Presto creators behind the company right now and provide hot fixes, security patches 24 by 7 support and the thing that’s also very important and that’s also true for Databricks like we are the leading contributor and commuter to Presto itself, the open source project. So we are driving the roadmap and enhancing Presto to get you more performance, more scale, more functionality in the open source and then if you choose to go with our platform and manage enterprise offering, that’s where you get additional benefits as well. Okay so I’ve talked about how Presto can connect to many different sources already, so why get Delta Lake? We obviously got interested in Delta Lake when it was first announced by Databricks a couple years ago then subsequently when it was open sourced last year and Delta Lake is really really exciting technology for many reasons so I’ll just mention several of those here which I think are fundamental. First of all it gives you ACID properties over data lake and this is huge, it’s massive because you can now go and delete individual rows and update individual rows and insert individual rows effectively over your data lake which in the past was quite cumbersome and people did up complex frameworks to do that, that’s now simplified, everyone can treat object storage as essentially a database table which is amazing, that’s how we want to work and that’s how Presto is work so that’s great. It’s on open source table format so multiple different tools can use it, Spark obviously was the initial implementation now we got Presto that is connectivity for Hive and I suspect it will be much more broadly adopted in the future as well.

The actual data underneath is stored as a Parquet file, Parquet files which is great because Parquet is very very efficient format to store the data, it’s columnar, it’s compressed, it has built in min max indices and other performance optimizations which is really great for performance.

Delta supports object storage (audio cuts out) SV, ADLS, and other object storage which is where the world is going when storing large amounts of data these days that’s amazing. On top of all those basic fundamental benefits there are really great features around schema evolution and even time travel allowing you to see what would be the answer for this query if you ran it like a day ago and I think that’s very interesting for many workloads especially in the analytics space.

And then just to further show the benefits of Delta, having the dedicated metadata information that’s outside of high meta store, having statistics on the data, that helps a lot of performance because we can skip the data and if you have data arranged in special order, it also helps to speed up many analytical queries. So I think we all are very very thankful to Databricks for inventing Delta and making that popular file format. It’s getting lots of adoption, many of our joint customers are also users of Delta so it really makes a lot of sense for us at Starburst to enhance Presto and allow it to also query Delta effectively.

Native Presto Delta Lake Reader

As I mentioned Starburst developed a native Presto Delta Lake connector and I would like to acknowledge great help from engineers at Databricks that assisted us with that integration and provided further details in addition to the official specification to implement this effectively. So we decided to build a native connector completely from scratch so there used to be a legacy solution that allows manifest files and high meta store integration and simulate access from Presto to Delta but it was very inefficient because it wasn’t taking advantage of any of the inherent data Delta Lake properties so in this implementation we actually natively read the Delta transaction log, we perform data skipping based on the metadata read from Delta and we are able to fetch statistics about the data and basically leverage those as an input through Presto because it’s optimized and that allows us to effectively perform joins among Delta tables as well as Delta tables and other tables coming from different sources so it was really important to have this native integration built for Presto here.

Okay so once we built the very first version of that we were curious obviously how this performs to this legacy solution which basically treated Delta as a collectional Parquet file so we run the standard TPC-H benchmark and it covers 22 queries in this benchmark we saw on average (audio cuts out) for all the queries and obviously those queries are doing much more than just reading the data so that already is in. The fastest query that we observed in this benchmark which was a single table scan doing some aggregation was actually showing 6x performance boost with the native reader. So that’s pretty substantial and once we get the foundation that we’ve done sufficient implementation of the Delta reader we also shared this with our, as a preview with our early customers and the feedback we got was actually even more enthusiastic because they were reporting speed ups over 10x for the native reader versus the previous solution, so this drives adoptions, helps with performance overall and it makes users more happy to come back and run more queries and do more analytics so I definitely encourage everyone to try out this native integration of Delta with Starburst Presto and please go to the documentation link here on the screen.

So having that is great obviously engineering edition yet another connector for Presto but how you tie all those things together right to work together effectively in one environment and I thought I would spend a few slides on this topic, how to effectively leverage Databricks and Starburst in one architecture.

So the Starburst platform itself obviously at the center of that is Presto as a fast SQL layer that’s able to talk to many sources what we’ve done in addition to just leveraging open source Presto, we’ve built additional connectors to more sources, I mentioned some of those are traditional commercial RDBMS sources, very common in lots of enterprises, Oracle, Teradata, DB2 being the primary examples.

Starburst Platform

We enhance support for additional SV composite storage engines for on premises as well as the clouds, we are on all the clouds obviously and today and Presto can also talk to your modern no SQL stores such as Mongo, Elastic, Cassandra, et cetera. And so having a tool, one tool can reach to all those different sources both legacy and modern places where you may store your data is really really powerful. However, there’s always a challenge with all the diversity of sources how to do it effectively, how to do it securely, and how to manage the whole experience for the end users. So that’s what we sort of call as part of our platforms we call it a data consumption layer and a big piece of that was building a global security mechanisms that will govern secure access to all those sources from Presto and also from the users that are querying this data via Presto and sort of managing permissions, masking sensitive information, ensuring data encryption, both for reading the data encrypted at rest as well as encrypting data in flight when it’s moving from the source to Presto and within the Presto cluster and then between the client tool and the Presto cluster. You can audit all the queries, you can have verifying grained access control down to the table and column and actually you can also apply row filters and further restrict access on certain things. So all of those things are part of established platform edition and on top of that with the integrations and certifications with a number of your favorite BI tools you’ll look at a lot of RBI (presenter mumbles) enabling that plus all the modern tools that can issue SQL starting from Jupyter Notebook, Superset, Redash which I think by now is part of Databricks, all those tools can talk and speak to Presto natively and gives all the users power of the broad set of connectors and security features of Starburst. So that’s on the Starburst side and now it’s a question a lot, I have Spark, I have Databricks, I have Presto and Starburst, how do those things work together, what’s the best way to effectively leverage both?

Different Technologies In Your Toolbelt

And what we see in real life deployments with our customers is that pretty much everyone who leverages Starburst and Presto, they also leverage Spark and often that’s Databricks Spark.

And the reason is that I think those technologies excel at different things and compliment each other very well so if we are talking about streaming ingestion of data, we’re talking about big machine learning jobs or digital intelligence, obviously managing data lake, doing really really heavy long ETL jobs, all those things whether you do it through just native Spark syntax or whether you leverage Spark SQL, all of those things are best positioned to happen inside Databricks and inside Spark and Spark is the way to do this effectively. On the other hand, Presto really it was designed and excels in high concurrency SQL so running tens of hundreds of queries at the same time, if you’re doing some BI reporting analytics, interactive data discovery, using SQL and you want to also federate different sources so you have built a lake and Parquet files obviously Avro also relational databases, no SQL engines, this is where Presto can provide a lot of value and we feel that interactive faster performance is enough of a distinguishing factor that drives adoption of Presto and Starburst and because we have now so many joint customers we feel really compelled to really advocate for a strong architecture.

Data Ingestion and Analytics Ecosystem

If you look holistically at the data ingestion and analytics ecosystem, this is all coming together by having your raw data sources all being ingested by Databricks and Spark into a data lake these days Delta Lake specifically but you can also just put your data on either Amazon SV or Azure DLS and you can run all the machine learning in guide, end flow, or Sagemaker, for those use cases well if you want SQL over this data plus correlate this information with lots of different data sources that are either RDBMS no SQL and all over the place like Starburst and Presto is perfect answer and provides higher and more responsive, high concurrency more responsive SQL access and allows you leverage RDBI tools and SQL editors and Redash, Superset, Jupyter Notebook, and your favorite tools for analytical purposes for the more classical analytical purposes.

Deployment Architecture

In terms of the deployments this is very interesting as well. These days we highly recommend deploying Presto on the Kubernetes cluster and in fact I will be showing Presto deployed on Amazon EKS Kubernetes service later in the demo but we see this pattern of Presto being deployed by a Kubernetes on Azure, AKS, in Google Cloud, in GKE, also (audio cuts out) we work closely with Red Hat and Openship platform for the Archimedes deployment and we’ve seen people deploying Kubernetes pretty much anywhere but as I mentioned you can always do it in many different ways as well if Kubernetes not yet adopted at your company.

We simplify all that configuration, deployment, management sufficiently so it’s really very very integrative way to do all those.

Deployment and management simplified drastically to what you’ve been used to in the past.

Okay so now bringing all this together, right this is all great architectural technology. What’s the real use cases that we can show here?

The one use case that we want to advocate for and we’ve seen this already being leveraged is in a joint architecture Databricks and Starburst we have obviously some IOT data streaming from the IOT devices streaming into Delta Lake

we may have some data coming from classic ERP system, RE batches moving (audio cuts out) from the application layer into again Delta Lake specifically and we can go through all those different layers from ingestion bronze refined silver layer of Delta Lake and then to the aggregate store or the gold layer, which is the best position for fast analytics. And Starburst can come in, everything before will happen in Databricks, now Starburst comes in and allows you to run your fast highly concurrent SQL and fetching data from your aggregate store as well as reaching out to more or less your fine source versions of those tables if needed. If your analyst data scientist prefers to look at the very raw data, that’s also possible from Presto. Now while we have a lot of data being ingested into this data lake it’s not going to be everything you have potentially in your architectural system. We’ve seen a lot of companies coming with data lake strategies and having the center of gravity there, however, your RDBMS traditional Oracle RDB2 et cetera are always there and there is sometimes very interesting data there that you better provide where the information coming from other sources. And for other data sets like more textural data sets say web logs and user comments, less structured data,

things like elastic or MongoDB are really proper ways and places to store them.

Now since your data is spread across so many places you need to be able to query all of those at the same time quite often to arrive at the best insights that can drive your business forward and this is what we can provide here with Starburst and all the connectors we have too for those pathways.

So as I mentioned we’ll be doing some of the real time ingestion of event data and I’ll be showing this in the demo in a moment we’ll be doing also the hourly jobs of putting the more traditional enterprise data and then reaching out to data sources as well and we modified this and refined this and prepare this data for further analytical needs as appropriate in our overall architecture.

So obviously Starburst here provides this single access layer it’s no longer true that we have all the data in one place so there’s no single source of true for data but there’s single access point to query this data and that’s a virtualized query environment is where the power of Presto is.

You can leverage the power of the proper data stores for your data and query data right now rather than wait for maybe ETL job to bring this Oracle legacy data or ex data into one place

because it’s hard to imagine how you want me to modify and prepare this data sometimes for analytical purposes, if you’re given access to this elastic search you can actually push down some processing there and find appropriate data by doing that ex analytics essentially and bringing that back to the rational world and Presto service can provide a global access control for the services so your analysts are properly privileged to data that’s meant for them to be analyzed. So now in this demo obviously the end users will be leveraging BI tools, SQL editor such as Looker, Tableau, Power BI, leveraging connectors like JDBC drivers ODBC and libraries for other languages, and we sort of now want to show all of this in a demo and briefly comment how you put all those things together. Okay so we are in a Databricks notebook here and we have obviously SV bracket mounted and we are setting up a structured streaming ingestion from a kafka topic into a stream that will then save into a Delta table. So setting up here, making sure the connectivity is live, start stream, checking the schema, making sure it’s matching what you expect and now basically issue a command to receive this data and save into a Delta lake and verifying that the stream is actually live, okay so we have ingestion going on. Now we are switching to DB viewer which is a SQL editor tool in which we have mounted all different data sources such as Delta lake and I’ll be trying out to query Delta via Presto from DB viewer client both came up very quickly as you can see now showing the same for elastic search so this is elastic search set up in Amazon Cloud we can query that very quickly as well individually so that’s perfect and then on Amazon RDS we have an Oracle system with some customer information and that together now bringing this all together into one SQL statement we can correlate this information and do the drawing in between all these customer

and execute the SQL and here in the Presto web UI we can see how queries progressing, we can monitor the state of the cluster, and see the progress of this pretty complex joint query with four tables and then see the results back at the DB viewer screen here. Now switching back your BI classic BI tool tableau we have a dashboard we are creating all the same information and displaying the orders in a geographical dashboard

visualize which orders are coming from which countries so this actually pretty quickly we have other statement being run and all of this is now reflected on the dashboard again and the difference is here that we have a virtual view of federating all the sources, still the same Delta search Oracle is free and now in addition to dashboard you can also run a classic report here we are showing customer per country per region and this readout in classic report style visualization and all of this is interactive live queries against the real data coming from all those different sources during query time. So I hope that the session was helpful and you see the power of both those together and I really thank you all for attending.

Watch more Spark + AI sessions here
Try Databricks for free
« back