Data Science Across Data Sources with Apache Arrow


As companies continue to embrace modern architectures based on microservices and cloud applications, it has become increasingly difficult to physically consolidate all data into a single system. In a world where data is extremely fragmented and users expect instant gratification, the age-old approach of constructing and maintaining ETL pipelines can be prohibitively cumbersome and expensive. Apache Arrow is an open source project, initiated by over a dozen open source communities, that provides a standard columnar, in-memory data representation and processing framework. It enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without copying all data into one location, and it has emerged as a popular way to handle in-memory data for analytical purposes.

In the last year, Arrow has been embedded into a broad range of open source (and commercial) technologies, including GPU databases, machine learning libraries and tools, execution engines and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, Pandas, R, Spark). In this talk, we provide an overview of Arrow, and outline how several open source projects are utilizing it to achieve high performance data processing and interoperability across systems. For example, we demonstrate a 50x speedup in PySpark (Spark-Pandas interoperability). We then show how companies can utilize Arrow to enable users to access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository.


Video Transcript

– So, thank you for joining this session. It’s great to be here at the Spark and AI Summit 2020. Even though this is virtual, it’s still a pleasure to be here.

This talk will be about data science across data sources with Apache Arrow.

So, first, I’d like to give you a little bit of background about myself. My name is Tomer, I’m the co-founder and chief product officer at Dremio. My email is here on the left, so feel free to reach out to me after the session, or at any time, if you have any questions or suggestions. Just a few words about the company: Dremio is the data lake engine company. We enable lightning-fast queries directly on data lake storage like S3 and ADLS, and we make the data consumable with a semantic layer. We are now powering the cloud data lakes for many of the world’s largest companies and best-known brands.

We also co-created an open source project called Apache Arrow, which we will be talking about extensively in this presentation. We’ll cover that in a second. The company has raised over 100 million dollars, and you can see here examples of the investors that have invested in Dremio.

All right, so the problem that we’re talking about here is that data lake storage is exploding. What I mean by that is that systems like S3 and ADLS have become the default bit bucket in the cloud. So, when companies move to the cloud, or even if they’re born in the cloud, all data, or almost all data, tends to start in these systems, in S3 and ADLS. That’s the first place where people put their data. And the reason for that is that these systems are infinitely scalable and extremely inexpensive, about $20 a terabyte per month. You only pay for what you use, and you don’t have to worry about high availability or disaster recovery. It just works. And so, as a result, most cloud architectures, most application architectures, involve landing the data first in something like S3. And, as a result, those systems are growing exponentially. The problem is that if you’ve ever tried one of the typical SQL engines that are out there, things like Hive, Presto, et cetera, you’re probably very well aware that querying the data directly in data lake storage is not a great experience. It’s not very fast, you end up having to go through all sorts of gymnastics, and even then you really can’t get the BI users to be happy with the experience.

So what typically happens as a result of that poor experience is that folks start building this really complex architecture. If you look at this slide, you can see the users at the top; we call them the data consumers. These range from people using visual tools like Tableau and Power BI all the way to more technical users who like to write SQL by hand, or maybe use tools like Jupyter Notebooks. And here at the bottom are the sources of data: the data gravity being in a data lake storage system like S3, and then perhaps some other sources as well. So, what do companies do? Well, first of all, they start selecting a subset of that data to move into a data warehouse. Instead of querying the data directly, which they can’t do efficiently and with high performance, they copy some data into Redshift or Snowflake, or if it’s on-prem maybe Teradata. And of course that’s hard to do. And as we all know, those data warehouses are very proprietary and very expensive as well.

And then getting the data into the data warehouse that actually often isn’t the end of the story, right? That’s really just the beginning.

The next step is that the company will often have to move the data, or a subset of the data, from the data warehouse into additional systems such as cubes, BI extracts, and aggregation tables. Those are the things you see here at the top. And finally, once you’ve built these cubes and these extracts, the user at the top who’s using a BI tool can have a good experience on a little sliver of data. They can connect to that one cube and things are pretty fast. But they’ve lost all flexibility in that process. They can’t then go and join whatever’s in the cube with the big table that’s in S3 and maybe cataloged in the Glue catalog. They can’t make changes to things. It’s very, very limited, so you’ve lost that agility and flexibility. And then of course, from an IT standpoint, from an engineering standpoint, you’ve had to go and build this really complex ETL, which is represented with these lines here. And that’s often millions of dollars. And it’s not just building it, you have to maintain it; you have to keep that thing up and running. People come in the morning and their datasets didn’t get refreshed and they start screaming, and so it just becomes a nightmare. And often, in addition to that, the data engineering team gets bogged down in reactive work; they’re constantly having to help the analysts at the top here do their work. So, this is not a great architecture at all. I describe it as a workaround; really, it’s being done only because the data could not be queried fast enough, and in an easy enough way, directly from where it was. So, what we’ve done at Dremio is build a system that enables us to eliminate that light gray triangle that you just saw, and really, for the first time, be able to query data directly in data lake storage with interactive performance. If you compare it to other dedicated SQL engines that are out there, like Presto, for example, or Athena, you’ll see an order of magnitude improvement in performance or more. And when I say more, I’m especially talking about BI workloads, where you could see many orders of magnitude performance improvement. So how does this work?

Dremio is powered by Apache Arrow. What I’ll talk about now is what the Apache Arrow project is and why we created it; the technical ways in which we’re using it to enable high performance, both for the I/O part of query execution and for CPU efficiency; and then some of the new initiatives in the Apache Arrow project and how they’re going to change the way we think about building these open architectures and these data lakes, not just for users of Dremio, but for the industry in general. So let’s start with the Apache Arrow project in general. What is it, and why did we create it, or co-create it, a few years ago?

What is Apache Arrow?

So Arrow is a columnar in-memory representation, which means that the data is represented column by column in memory, as opposed to the row-based approach that most systems out there use in memory. We’re all familiar with standard columnar representations on disk, things like Parquet and ORC files. The reason we introduced Arrow is that we wanted there to be an equivalent of that designed for in-memory data. And so that’s what Apache Arrow is. It has significant contributions from the community: over 300, maybe 400, developers have contributed across many different companies, ranging from ourselves to the Python community, folks like Wes McKinney, Google, Microsoft, Intel, et cetera. And as a result, we now have a lot of infrastructure in this project, including over 10 different language bindings. So you can use Arrow from the core languages of Java, C++, and Python, but also from Rust, R, and other languages as well. And if you go to the Apache Arrow website, there’s a “Powered by Arrow” page, and you’ll see a variety of different companies and technologies that are now leveraging Arrow in one form or another. Maybe they’re not doing it to the extent that we are; at the end of the day, Arrow was born out of Dremio’s internal memory format, so we use Arrow everywhere in the engine. But many of these technologies have adopted Arrow for various purposes, maybe just to transfer some data on the edge or things like that. This broad industry adoption is very significant in terms of where this project is heading.
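To make the columnar layout described above concrete, here is a minimal PyArrow sketch; the dataset and column names are invented for illustration:

```python
import pyarrow as pa

# A few illustrative taxi-style columns; the data itself is made up.
# Each column lives in its own contiguous memory buffer, rather than
# interleaving values row by row the way a row-oriented engine would.
trips = pa.table({
    "vendor_id": pa.array([1, 2, 1, 2], type=pa.int32()),
    "passenger_count": pa.array([1, 3, 2, 1], type=pa.int8()),
    "trip_distance_mi": pa.array([1.2, 0.5, 3.7, 8.1], type=pa.float64()),
})

# An analytical operation can scan one column without touching the others.
print(trips.column("trip_distance_mi"))
print(trips.schema)
```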

All right, one of the really interesting things about Apache Arrow is its exponential growth. You can actually see a little dip in this graph a few months ago, and I think that’s when this whole pandemic started; folks were probably more worried about not catching the virus and figuring out how to work from home. But if you look at the graph, we’ve grown from very small numbers two years ago to over 13 million downloads per month in the last month. So you can see this massive exponential growth, with many millions of downloads. In fact, if you are using Python to do any kind of data science, or you have people in your organization who are using Python to do any kind of data analysis, chances are they’re already using PyArrow, for example. So this is very widespread, and used today by, at a minimum, probably every data scientist in your organization.

So I wanna talk about a couple of things that we’ve done internally within our technology leveraging Arrow, just to show some examples of how Arrow is helping from a performance standpoint. As I mentioned, Dremio is internally based on Apache Arrow; everything we do in our engine is in the Apache Arrow format. The first thing we’ve developed, which is also something that we’ve contributed to the Apache Arrow project, is called Gandiva. Gandiva is a compiler, and it is based on LLVM. What Gandiva basically does is take a SQL expression, not a whole query but just an expression, and translate it into machine code that is highly optimized for running in a vectorized way inside an Intel CPU, to leverage the latest vectorized SIMD instructions in the CPU. So basically what happens here is we’re leveraging LLVM. If you’re not familiar with LLVM, it’s a compiler; it’s used by many companies, and I think a lot of it was originally built by Apple. The expression first gets compiled, using something called IR Builder, into an intermediate representation, what’s called IR, a kind of intermediate bytecode representation. We then merge that with the bytecode representation of the function library that we’ve created and contributed to the open source community. Those two things get merged, and then LLVM goes through a number of optimization phases; I’m not talking about SQL optimization, I’m talking about low-level hardware and CPU optimization here. And that produces the vectorized execution kernel. So by the time Gandiva is done compiling, which takes a matter of milliseconds for an expression, we have something that can run extremely efficiently over these in-memory buffers in the Apache Arrow format. It takes an Arrow buffer, or an Arrow vector, runs it through that execution kernel, produces the output, then takes the next vector and produces another vector. That’s basically how our projections and filters work internally in the engine. So just a little bit more about Gandiva: this chart shows you some of the performance enhancements that we’ve seen. Prior to developing this technology, our runtime code generation was actually in Java, as it is in many other systems out there, and you can see the performance improvement that this innovation has provided. The tests here happen to be a number of tests provided by one of our customers that wanted their queries to be faster: various projections and queries with case statements, so nothing too fancy, just very basic projections and filters. You can see the time that it took with the previous Java runtime compilation capability, and the amount of time that it takes with this LLVM-based solution called Gandiva. The performance improvement is anywhere from 4.66x to almost 90x. So going from that Java approach, which was also, by the way, runtime code generation and compilation, to this technology has provided significant benefit. And again, this is part of the Apache Arrow project.
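Gandiva itself is typically driven from Arrow’s Java or C++ libraries rather than Python, so as a stand-in, this sketch uses PyArrow’s precompiled compute kernels to illustrate the same idea of evaluating an expression one whole column at a time over Arrow buffers; the table and values are invented:

```python
import pyarrow as pa
import pyarrow.compute as pc

# A hypothetical fares table; column names are invented for illustration.
fares = pa.table({
    "fare": pa.array([10.0, 52.5, 7.25, 31.0]),
    "tip": pa.array([2.0, 10.0, 0.0, 5.0]),
})

# Evaluate the expression "fare + tip > 20.0" as vectorized kernels:
# each step consumes Arrow columns and produces a new Arrow array,
# rather than looping over individual rows.
total = pc.add(fares["fare"], fares["tip"])
mask = pc.greater(total, 20.0)
print(fares.filter(mask))
```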
Another example of the way in which we are leveraging Apache Arrow to make things faster is our columnar cloud cache, which is all about accelerating I/O. If you look at this picture on the left, Dremio is an execution engine that actually uses multiple engines. Contrast this with the traditional SQL engines that were developed for on-prem, like Presto and Hive, et cetera. Those systems basically run one single cluster; we think of that as one static engine. You size it for your peak workload and that’s what you’re running, and then there’s just some elasticity, you can grow and shrink. But you don’t have workload isolation; you don’t have the ability to have different engines that start and stop automatically. What we do at Dremio is we have a concept of engines. If you’re running on AWS, you can have a medium engine, an extra large engine, different engines for different workloads, and that drastically reduces your EC2 bill. So the idea there is really to reduce the cloud infrastructure costs and also to provide isolation between different workloads. Each of these engines is basically a set of EC2 instances; at the end of the day, they are computers running on EC2 instances. And those EC2 instances now come with ephemeral SSDs, or NVMes, on each instance. So we’re taking advantage of these local ephemeral NVMes on these engines to cache data that’s being accessed. One of the great things about using something like S3 is that it’s infinitely scalable and inexpensive, and you have that separation between compute and storage. But one of the downsides is that you have high latency, because the storage is not local. We’ve completely solved that problem by introducing this caching layer; it’s a distributed cache that runs across the engine.

It leverages that local NVMe to cache data that’s being accessed on S3. If you think about it, most cells in a dataset, most columns in a Parquet file, are accessed by more than one query; they don’t just get accessed once. And if that’s the case, there’s no reason to go back to S3 the second time. If the system is smart enough to schedule the query execution in consistent ways and leverage that locally cached version, then the next time that data needs to be accessed, perhaps by a very different query, you get a lot of performance benefit. And what we’ve done here is not only cache the blocks of the Parquet file on S3, for example; beyond that, in some cases we also cache the data in an Apache Arrow representation, in addition to, or instead of, the Parquet representation. For many workloads, especially high-concurrency BI workloads where you have hundreds of concurrent queries, maybe you’re running BI dashboards, you don’t want to have to do the decompression and deserialization of the Parquet file every time you run a query. So by caching the data in the Arrow format, which for us means our internal memory format, because again, Dremio uses Arrow internally, we can bypass the need to decompress and deserialize the Parquet columns at query execution time. This provides a lot of benefit from an I/O standpoint, both by eliminating the need to go to S3 every time, and by eliminating the need to decompress the Parquet columns every time the data is accessed. That’s a significant speedup, and it also reduces the costs you have to spend with the cloud provider, because you’re no longer paying for the S3 GETs and PUTs, which is often about 15% of the cost of a query, amortized. And you’re also saving on CPU cycles, which of course means smaller, less expensive compute infrastructure to pay for.
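Dremio’s C3 implementation is internal to the product, but the read-through pattern described here can be sketched. Everything in this snippet, the cache directory, the key scheme, and the helper name, is hypothetical:

```python
import os
import hashlib

CACHE_DIR = "/mnt/nvme/c3-cache"  # hypothetical local ephemeral NVMe mount

def cached_read(s3_client, bucket, key, start, length):
    """Read a byte range of an S3 object, serving repeat reads from local NVMe."""
    cache_key = hashlib.sha256(
        f"{bucket}/{key}:{start}:{length}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, cache_key)
    if os.path.exists(path):             # cache hit: no S3 round trip, no GET cost
        with open(path, "rb") as f:
            return f.read()
    resp = s3_client.get_object(          # cache miss: ranged GET against S3
        Bucket=bucket, Key=key, Range=f"bytes={start}-{start + length - 1}")
    data = resp["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:           # persist for the next query that needs it
        f.write(data)
    return data
```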

All right, so those are two examples of how we’re using the Apache Arrow project internally to make things faster. But most of our customers are not just using Dremio. The beauty of the open data platform, the open architecture, is that you’re able to use lots of different technologies; you can choose best-of-breed services and best-of-breed technologies to process the same data. This is very different from what the data warehouse vendors are advocating, which is, “Put all your data in my data warehouse, this proprietary thing, and yeah, you’ll be locked in, but there’ll be benefits to that.” We fundamentally believe in a very different approach, a much more open approach, in which the storage is provided by the cloud providers, things like S3 and ADLS on Azure, and the data is stored in open formats. By open formats, I mean, first of all, file formats that are open source, things like Parquet and ORC, and of course JSON and text-delimited files as well. And then the table representations on top of these files are also open. So you have things like Glue and the Hive Metastore, which can be accessed by many different engines and technologies, and more recently you have things like Delta Lake and Iceberg providing transactions and mutations of data.
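To make the open-formats idea concrete, here is a small PyArrow sketch, with an illustrative file path and columns, that writes a Parquet file which any Parquet-aware engine (Dremio, Spark, Hive, et cetera) can then read:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "city": ["NYC", "SF", "LA"]})

# Write an open, columnar file format that any Parquet-aware engine can read.
pq.write_table(table, "trips.parquet")

# Read it back into Arrow memory; column pruning reads only what's needed.
subset = pq.read_table("trips.parquet", columns=["city"])
print(subset)
```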

So once you have this common storage layer and an open data representation, you can use a variety of different compute engines. Some of these exist today, and others will be invented next year and in subsequent years, and you’ll be able to use those on the data as well. That’s the beauty of the open data architecture: you can pick and choose the types of systems that you’re using. I’ve put a few examples on this slide; this is by no means all the systems that are out there. At Dremio, our focus obviously is enabling you to do interactive SQL and BI on data lake storage. Databricks of course provides an amazing system for data scientists and data engineers. Amazon provides a system called Athena, which is great if you’re running what I call occasional SQL: once in a while you run a query, and you only pay for the amount of data scanned. So if you’re gonna use it for BI dashboards, it’s gonna be 10 times more expensive than Dremio, but if you’re using it to run a few queries a day, then it’s probably a fine system, especially if you don’t care as much about performance. And then you have EMR, which is a collection of different big data or Hadoop services managed in clusters; that’s also an Amazon service, and it’s been around for a very long time. And at the top here, you see various different tools. So this is all great; most companies I talk to are big believers that this is the future, the open data platform. But let’s talk about one of the things we need to do in order to make this work even better in the future. If you think about all these different systems here, you have the storage, the data, the compute, the clients, and maybe some other systems, which you see on the right here, like databases.

There’s a need to move data around between these different layers, and even within some of the layers, and to do that very fast. So one example here is, I’m a data scientist, I’m doing something in a Jupyter Notebook, I want to get the results from a query that I’m running in Dremio, and I wanna get that really fast. I don’t want to wait a long time for that. Another example here is you think about two different things inside of the…

Another example here is I have data in a data warehouse or a relational database, and I want to pull data from that system very fast. So we’ve started, for example, working with Microsoft on exposing an Arrow interface, and I’ll talk about what that kind of thing looks like as well. Another example of where we would want really fast communication is between different compute engines. For example, I want to move data between Dremio and Databricks, and I wanna move it very, very fast. I don’t necessarily want to have to write it to something like S3 or Delta Lake, et cetera; maybe I want to move that in memory.

And then finally, sometimes I just have to pull the data from the storage service, and I want to be able to do that even faster. That’s especially true when you start thinking about services like S3 Select and the query acceleration capability in Azure Storage, which do some filtering, some projections, and then return the data. So having a very fast way to return data to a client, potentially a distributed client, is also something that can help.
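As one concrete example of that storage-side filtering, here is a minimal boto3 sketch of S3 Select; the bucket, object key, and columns are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to filter and project server-side, returning only matching rows.
resp = s3.select_object_content(
    Bucket="my-bucket",                    # hypothetical bucket
    Key="taxi/trips.csv",                  # hypothetical object
    ExpressionType="SQL",
    Expression="SELECT s.vendor_id, s.trip_distance FROM s3object s "
               "WHERE CAST(s.trip_distance AS FLOAT) > 10",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:              # stream of result chunks
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```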

So you have all these needs to move data around between different parts of this open data architecture, and to move it extremely fast. So how do we do that? One of the things we’ve added, with the community, to the Arrow project is something called Arrow Flight. Arrow Flight is an Arrow-based RPC interface. Apache Arrow has become this industry-standard way to represent data in memory in a columnar format; it’s downloaded 13 million times a month and used by thousands of companies, basically every data scientist out there, and a lot of databases.

It only makes sense to take advantage of the fact that it is so pervasive and so efficient, and make it the foundation of a new RPC interface, a new high-performance wire protocol, where data can be transferred in parallel streams between different systems. Arrow Flight supports both client-to-cluster communication and parallelized cluster-to-cluster communication, which I’ll show you more of in a second. So, a very simple example; I felt like I had to have some code in this presentation. Here it is: an example, using Python, of how a single client, say on your laptop, would communicate with a system that is exposing an Arrow Flight endpoint.

Arrow Flight Python Client

So you can see here on the left a visual representation of a flight. A flight is essentially a collection of streams, and each stream is represented by an endpoint that you can go to, with a ticket, to ask for that stream. What we’re doing here is, first of all, leveraging the PyArrow project, and within that we have the flight package. Then we are creating a client; in this case, I’m connecting to an endpoint on my localhost at port 47470. I am providing the SQL command as the descriptor. In Flight, you can have any descriptor, basically; in this case I’m using a SQL query just as an example. That gives us a flight descriptor. The next thing you see is that we’re taking that flight descriptor and getting back what’s called a flight info, so fi. That flight info contains the ticket that I can then use to read the data. So then I go to that endpoint. This is a very simple example where there’s actually only one endpoint, so I’m not traversing the different endpoints. But a different implementation of Flight might be parallelized, and then I’d have to go through all the different endpoints and provide the right ticket to each one.
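The slide’s code isn’t reproduced in the transcript, but a reconstruction of the single-endpoint flow just described, against the pyarrow.flight API, would look roughly like this (the query and dataset name are illustrative):

```python
from pyarrow import flight

# Connect to a server exposing an Arrow Flight endpoint (port from the talk).
client = flight.FlightClient("grpc://localhost:47470")

# Use a SQL command as the flight descriptor; Flight allows any opaque
# descriptor, a SQL query is just one example.
descriptor = flight.FlightDescriptor.for_command(
    b"SELECT * FROM taxi.trips LIMIT 1000000")  # hypothetical dataset name

# The flight info lists the endpoints (and their tickets) making up the flight.
fi = client.get_flight_info(descriptor)

# Single-endpoint case: redeem the one ticket and read the entire stream.
reader = client.do_get(fi.endpoints[0].ticket)
table = reader.read_all()  # an Arrow Table; call .to_pandas() for a DataFrame
print(table.num_rows)
```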

So what you see here is the flow of messages you get when you have a single client and a cluster. The cluster is exposing the Arrow Flight interface the way we do at Dremio; this is currently in public preview. In this case, the client is obtaining the flight info from the coordinator. Think of the coordinator as the quarterback in our system. That flight info contains an array with all the endpoints, and an endpoint is basically a location, where do I go connect, what node and what port, and of course the ticket to use to access it. So in this case, there are three endpoints, sitting at location two, location three, and location four.

Each of those has a ticket: you see ticket one, ticket two, ticket three. So once the client has provided the SQL statement and obtained the flight info, it then goes, in parallel, to each of the executors and requests the stream of data coming back. This allows Dremio to return data in parallel to a client from all the Dremio nodes.
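Sticking with the pyarrow.flight API, a client could fan out across all the endpoints with something like the following sketch; it reuses the `fi` flight info from the snippet above and assumes each endpoint advertises a single location:

```python
import pyarrow as pa
from pyarrow import flight
from concurrent.futures import ThreadPoolExecutor

def read_endpoint(endpoint):
    # Connect to the executor that owns this stream and redeem its ticket.
    client = flight.FlightClient(endpoint.locations[0])
    try:
        return client.do_get(endpoint.ticket).read_all()
    finally:
        client.close()

# Pull all the streams concurrently, then stitch them into one Arrow table.
with ThreadPoolExecutor(max_workers=len(fi.endpoints)) as pool:
    tables = list(pool.map(read_endpoint, fi.endpoints))
result = pa.concat_tables(tables)
print(result.num_rows)
```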

Another example, and this is currently being developed, is if you have a Spark cluster, for example, and you want to pull data from a Dremio cluster; this could be any two clusters, I’m just using this as one example. You can have parallelism here as well. The Spark context communicates with the Dremio coordinator, making a request based on the definition of a SQL query, and obtains the endpoints and the ticket for each endpoint. Then the different workers in the Spark job request the data and get a live stream of data from the different executors. So this allows you to have a parallel stream, or a set of streams, running between two different clustered systems and to pull data, basically saturating the network between the two systems. That’s what will soon allow clusters to communicate with each other.
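That Spark connector was still being developed at the time of this talk, so purely as a sketch of the pattern, with the coordinator address and query invented, the driver could plan the flight and hand each (location, ticket) pair to a worker:

```python
from pyarrow import flight
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Driver side: plan the flight against the coordinator (hypothetical address).
planner = flight.FlightClient("grpc://dremio-coordinator:47470")
fi = planner.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM taxi.trips"))

# Ship plain (uri, ticket-bytes) pairs, which serialize cleanly to workers.
def as_str(uri):
    return uri.decode() if isinstance(uri, bytes) else uri

work = [(as_str(ep.locations[0].uri), ep.ticket.ticket) for ep in fi.endpoints]

def read_stream(item):
    # Worker side: each partition pulls one stream straight from one executor.
    uri, ticket_bytes = item
    client = flight.FlightClient(uri)
    table = client.do_get(flight.Ticket(ticket_bytes)).read_all()
    return table.to_pylist()  # simplification; a real connector stays columnar

rows = spark.sparkContext.parallelize(work, len(work)).flatMap(read_stream)
print(rows.count())
```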

So with that, what I want to do now is jump into a quick demo. What I’ll show you in this demo is, first of all, the Dremio interface and how we run a query, and then what Arrow Flight does to the ability to query the data. I’ll show you what the performance looks like when you’re querying the data through something like ODBC, then through a very high-performance ODBC interface called turbodbc, and then with Arrow Flight. And I’m actually only gonna show you the performance of Arrow Flight without the parallelism, but even that alone, you’ll see how significant the performance difference is. All right, so let’s get started with the demo. Okay, so what I want to show you here is a demo of how Dremio leverages Apache Arrow to achieve higher performance, and also how Arrow Flight is being used to provide a faster interface for client applications. Let’s start with a very, very simple example. What you’re looking at now is the user interface of a Dremio cluster. This is a very small environment; it actually consists of a single engine, just one four-node engine, and this one is of course running in the cloud. Because this is a demo cluster that’s used for many different things, we actually have it connected to a number of different clouds. You can see it’s connected to my S3 buckets. It’s connected to a bunch of Azure Data Lake Storage, so ADLS, datasets; you can see different folders are basically datasets here. We can connect it to our Hive Metastore, to our Glue catalog, to a Postgres database; we’re connected to one of those here as well. So those are the data sources, and I’m an admin, so I can see all of them right now. Of course, in a real-world scenario, you would actually limit who can see what data. What you have here is the Dremio semantic layer. We organize datasets, or virtual datasets, in spaces. You can see the spaces here, and these virtual datasets allow you to curate the data and present different views to different users. Every user also has their own personal space where they can create their own virtual datasets; they can also upload their own spreadsheets. So, for example, if somebody wants to join a small number of records in a spreadsheet with a massive dataset that exists in an S3 bucket, they can very easily do that without needing any help from data engineering. So let’s look at a simple example, just to see the impact that Arrow has on performance. Typically, if you were to query data in S3 using something like Hive, Presto, or Athena, with an environment of this size and cost, you’d be looking at probably a few minutes per query on the dataset that I’m about to show you. So let’s go in here and see what that looks like with Dremio, because of Apache Arrow. This is a dataset of over a billion records, again, in a very small environment, so we’re not spending a lot here on infrastructure. You can see the vendor ID, the pickup time, the drop-off time, et cetera, the number of passengers, the trip distance in miles. And I’m going to connect to Dremio from Tableau. We have this very nice integration with the key BI tools. Let me close that one here.

So in this case, I’m running Tableau on a Mac, on my laptop here.

So what you see here is Tableau connected with a live connection to Dremio. I’m going to start interacting with this dataset, which will in turn send queries to Dremio. You can see here the number of records; we’re doing a sum, or a count star, of this entire dataset, a billion records. I can aggregate based on the time of the taxi trip, the drop-off time; in this case, by default, it’s aggregating by year. If I want to aggregate by something else, say the week number, I can look at weekly patterns. So you can see here the number of taxi trips per week. And I can look at the trip distance in miles, maybe change that to the average trip distance, which makes a lot more sense than the total. And so you can see here the weekly trends for taxi trips in New York. The key takeaway is that we’re achieving very interactive performance, actually sub-second response times, on all these queries. There are a number of different technologies making that possible, some of them related to Arrow. First, the fact that Dremio is using Arrow internally for execution, and using the Gandiva project rather than Java to execute. Gandiva is using the LLVM compiler, which was created, I believe, largely by Apple, and that’s one aspect that allows us to leverage Intel SIMD instructions as part of the execution much more efficiently. The second role that Arrow is playing here is C3, the columnar cloud cache, which is leveraging Arrow persistence to further accelerate access to S3. Both of these things are among what’s enabling us to achieve very fast response times here, in addition to something we call data reflections. So now let me show you what Arrow Flight, that high-speed wire protocol based on Apache Arrow, and also part of the Arrow project, can enable from a client-access standpoint. I’m going to head back into my browser, and now, instead of using a BI tool, I’m going to use a notebook. Okay, so this is a data science use case, and I’m using Python. I’ll start by importing all these Python packages. You can see the query that I’m gonna use here; it’s very simple, all it’s doing is pulling a million records from this taxi dataset, so just a select star, limit a million, and a few functions to set things up. Basically, we’re going to use three different methods of pulling the data from Dremio. The first is standard ODBC, so just pyodbc. The second is a very high-performance ODBC implementation that actually leverages Arrow internally, believe it or not, called turbodbc. And the third is Arrow Flight, which is now in public preview in Dremio and of course part of the Arrow project. So we’ve defined the three functions; you can see this is the code that we were looking at earlier, leveraging get_flight_info and providing the ticket to obtain the stream. We’re going to run that definition, and then, as a final step, we’re going to run this query three times: once using ODBC, once using turbodbc, and once using Arrow Flight.
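The notebook cells aren’t reproduced in the transcript; a simplified sketch of the three fetch paths, with the DSN, port, and dataset name all hypothetical, might look like this:

```python
import time
import pyodbc
import turbodbc
from pyarrow import flight

SQL = "SELECT * FROM taxi.trips LIMIT 1000000"  # hypothetical dataset

def fetch_pyodbc():
    conn = pyodbc.connect("DSN=Dremio")              # hypothetical DSN
    return conn.cursor().execute(SQL).fetchall()     # row-at-a-time transfer

def fetch_turbodbc():
    cursor = turbodbc.connect(dsn="Dremio").cursor()
    cursor.execute(SQL)
    return cursor.fetchallarrow()                    # Arrow batches over ODBC

def fetch_flight():
    client = flight.FlightClient("grpc://localhost:47470")
    fi = client.get_flight_info(
        flight.FlightDescriptor.for_command(SQL.encode()))
    return client.do_get(fi.endpoints[0].ticket).read_all()

# Time each path on the same query and print the elapsed seconds.
for fn in (fetch_pyodbc, fetch_turbodbc, fetch_flight):
    start = time.time()
    fn()
    print(f"{fn.__name__}: {time.time() - start:.1f}s")
```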

We’re going to plot the time that it takes, in seconds, to run that simple query that just pulls a million records. Now, Arrow Flight doesn’t play an important role when we’re talking about very small datasets. If you’re thinking about a dashboard, you don’t really need this, because maybe you’re pulling all 20 data points for a bar chart. But when we’re talking about a million records, or many millions of records, that I’m trying to read into a data frame, this actually plays a very important role. So you can see here, this came back, and ODBC took about 20 seconds, turbodbc took about seven seconds, and then Flight took about one second. So you can see a significant difference. And Arrow Flight in this case isn’t even parallelized; we’re using the simple version of it, a single client, and it’s achieving orders of magnitude better performance. In fact, if we increase the size of the dataset to 5 million records, Arrow Flight is still gonna be around one or two seconds, and you’ll see these other bars increase significantly. So with that, I’d like to wrap up this demo. Hopefully that gives you a sense of what Apache Arrow is doing to enable very high-performance data processing on data lakes, as well as very high-speed communication and data exchange between systems. So thank you very much for attending this talk, and feel free to reach out again at [email protected]

About Tomer Shiran

Dremio

Tomer Shiran is the CEO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee #5 at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion - Israel Institute of Technology.

About Jacques Nadeau

Dremio

Jacques Nadeau is the CTO and co-founder of Dremio. He is also the PMC Chair of the open source Apache Arrow project, spearheading the project's technology and community. Prior to Dremio, he was the architect and engineering manager for Apache Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and co-founder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).