Jacques Nadeau is the CTO and co-founder of Dremio. He is also the PMC Chair of the open source Apache Arrow project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Apache Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and co-founder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).
June 23, 2020 05:00 PM PT
In the era of microservices and cloud apps, it is often impractical for organizations to physically consolidate all data into one system. Apache Arrow is an open source, columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location.
As companies continue to embrace modern architectures based on microservices and cloud applications, it has become increasingly difficult to physically consolidate all data into a single system. In a world where data is extremely fragmented and users expect instant results, the age-old approach of constructing and maintaining ETL pipelines can be prohibitively cumbersome and expensive. Apache Arrow is an open source project, initiated by over a dozen open source communities, that provides a standard columnar in-memory data representation and processing framework. Arrow has emerged as a popular way to handle in-memory data for analytical purposes.
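To make the phrase "columnar in-memory data representation" concrete, here is a minimal sketch in plain Python. It does not use Arrow or its API. The records, the column names (`user_id`, `latency_ms`), and the helper `to_columns` are all hypothetical, chosen only to illustrate the idea: each column is stored as one contiguous typed buffer, an analytical scan reads only the column it needs, and a second consumer can view the same buffer without copying it.

```python
from array import array

# Hypothetical sample records in row-oriented form (one dict per record).
rows = [
    {"user_id": 1, "latency_ms": 12.5},
    {"user_id": 2, "latency_ms": 7.0},
    {"user_id": 3, "latency_ms": 20.5},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),        # signed 64-bit ints
        "latency_ms": array("d", (r["latency_ms"] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# An analytical scan touches only the column it needs, not whole records.
mean_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])

# A second consumer can look at the same memory without copying it.
shared_view = memoryview(columns["latency_ms"])

print(f"mean latency: {mean_latency:.2f} ms over {len(shared_view)} values")
```

Arrow itself standardizes far more than this sketch covers (for example, how the layout is shared across languages and processes), but the same two properties, column-wise storage and sharing without copies, are what the abstract refers to.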
In the last year, Arrow has been embedded into a broad range of open source and commercial technologies, including GPU databases, machine learning libraries and tools, execution engines, and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, Pandas, R, Spark). In this talk, we provide an overview of Arrow and outline how several open source projects are using it to achieve high-performance data processing and interoperability across systems. For example, we demonstrate a 50x speedup in PySpark (Spark-Pandas interoperability). We then show how companies can use Arrow to let users access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository.