Apache Kudu and Spark SQL for Fast Analytics on Fast Data


Apache Kudu is a new, open-source storage engine for the Hadoop ecosystem that enables high-speed analytics without imposing data-visibility latency. With Spark and Kudu, it is now easy to build applications that use SQL to query and analyze mutable, constantly changing datasets while getting the query performance you would normally expect from an immutable columnar data format like Parquet. Kudu delivers this with a fault-tolerant, distributed architecture and a columnar on-disk storage format.

This talk provides an introduction to Kudu, presents an overview of how to build a Spark application using Kudu for data storage, and demonstrates using Spark and Kudu together to achieve impressive results in a system that is friendly to both application developers and operations engineers.
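As a rough illustration of what a Spark application backed by Kudu can look like, here is a minimal PySpark sketch. It is not taken from the talk. It assumes the kudu-spark connector is on the Spark classpath, and the master address, table name, and helper function names are hypothetical placeholders.

```python
def kudu_options(master, table):
    """Build the reader options the kudu-spark connector expects.

    `master` and `table` are placeholders supplied by the caller.
    """
    return {"kudu.master": master, "kudu.table": table}


def load_kudu_table(spark, master, table):
    """Load a Kudu table as a Spark DataFrame.

    `spark` is an existing SparkSession. The format string is the
    data source package name of the kudu-spark connector.
    """
    return (
        spark.read.format("org.apache.kudu.spark.kudu")
        .options(**kudu_options(master, table))
        .load()
    )


# Hypothetical usage, assuming a running Kudu cluster and a table "events":
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("kudu-demo").getOrCreate()
#   events = load_kudu_table(spark, "kudu-master:7051", "events")
#   events.createOrReplaceTempView("events")
#   spark.sql("SELECT COUNT(*) FROM events").show()
```

Once the table is registered as a temporary view, it can be queried with ordinary Spark SQL, and rows recently written to Kudu should be visible to subsequent queries.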

Learn more:

  • Building Real-Time BI Systems with Kafka, Spark, and Kudu
  • Five Spark SQL Utility Functions to Extract and Explore Complex Data Types

