What is Spark SQL?

How Spark SQL lets you query structured data with familiar SQL while leveraging Spark's performance, scalability, and data ecosystem integration

by Databricks Staff

Spark SQL is the Spark module that lets you query large structured datasets with familiar SQL while taking advantage of Spark performance and scale.
Spark SQL uses DataFrames, a cost-based optimizer, columnar storage, and code generation to run SQL queries quickly across clusters.
Spark SQL integrates with the rest of the Spark ecosystem so you can mix SQL with machine learning and other analytics in one application on the Databricks lakehouse.

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).

What is Apache Spark SQL?

Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark’s distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these abstractions makes it easy for developers to intermix SQL commands that query external data with complex analytics, all within a single application. Concretely, Spark SQL lets developers:

Import relational data from Parquet files and Hive tables
Run SQL queries over imported data and existing RDDs
Easily write RDDs out to Hive tables or Parquet files
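
The three capabilities above can be sketched in PySpark roughly as follows. This is a minimal illustration rather than a reference implementation: the function name `top_users`, the view name `events`, and the columns `user_id` and `n_events` are hypothetical, and the caller is assumed to supply an existing `SparkSession` plus input and output paths.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; PySpark is needed at call time
    from pyspark.sql import DataFrame, SparkSession


def top_users(spark: "SparkSession", in_path: str, out_path: str) -> "DataFrame":
    """Import Parquet data, query it with SQL, and write the result back out."""
    # 1. Import relational data from Parquet files.
    events = spark.read.parquet(in_path)

    # 2. Expose the DataFrame to SQL under a (hypothetical) view name.
    events.createOrReplaceTempView("events")

    # 3. Run a SQL query over the imported data.
    result = spark.sql(
        """
        SELECT user_id, COUNT(*) AS n_events
        FROM events
        GROUP BY user_id
        ORDER BY n_events DESC
        LIMIT 10
        """
    )

    # 4. Write the result out as Parquet, replacing any earlier output.
    result.write.mode("overwrite").parquet(out_path)
    return result
```

With a session obtained from `SparkSession.builder.getOrCreate()`, a call might look like `top_users(spark, "/data/events", "/data/top_users")`, where both paths are placeholders. An existing RDD can be brought into the same flow by converting it with `spark.createDataFrame(rdd)` before registering the view, and Hive tables can be targeted with `result.write.saveAsTable(...)` when Hive support is enabled.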

Spark SQL also includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance, so you do not need a separate engine for historical data.
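
One way to see what the optimizer decided for a given query is to ask Spark SQL for its plan. The helper below is a small sketch under assumptions: `explain_query` is a hypothetical name, the caller passes an existing `SparkSession`, and the Spark version is assumed to accept the `EXPLAIN FORMATTED` statement (Spark 3.0 or later).

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; PySpark is needed at call time
    from pyspark.sql import SparkSession


def explain_query(spark: "SparkSession", query: str) -> str:
    """Return the plan Spark SQL produced for `query` as plain text."""
    # EXPLAIN yields a one-row, one-column DataFrame holding the plan text.
    plan_rows = spark.sql(f"EXPLAIN FORMATTED {query}").collect()
    return plan_rows[0][0]
```

Printing the returned string for a query such as `SELECT user_id, COUNT(*) FROM events GROUP BY user_id` (table and column names are placeholders) shows the physical operators Spark selected, which is a reasonable starting point when investigating why a query is slower than expected.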
