Platform blog

Announcing General Availability of Next-Generation Lakeview Dashboards

The next generation of Databricks SQL (DBSQL) dashboards, also known as Lakeview Dashboards, is now generally available on AWS and Azure. This new...
Engineering blog

Creating a bespoke LLM for AI-generated documentation

We recently announced our AI-generated documentation feature, which uses large language models (LLMs) to automatically generate documentation for tables and columns in...
Platform blog

Data Intelligence Platforms

The observation that "software is eating the world" has shaped the modern tech industry. Today, software is ubiquitous in our lives...
Platform blog

Databricks + Arcion: Real-time enterprise data replication to the Lakehouse

We are excited to announce that we have completed our acquisition of Arcion, a leading provider for real-time data replication technologies. Arcion’s...
Platform blog

Introducing Predictive Optimization: Faster Queries, Cheaper Storage, No Sweat

Predictive Optimization intelligently optimizes your Lakehouse table data layouts for peak performance and cost-efficiency - without you needing to lift a finger.
Platform blog

Announcing the Public Preview of Lakeview Dashboards!

We are excited to announce the public preview of the next generation of Databricks SQL dashboards, dubbed Lakeview dashboards. Available today, this...
Engineering blog

Introducing Apache Spark™ 3.5

Today, we are happy to announce the availability of Apache Spark™ 3.5 on Databricks as part of Databricks Runtime 14.0. We extend our...
Company blog

Announcing Databricks Belgrade Development Center

August 1, 2023 by Reynold Xin and Vinod Marur in Company Blog
We are thrilled to announce the opening of Databricks’ latest development center in Belgrade, Serbia. This addition joins our existing R&D centers in...
Company blog

Databricks + MosaicML

Today, we’re excited to share that we’ve completed our acquisition of MosaicML, a leading platform for creating and customizing generative AI models for...
Engineering blog

Introducing English as the New Programming Language for Apache Spark

We are thrilled to unveil the English SDK for Apache Spark, a transformative tool designed to enrich your Spark experience. Apache Spark™...
Engineering blog

Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering

We are excited to announce Delta Lake 3.0, the next major release of the Linux Foundation open source Delta Lake Project, available in...
Engineering blog

Project Lightspeed Update - Advancing Apache Spark Structured Streaming

In this blog post, we will review the advancements in Spark Structured Streaming since we announced Project Lightspeed a year ago, from performance...
Company blog

Welcome Rubicon to Databricks: The Future of AI Storage and Serving Systems

June 13, 2023 by Reynold Xin in Company Blog
We are incredibly excited to announce that the team behind Rubicon is joining Databricks. Founded by large scale infrastructure builders, Akhil Gupta and...
Company blog

Welcome Okera: Adopting an AI-centric approach to governance

For a decade, Databricks has focused on democratizing data and AI for organizations around the world. And since the debut of ChatGPT last...
Engineering blog

Spark Connect Available in Apache Spark 3.4

Last year Spark Connect was introduced at the Data and AI Summit. As part of the recently released Apache Spark™ 3.4, Spark Connect...
Engineering blog

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend...
Company blog

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

Two weeks ago, we released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka...
Company blog

Welcoming Datajoy to Databricks

October 13, 2022 by Reynold Xin, Chao Cai and Adam Conway in Company Blog
We are excited to announce that Datajoy will be joining Databricks to accelerate our mission to unlock the value of data for everyone...
Engineering blog

Introducing Spark Connect - The Power of Apache Spark, Everywhere

At last week's Data and AI Summit, we highlighted a new project called Spark Connect in the opening keynote. This blog post walks...
Company blog

Open Sourcing All of Delta Lake

The theme of this year's Data + AI Summit is that we are building the modern data stack with the lakehouse. A fundamental...
Engineering blog

Connect From Anywhere to Databricks SQL

Today we are thrilled to announce a full lineup of open source connectors for Go, Node.js, Python, as well as...
Engineering blog

Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark

Streaming data is a critical area of computing today. It is the basis for making quick decisions on the enormous amounts of incoming...
Platform blog

Databricks SQL Serverless Now Available on AWS

Databricks SQL Serverless is now generally available. Read our blog to learn more. We are excited to announce the availability of serverless compute...
Company blog

Apache Spark and Photon Receive SIGMOD Awards

June 15, 2022 by Reynold Xin and Matei Zaharia in Company Blog
This week, many of the most influential engineers and researchers in the data management community are convening in-person in Philadelphia for the ACM...
Engineering blog

Introducing Apache Spark™ 3.3 for Databricks Runtime 11.0

Today we are happy to announce the availability of Apache Spark™ 3.3 on Databricks as part of Databricks Runtime 11.0. We want...
Platform blog

Introducing Databricks SQL on Google Cloud – Now in Public Preview

Today we’re pleased to announce the availability of Databricks SQL in public preview on Google Cloud. With this announcement, customers can further...
Platform blog

Announcing General Availability of Databricks SQL

Today, we are thrilled to announce that Databricks SQL is Generally Available (GA)! This follows our earlier announcements about Databricks SQL’s world record-setting...
Company blog

Announcing Databricks Seattle R&D Site

November 22, 2021 by Reynold Xin and Vinod Marur in Company Blog
Today, we are excited to announce the opening of our Seattle R&D site and our plan to hire hundreds of engineers in Seattle...
Platform blog

Evolution of the SQL Language at Databricks: ANSI Standard by Default and Easier Migrations from Data Warehouses

Databricks SQL is now generally available on AWS and Azure. Today, we are excited to announce that Databricks SQL will use the ANSI...
Company blog

Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast!

On Nov 2, 2021, we announced that we set the official world record for the fastest data warehouse with our Databricks SQL lakehouse...
Company blog

Announcing Databricks Engineering Fellowship

November 10, 2021 by Reynold Xin, Alexa Friedman and Sam Shah in Company Blog
We are excited to announce a new program called Databricks Engineering Fellowship to recognize new graduates with exceptional academic achievements or extracurricular impact...
Company blog

Eliminating the Anti-competitive DeWitt Clause for Database Benchmarking

November 8, 2021 by Justin Olsson and Reynold Xin in Company Blog
At Databricks, we often use the phrase “the future is open” to refer to technology; it reflects our belief that open data architecture...
Company blog

Databricks Sets Official Data Warehousing Performance Record

November 2, 2021 by Reynold Xin and Mostafa Mokhtar in Company Blog
Today, we are proud to announce that Databricks SQL has set a new world record in 100TB TPC-DS, the gold standard performance...
Company blog

Simplifying Data + AI, One Line of TypeScript at a Time

October 21, 2021 by Reynold Xin and Matei Zaharia in Culture
Today, Databricks is known for our backend engineering, building and operating cloud systems that span millions of virtual machines processing exabytes of data...
Engineering blog

Introducing Apache Spark™ 3.2

We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to...
Platform blog

New Performance Improvements in Databricks SQL

Databricks SQL is now generally available on AWS and Azure. Originally announced at Data + AI Summit 2020 Europe, Databricks SQL lets you...
Platform blog

Frequently Asked Questions About the Data Lakehouse

Question Index What is a Data Lakehouse? What is a Data Lake? What is a Data Warehouse? How is a Data Lakehouse different...
Engineering blog

How We Achieved High-bandwidth Connectivity With BI Tools

Databricks SQL is now generally available on AWS and Azure. Business Intelligence (BI) tools such as Tableau and Microsoft Power BI are notoriously...
Engineering blog

Introducing Apache Spark™ 3.1

We are excited to announce the availability of Apache Spark 3.1 on Databricks as part of Databricks Runtime 8.0. We want to...
Company blog

Spark + AI Summit Europe is Expanding and Getting a New Name: Data + AI Summit Europe

September 2, 2020 by Ali Ghodsi, Reynold Xin and Matei Zaharia in Company Blog
Back in 2013, we held the first Spark Summit — a gathering of the Apache Spark™ community with leading contributors and production users...
Company blog

Welcoming Redash to Databricks

June 24, 2020 by Reynold Xin in Company Blog
This morning at Spark and AI Summit, we announced that Databricks has acquired Redash, the company behind the popular open source project of...
Company blog

Introducing Apache Spark 3.0

We’re excited to announce that the Apache Spark™ 3.0.0 release is available on Databricks as part of our new Databricks Runtime 7.0...
Platform blog

Evolving the Databricks brand

Some brands start out as, well, brands. A lot of work goes into the concept and painting the picture before the business is...
Engineering blog

What Is a Lakehouse?

Read Building the Data Lakehouse to explore why lakehouses are the data architecture of the future with the father of the data warehouse...
Company blog

Solving the World’s Toughest Problems with the Growing Open Source Ecosystem and Databricks

January 23, 2020 by Reynold Xin in Company Blog
We started Databricks in 2013 in a tiny little office in Berkeley with the belief that data has the potential to solve the...
Company blog

Delta Lake Now Hosted by the Linux Foundation to Become the Open Standard for Data Lakes

October 16, 2019 by Michael Armbrust and Reynold Xin in Company Blog
Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake. At today’s Spark +...
Engineering blog

Introducing Brickchain: Planet-scale Unified Analytics

April 1, 2019 by Bart Samwel and Reynold Xin in Engineering Blog
Today we are excited to announce Brickchain, the next generation technology for zettabyte-scale analytics, by harnessing all the compute power on the...
Engineering blog

Introducing Apache Spark 2.4

November 8, 2018 by Wenchen Fan, Xiao Li and Reynold Xin in Engineering Blog
UPDATED: 11/19/2018 We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0...
Engineering blog

Benchmarking Apache Spark on a Single Node Machine

Apache Spark has become the de facto unified analytics engine for big data processing in a distributed environment. Yet we are seeing more...
Engineering blog

Introducing Apache Spark 2.3

Today we are happy to announce the availability of Apache Spark 2.3.0 on Databricks as part of its Databricks Runtime 4.0. We want...
Engineering blog

Meltdown and Spectre: Exploits and Mitigation Strategies

In an earlier blog post, we analyzed the performance impact of Meltdown and Spectre on big data workloads in the cloud. In...
Engineering blog

Meltdown and Spectre's Performance Impact on Big Data Workloads in the Cloud

Last week, the details of two industry-wide security vulnerabilities, known as Meltdown and Spectre, were released. These exploits enable cross-VM and cross-process...
Company blog

Databricks Cache Boosts Apache Spark Performance

We are excited to announce the general availability of Databricks Cache, a Databricks Runtime feature as part of the Unified Analytics Platform that...
Engineering blog

Benchmarking Big Data SQL Platforms in the Cloud

For a deeper dive on these benchmarks, watch the webinar featuring Reynold Xin. Performance is often a key factor in choosing big data...
Engineering blog

A Vision for Making Deep Learning Simple

Try this notebook on Databricks When MapReduce was introduced 15 years ago, it showed the world a glimpse into the future. For the...
Company blog

Top 5 Reasons for Choosing S3 over HDFS

At Databricks, our engineers guide thousands of organizations to define their big data and cloud strategies. When migrating big data workloads to the...
Company blog

Databricks Runtime 3.0 Beta Delivers Cloud Optimized Apache Spark

May 24, 2017 by Reynold Xin in Company Blog
A major value Databricks provides is the automatic provisioning, configuration, and tuning of clusters of machines that process data. Running on these machines...
Engineering blog

Processing a Trillion Rows Per Second on a Single Machine: How Can Nested Loop Joins be this Fast?

This blog post describes our experience debugging a failing test case caused by a cross join query running “too fast.” Because the root...
Company blog

Databricks and Apache Spark 2016 Year in Review

Spark Summit will be held in Boston on Feb 7-9, 2017. Check out the full agenda and get your ticket before it sells...
Engineering blog

Introducing Apache Spark 2.1

December 29, 2016 by Reynold Xin in Engineering Blog
Spark Summit will be held in Boston on Feb 7-9, 2017. Check out the full agenda and get your ticket before it sells...
Engineering blog

$1.44 per terabyte: setting a new world record with Apache Spark

November 14, 2016 by Reynold Xin in Engineering Blog
We are excited to share with you that a joint effort by Nanjing University, Alibaba Group, and Databricks set a new world record...
Engineering blog

Spark Structured Streaming

Apache Spark 2.0 adds the first version of a new higher-level API, Structured Streaming, for building continuous applications. The main goal is...
Engineering blog

Introducing Apache Spark 2.0

Today, we're excited to announce the general availability of Apache Spark 2.0 on Databricks. This release builds on what the community has learned...
Engineering blog

Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop

When our team at Databricks planned our contributions to the upcoming Apache Spark 2.0 release, we set out with an ambitious goal by...
Engineering blog

Technical Preview of Apache Spark 2.0 Now on Databricks

May 11, 2016 by Reynold Xin in Engineering Blog
For the past few months, we have been busy contributing to the next major release of the big data open source software we...
Engineering blog

The Unreasonable Effectiveness of Deep Learning on Apache Spark

April 1, 2016 by Miles Yucht and Reynold Xin in Engineering Blog
Update: this post is an April Fools joke. It is not an actual project we're working on. For the past three years, our...
Engineering blog

Apache Spark Trending in the Stack Overflow Survey

March 22, 2016 by Reynold Xin in Engineering Blog
Last week, Stack Overflow released the result of their 2016 developer survey. This is one of the most significant surveys in the...
Engineering blog

Apache Spark 2015 Year In Review

To learn more about Apache Spark, attend Spark Summit East in New York in Feb 2016. 2015 has been a year of...
Engineering blog

Announcing Apache Spark 1.6

To learn more about Apache Spark, attend Spark Summit East in New York in Feb 2016. Today we are happy to announce...
Engineering blog

Introducing Apache Spark Datasets

Developers have always loved Apache Spark for providing APIs that are simple yet powerful, a combination of traits that makes complex analysis possible...
Company blog

Announcing an Apache Spark 1.6 Preview in Databricks

Today we are happy to announce the availability of an Apache Spark 1.6 preview package in Databricks. The Apache Spark 1.6.0 release is...
Engineering blog

Apache Spark 1.5.1 and What do Version Numbers Mean?

October 1, 2015 by Reynold Xin in Engineering Blog
The inaugural Spark Summit Europe will be held in Amsterdam on October 27 - 29. Check out the full agenda and get your...
Engineering blog

Apache Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs

To try the new features highlighted in this blog post, download Spark 1.5 or sign up for a 14-day free trial of Databricks today...
Engineering blog

Announcing Apache Spark 1.5

September 9, 2015 by Reynold Xin and Patrick Wendell in Engineering Blog
The inaugural Spark Summit Europe will be held in Amsterdam this October. Check out the full agenda and get your ticket before it...
Company blog

Apache Spark 1.5 Preview Now Available in Databricks

August 18, 2015 by Reynold Xin and Michael Lumb in Company Blog
We are excited to announce that starting today, Apache Spark 1.5.0 is available as a preview in Databricks. Our users can now choose...
Engineering blog

Statistical and Mathematical Functions with DataFrames in Apache Spark

We introduced DataFrames in Apache Spark 1.3 to make Apache Spark much easier to use. Inspired by data frames in R and Python...
Engineering blog

Project Tungsten: Bringing Apache Spark Closer to Bare Metal

April 28, 2015 by Reynold Xin and Josh Rosen in Engineering Blog
In a previous blog post, we looked back and surveyed performance improvements made to Apache Spark in the past year. In this...
Engineering blog

Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More

April 24, 2015 by Reynold Xin in Engineering Blog
Read Rise of the Data Lakehouse to explore why lakehouses are the data architecture of the future with the father of the data...
Engineering blog

Deep Dive into Spark SQL's Catalyst Optimizer

Check out the Why the Data Lakehouse is Your Next Data Warehouse ebook to discover the inner workings of the Databricks Lakehouse Platform...
Engineering blog

Apache Spark 2.0: Rearchitecting Spark for Mobile Platforms

April 1, 2015 by Reynold Xin in Engineering Blog
Yesterday, to celebrate Apache Spark’s fifth birthday, we looked back at the history of the project. Today, we are happy to...
Engineering blog

Introducing DataFrames in Apache Spark for Large Scale Data Science

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When...
Engineering blog

Apache Spark Officially Sets a New Record in Large-Scale Sorting

November 5, 2014 by Reynold Xin in Engineering Blog
A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system...
Engineering blog

Apache Spark the Fastest Open Source Engine for Sorting a Petabyte

October 10, 2014 by Reynold Xin in Engineering Blog
Update November 5, 2014: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the Daytona GraySort...
Engineering blog

Scalable Collaborative Filtering with Apache Spark MLlib

July 23, 2014 by Burak Yavuz and Reynold Xin in Engineering Blog
Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Pyt
Engineering blog

Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark

July 1, 2014 by Reynold Xin in Engineering Blog
With the introduction of Spark SQL and the new Hive on Apache Spark effort (HIVE-7292), we get asked a lot about...
Engineering blog

Spark SQL: Manipulating Structured Data Using Apache Spark

Read Rise of the Data Lakehouse to explore why lakehouses are the data architecture of the future with the father of the data...
Engineering blog

AMPLab updates the Big Data Benchmark

February 12, 2014 by Ahir Reddy and Reynold Xin in Engineering Blog
The AMPLab at UC Berkeley, with help from Databricks, recently released an update to the Big Data Benchmark. This benchmark uses Amazon...