Skip to main content
Page 1
>
Engineering blog

Announcing the State Reader API: The New "Statestore" Data Source

Databricks Runtime 14.3 includes a new capability that allows users to access and analyze Structured Streaming 's internal state data: the State Reader...
Engineering blog

PySpark in 2023: A Year in Review

With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use...
Engineering blog

Simplify PySpark testing with DataFrame equality functions

The DataFrame equality test functions were introduced in Apache Spark™ 3.5 and Databricks Runtime 14.2 to simplify PySpark unit testing. The full set...
Engineering blog

A Deep Dive into the Latest Performance Improvements of Stateful Pipelines in Apache Spark Structured Streaming

This post is the second part of our two-part series on the latest performance improvements of stateful pipelines. The first part of this...
Engineering blog

Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming

Introduction Apache Spark™ Structured Streaming is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the...
Engineering blog

Lakehouse Monitoring: A Unified Solution for Quality of Data and AI

Introduction Databricks Lakehouse Monitoring allows you to monitor all your data pipelines – from data to features to ML models – without additional...
Engineering blog

Python Dependency Management in Spark Connect

November 14, 2023 by Hyukjin Kwon and Ruifeng Zheng in Engineering Blog
Managing the environment of an application in a distributed computing environment can be challenging. Ensuring that all nodes have the necessary environment to...
Engineering blog

Named Arguments for SQL Functions

Today, we introduce the new availability of named arguments for SQL functions. With this feature, you can invoke functions in more flexible ways...
Engineering blog

Introducing Python User-Defined Table Functions (UDTFs)

Apache Spark™ 3.5 and Databricks Runtime 14.0 have brought an exciting feature to the table: Python user-defined table functions (UDTFs). In this blog...
Engineering blog

Arrow-optimized Python UDFs in Apache Spark™ 3.5

In Apache Spark™, Python User-Defined Functions (UDFs) are among the most popular features. They empower users to craft custom code tailored to their...