The Public Preview of Apache Spark MLlib (Python) and Optuna on serverless notebooks and jobs, as well as standard clusters, brings distributed machine learning to Databricks’ unified compute environments, combining performance, security, and ease of collaboration without the need for dedicated clusters.
Until now, distributed ML workloads such as training with Apache Spark MLlib or hyperparameter tuning with Optuna could only run on dedicated clusters. While effective, dedicated clusters are single-identity environments (user or group) that lack native fine-grained access control (FGAC), limiting secure multi-user collaboration.
With this release, Databricks extends distributed ML capabilities to both serverless and standard clusters, allowing teams to scale their machine learning workloads with built-in security and governance.
These enhancements complement existing single-node ML support, including Scikit-learn, XGBoost, and LightGBM, delivering a unified, end-to-end machine learning experience across all Databricks compute options.
Databricks users can now run distributed ML workloads on both serverless and standard clusters, including:

- Distributed model training with Apache Spark MLlib (Python)
- Distributed hyperparameter tuning with Optuna
Together, these capabilities unify the ML experience, enabling teams to scale seamlessly from local experimentation to distributed production workloads.
Databricks’ Lakeguard technology, built on Spark Connect, powers both standard and serverless compute with FGAC and multi-user isolation. This helps ensure that data and workloads are protected under the same governance layer, whether you manage your own clusters or rely on serverless compute.
Key benefits include:
These capabilities were introduced in Spark 4 and are now integrated into Databricks, delivering the next generation of distributed machine learning for modern data teams.
Hozumi Nakamo, Product Manager at SAP, shared:
"Apache Spark MLlib's availability on Databricks serverless compute empowers SAP Databricks customers to scale machine learning without infrastructure headaches, making it easy to unlock insights from business data securely and efficiently."
This reflects how Databricks serverless compute simplifies distributed ML — allowing customers to focus on insights rather than infrastructure.
This milestone reflects Databricks’ continued collaboration with the open source community, including work with NVIDIA, a long-time contributor to Apache Spark. Together, Databricks and NVIDIA expanded Spark ML to Spark Connect as part of the Spark 4 release, enabling distributed ML workloads to run efficiently on both standard and serverless compute.
Andrew Feng, Vice President of GPU Software at NVIDIA, shared:
"Spark Connect represents a new era of accessibility and ease of adoption for Spark users. NVIDIA has been active in the open source Spark community for more than seven years. By extending Spark MLlib with support on Spark Connect, enterprises can now achieve effortless, end-to-end GPU acceleration with no code changes - delivering breakthrough performance gains of up to 9x while reducing costs by as much as 80%. This is the architecture we’ve adopted within NVIDIA and have helped enterprises transition to as well. It’s redefining what’s possible with data and AI at scale."
Through this collaboration with NVIDIA and the broader Spark community, Databricks continues to make distributed ML more performant, accessible, and cost-effective for every enterprise.
You can start running distributed ML on Databricks today:
Learn more:
