
Managing an application's environment in a distributed computing setting can be challenging. Ensuring that all nodes have the necessary environment to execute code, and determining the actual location of the user's code, are complex tasks. Apache Spark™ offers various methods such as Conda, venv, and PEX (see also How to Manage Python Dependencies in PySpark), as well as spark-submit script options like --jars and --packages, and Spark configurations such as spark.jars.*. These options allow users to seamlessly handle dependencies in their clusters.

However, the existing support for managing dependencies in Apache Spark has a limitation: dependencies can only be added statically and cannot be changed during runtime, which means you must always set them before starting your Driver. To address this, we have introduced session-based dependency management support in Spark Connect, starting from Apache Spark 3.5.0. This new feature allows you to update Python dependencies dynamically during runtime. In this blog post, we walk through how to control Python dependencies at runtime using Spark Connect in Apache Spark.

Session-based Artifacts in Spark Connect

One environment for each Spark Context

When using the Spark Driver without Spark Connect, the Spark Context adds the archive (user environment) which is later automatically unpacked on the nodes, guaranteeing that all nodes possess the necessary dependencies to execute the job. This functionality simplifies dependency management in a distributed computing environment, minimizing the risk of environment contamination and ensuring that all nodes have the intended environment for execution. However, this can only be set once statically before starting the Spark Context and Driver, limiting flexibility.
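
For reference, here is a minimal sketch of that static approach, following the pattern from How to Manage Python Dependencies in PySpark rather than the exact setup in this post: the packed Conda environment is attached via the spark.archives configuration and the worker interpreter is selected with the PYSPARK_PYTHON environment variable, all before the Spark Context starts.

import os
from pyspark.sql import SparkSession

# Point Python workers at the interpreter inside the unpacked archive.
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"

# 'pyspark_conda_env.tar.gz' is assumed to have been created beforehand,
# e.g. with 'conda pack'. Use 'spark.yarn.dist.archives' instead on YARN.
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

# The archive is fixed for the lifetime of this Spark Context.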

Separate environment for each Spark Session

With Spark Connect, dependency management becomes more intricate due to the prolonged lifespan of the connect server and the possibility of multiple sessions and clients - each with its own Python versions, dependencies, and environments. The proposed solution is to introduce session-based archives. In this approach, each session has a dedicated directory where all related Python files and archives are stored. When Python workers are launched, the current working directory is set to this dedicated directory. This guarantees that each session can access its specific set of dependencies and environments, effectively mitigating potential conflicts.
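
As a minimal sketch of what this looks like from the client side (assuming a Spark Connect server reachable at sc://localhost and PySpark 3.5's addArtifacts API; my_util.py is a hypothetical module), a client creates a remote session and ships a session-scoped artifact that only its own Python workers can see:

from pyspark.sql import SparkSession

# Connect to a Spark Connect server; each remote session gets its own
# dedicated artifact directory on the server side.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Ship a session-scoped Python module. Other sessions connected to the
# same server do not see this artifact.
spark.addArtifacts("my_util.py", pyfile=True)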

Using Conda

Conda is one of the most widely used Python package management systems. PySpark users can use Conda environments directly to package their third-party Python packages by leveraging conda-pack, a library designed to create relocatable Conda environments.

The following example demonstrates creating a packed Conda environment that is later unpacked in both the driver and executor to enable session-based dependency management. The environment is packed into an archive file, capturing the Python interpreter and all associated dependencies.

import conda_pack
import os

# Pack the current environment ('pyspark_conda_env') to 'pyspark_conda_env.tar.gz'.
# Or you can run 'conda pack' in your shell.

conda_pack.pack()
spark.addArtifact(
    f"{os.environ.get('CONDA_DEFAULT_ENV')}.tar.gz#environment",
    archive=True)
spark.conf.set(
    "spark.sql.execution.pyspark.python", "environment/bin/python")

# From now on, Python workers on executors use the `pyspark_conda_env` Conda 
# environment.
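
To sanity-check that the session is picking up the packed environment, a small (hypothetical) zero-argument UDF can report which interpreter the executors are running; it should point inside the unpacked environment directory:

from pyspark.sql.functions import udf

@udf(returnType="string")
def executor_python():
    import sys
    return sys.executable  # expected to point into 'environment/bin'

spark.range(1).select(executor_python()).show(truncate=False)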

Using PEX

Spark Connect supports using PEX to bundle Python packages together. PEX is a tool that generates a self-contained Python environment. It functions similarly to Conda or virtualenv, but a .pex file is an executable on its own.

In the following example, a .pex file is created for both the driver and executor to utilize for each session. This file incorporates the specified Python dependencies provided through the pex command.

# Pack the current env to 'pyspark_pex_env.pex'.
pex $(pip freeze) -o pyspark_pex_env.pex

After you create the .pex file, you can ship it to the session-based environment so your session uses the isolated .pex file.

spark.addArtifact("pyspark_pex_env.pex", file=True)
spark.conf.set(
    "spark.sql.execution.pyspark.python", "pyspark_pex_env.pex")

# From now on, Python workers on executors use the `pyspark_pex_env.pex` environment.

Using Virtualenv

Virtualenv is a Python tool for creating isolated Python environments. Since Python 3.3, a subset of its features has been integrated into the standard library under the venv module. The venv module can be leveraged for Python dependencies by using venv-pack in a similar way to conda-pack. The example below demonstrates session-based dependency management with venv.

import venv_pack
import os

# Pack the current venv to 'pyspark_venv.tar.gz'.
# Or you can run 'venv-pack' in your shell.

venv_pack.pack(output='pyspark_venv.tar.gz')
spark.addArtifact(
    "pyspark_venv.tar.gz#environment",
    archive=True)
spark.conf.set(
    "spark.sql.execution.pyspark.python", "environment/bin/python")

# From now on, Python workers on executors use your venv environment.

Conclusion

Apache Spark 3.5.0 offers multiple options, including Conda, virtualenv, and PEX, to ship and manage Python dependencies with Spark Connect dynamically during runtime, overcoming the limitation of static Python dependency management.

In the case of Databricks notebooks, we provide a more elegant solution with a user-friendly interface for Python dependencies to address this problem. Additionally, users can directly utilize pip and Conda for Python dependency management. Take advantage of these features today with a free trial on Databricks.


Related posts

How to Manage Python Dependencies in PySpark
Introducing Spark Connect - The Power of Apache Spark, Everywhere
Introducing Apache Spark™ 3.5