
For your data-centered workloads, Databricks offers a best-in-class development experience and gives you the tools you need to adhere to code development best practices. Utilizing Git for version control, collaboration, and CI/CD is one such best practice. Customers can work with their Git repositories in Databricks via the 'Repos' feature, which provides a visual Git client that supports common Git operations such as cloning, committing and pushing, pulling, branch management, visual comparison of diffs, and more.

Clone only the content you need

Today, we are happy to share that Databricks Repos now supports Sparse Checkout, a client-side setting that allows you to clone and work with only a subset of your repository's directories in Databricks. This is especially useful when working with monorepos. A monorepo is a single repository that holds all your organization's code and can contain many logically independent projects managed by different teams. Monorepos can often grow quite large, exceeding the size limits supported by Databricks Repos.

With Sparse Checkout you can clone only the content you need to work on in Databricks, such as an ETL pipeline or machine learning model training code, while leaving out the irrelevant parts, such as your mobile app codebase. By cloning only the relevant portion of your codebase, you can stay within Databricks Repos limits and reduce clutter from unnecessary content.

Getting started

Using Sparse Checkout is simple:

  1. First, you will need to add your Git provider personal access token (PAT) to Databricks, which can be done in the UI via Settings > User Settings > Git Integration or programmatically via the Databricks Git credentials API (see the sample call below these steps).
  2. Next, create a Repo and check 'Sparse checkout mode' under Advanced settings.
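If you prefer the programmatic route for step 1, adding a GitHub PAT is a single call to the Git credentials API. The request below is a minimal sketch; the username and token values are placeholders you would replace with your own:

POST /api/2.0/git-credentials

{
  "git_provider": "gitHub",
  "git_username": "<git-username>",
  "personal_access_token": "<personal-access-token>"
}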

Sparse checkout mode

  3. Specify the patterns you want to include in the clone

To illustrate Sparse Checkout, consider this sample repository with the following directory structure:

├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── RUNME.md
├── SECURITY.md
├── config
│   ├── application.yaml
│   ├── configure_notebook.py
│   ├── portfolio.txt
│   └── stopwords.txt
├── images
│   ├── 1_heatmap.png
│   ├── 1_hyperopts_lda.png
│   ├── 1_scores.png
│   ├── 1_wordcloud.png
│   ├── 2_heatmap.png
│   ├── 2_scores.png
│   ├── 2_walktalk.png
│   ├── fs-lakehouse-logo-transparent.png
│   ├── fs-lakehouse-logo.png
│   ├── news_contribution.png
│   └── reference_architecture.png
├── notebooks
│   ├── data_prep
│   │   ├── 00_esg_context.py
│   │   └── 01_csr_download.py
│   └── scoring
│       ├── 02_csr_scoring.py
│       ├── 03_gdelt_download.py
│       └── 04_gdelt_scoring.py
├── requirements.txt
├── tests
│   ├── __init__.py
│   └── tests_utils.py
├── tf
│   └── modules
│       └── databricks-department-clusters
│           ├── README.md
│           ├── cluster-policies.tf
│           ├── clusters.tf
│           ├── main.tf
│           ├── provider.tf
│           ├── sql-endpoint.tf
│           ├── users-groups.tf
│           └── variables.tf
└── utils
    ├── __init__.py
    ├── gdelt_download.py
    ├── nlp_utils.py
    ├── scraper_utils.py
    └── spark_utils.py

Now say you want to clone only a subset of this repository in Databricks, for example the folders 'notebooks/data_prep', 'utils', and 'tests'. To do so, specify these patterns, separated by newlines, when creating the Repo, as shown below.
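Entered into the 'Sparse checkout mode' field when creating the Repo, the patterns for the three folders above would look like this:

notebooks/data_prep
tests
utils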

Repository in Databricks

This results in only the specified directories and files being included in the clone, as shown in the image below. Files in the repo root and the contents of the 'tests' and 'utils' folders are included. Since we specified 'notebooks/data_prep' in the pattern above, only that subfolder of 'notebooks' is included; 'notebooks/scoring' is not cloned. Databricks Repos supports 'cone patterns' for defining sparse checkout patterns; see more examples in our documentation. For more details about cone patterns, see Git's documentation or this GitHub blog post.

repo root

You can also perform the above steps via the Repos API. For example, to create a Repo with the above Sparse Checkout patterns, you would make the following API call:

POST /api/2.0/repos

{
  "url": "https://github.com/vaibhavsethi-db/esg-scoring",
  "provider": "gitHub",
  "path": "/Repos/[]/[]/esg-scoring",
  "sparse_checkout": {
    "patterns": ["notebook/data_prep", "tests", "utils"]
  }	
}
  4. Edit code and perform Git operations

     You can now edit existing files, commit and push them, and perform other Git operations from the Repos interface. When creating new folders or files, make sure they are included in the cone pattern you specified for that repo.

     Including a new folder outside of the cone pattern results in an error during the commit and push operation. To rectify it, edit the cone pattern in your Repo settings to include the new folder you are trying to commit and push; you can also update the patterns via the Repos API, as sketched below.
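If you prefer to manage this programmatically, the sparse checkout patterns can also be updated through the Repos API. The call below is a minimal sketch, assuming you have the repo ID returned when the Repo was created and want to add 'notebooks/scoring' to the clone; confirm the exact payload shape against the Repos API documentation:

PATCH /api/2.0/repos/{repo_id}

{
  "sparse_checkout": {
    "patterns": ["notebooks/data_prep", "notebooks/scoring", "tests", "utils"]
  }
}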

Ready to get started? Dive deeper into the Databricks Repos documentation and give it a try!
