Engineering blog

Automate Your Data and ML Workflows With GitHub Actions for Databricks

Ahmed Bilal
Sid Murching
Mohamad Arabi
Xiangrui Meng

As demand for data and machine learning (ML) applications grows, businesses are adopting continuous integration and deployment practices to ensure they can deploy reliable data and AI workflows at scale. Today we are announcing the first set of GitHub Actions for Databricks, which make it easy to automate the testing and deployment of data and ML workflows from your preferred CI/CD provider. For example, you can run integration tests on pull requests, or you can run an ML training pipeline on pushes to main. By automating your workflows, you can improve developer productivity, accelerate deployment and create more value for your end-users and organization.

GitHub Actions for Databricks simplify CI/CD workflows

Today, teams spend significant time setting up CI/CD pipelines for their data and AI workloads. Crafting these CI/CD pipelines can be a painstaking process that requires stitching together multiple APIs, creating custom plugins, and then maintaining those plugins. GitHub Actions for Databricks are first-party actions that provide a simple way to run Databricks notebooks from GitHub Actions workflows. With the release of these actions, you can now easily create and manage automation workflows for Databricks.

What can you do with GitHub Actions for Databricks?

We are launching two new GitHub Actions in the GitHub Marketplace that will help data engineers and data scientists run notebooks directly from GitHub.

You can use the actions to run notebooks from your repo in a variety of ways. For example, you can use them to perform the following tasks:

  • Run a notebook on Databricks from the current repo and await its completion
  • Run a notebook using library dependencies in the current repo and on PyPI
  • Run an existing notebook in the Databricks Workspace
  • Run notebooks against different workspaces - for example, run a notebook against a staging workspace and then run it against a production workspace
  • Run multiple notebooks in series, including passing the output of a notebook as the input to another notebook
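
As a sketch of the second task above, the step below attaches a PyPI dependency to the notebook run. The `libraries-json` input and the package pin shown are illustrative assumptions; check the action's documentation for the exact input format.

```yaml
# Hedged sketch: attach a PyPI library to the notebook run.
# The `libraries-json` input and the scikit-learn pin are illustrative.
- name: Run notebook with PyPI dependencies
  uses: databricks/run-notebook@v0
  with:
    local-notebook-path: path/to/my/databricks_notebook.py
    databricks-host: https://adb-XXXX.XX.dev.azuredatabricks.net
    databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
    libraries-json: >
      [
        { "pypi": { "package": "scikit-learn==1.0.2" } }
      ]
```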

Below is an example of how to use the newly introduced action to run a notebook in Databricks from GitHub Actions workflows.


name: Run a notebook in Databricks on PRs

on:
  pull_request:

jobs:
  run-databricks-notebook:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Run a Databricks notebook
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: path/to/my/databricks_notebook.py
          databricks-host: https://adb-XXXX.XX.dev.azuredatabricks.net
          databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
          git-commit: ${{ github.event.pull_request.head.sha }}
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "10.4.x-scala2.12",
              "node_type_id": "Standard_D3_v2"
            }
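
Chaining notebooks works similarly. The sketch below runs two notebooks in series and passes the first notebook's exit output into the second as a parameter. The notebook paths, the `notebook-params-json` input, and the `notebook-output` output name are illustrative assumptions about the action's interface; consult the action's documentation for the exact names.

```yaml
# Hedged sketch: run two notebooks in series, feeding the first
# notebook's output into the second as a parameter.
# Paths and the notebook-params-json / notebook-output names are assumptions.
- name: Run first notebook
  id: prepare-data
  uses: databricks/run-notebook@v0
  with:
    local-notebook-path: notebooks/prepare_data.py
    databricks-host: https://adb-XXXX.XX.dev.azuredatabricks.net
    databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
- name: Run second notebook with the first notebook's output
  uses: databricks/run-notebook@v0
  with:
    local-notebook-path: notebooks/train_model.py
    databricks-host: https://adb-XXXX.XX.dev.azuredatabricks.net
    databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
    notebook-params-json: >
      { "input_path": "${{ steps.prepare-data.outputs.notebook-output }}" }
```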

Get started with the GitHub Actions for Databricks

Ready to try it out? You can read more about GitHub Actions for Databricks and how to use them in our documentation: Continuous integration and delivery on Databricks using GitHub Actions.
