Databricks Labs

Databricks Labs are projects created by the field team to help customers get their use cases into production faster!

DQX

Simplified Data Quality checking at Scale for PySpark Workloads on streaming and standard DataFrames.

GitHub Sources →

Documentation →
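
The general pattern DQX automates is sketched below in plain PySpark: flag rows that violate declared rules, then split the DataFrame into valid and quarantined subsets. This is a conceptual illustration of the pattern, not the DQX API itself; see the documentation above for the real interface.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "a@x.com", 42), (2, None, -5)], "id INT, email STRING, age INT"
    )

    # Collect the names of all violated rules per row.
    checked = df.withColumn(
        "_errors",
        F.filter(
            F.array(
                F.when(F.col("email").isNull(), F.lit("email_is_null")),
                F.when(F.col("age") < 0, F.lit("age_negative")),
            ),
            lambda x: x.isNotNull(),
        ),
    )
    valid = checked.filter(F.size("_errors") == 0).drop("_errors")
    quarantined = checked.filter(F.size("_errors") > 0)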

Kasal

Kasal is an interactive, low-code way to build and deploy AI Agents on the Databricks platform.

GitHub Sources →

Documentation →

Mosaic

Mosaic is a tool that simplifies the implementation of scalable geospatial data pipelines by binding together common open source geospatial libraries and Apache Spark™️. Mosaic also provides a set of examples and best practices for common geospatial use cases. It provides APIs for ST_ expressions and GRID_ expressions, supporting grid index systems such as H3 and British National Grid.

GitHub Sources →

Documentation →

Blog →
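
A minimal sketch of what Mosaic's GRID_/ST_ expressions look like in practice, assuming the documented enable_mosaic() entry point and the st_point / grid_pointascellid expressions; verify the names against the current documentation.

    import mosaic as mos

    mos.enable_mosaic(spark, dbutils)  # registers the ST_/GRID_ SQL expressions

    spark.createDataFrame(
        [(139.69, 35.68), (-0.12, 51.50)], "lon DOUBLE, lat DOUBLE"
    ).createOrReplaceTempView("points")

    # Index each point into an H3 cell at resolution 9.
    indexed = spark.sql("""
        SELECT lon, lat, grid_pointascellid(st_point(lon, lat), 9) AS cell_id
        FROM points
    """)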

Other Projects

Databricks MCP

A collection of MCP servers to help AI agents fetch enterprise data from Databricks and automate common developer actions on Databricks.

GitHub Sources →

Conversational Agent App

Application featuring a chat interface powered by Databricks Genie Conversation APIs, built specifically to run as a Databricks App.

GitHub Sources →

Knowledge Assistant Chatbot Application

Example Databricks Knowledge Assistant chatbot application.

GitHub Sources →

Feature Registry Application

The app provides a user-friendly interface for exploring existing features in Unity Catalog. Additionally, users can generate code for creating feature specs and training sets to train machine learning models and deploy features as Feature Serving Endpoints.

GitHub Sources →

DLT-META

This framework makes it easy to ingest data using Delta Live Tables and metadata. With DLT-META, a single data engineer can easily manage thousands of tables. Several Databricks customers run DLT-META in production to process more than 1,000 tables.

GitHub Sources →
Learn more →
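
The core idea is metadata-driven ingestion: pipelines are generated from table specs rather than hand-written per table. The sketch below shows the general pattern in plain PySpark; the spec format is an illustrative stand-in, not DLT-META's actual onboarding schema (see the link above for that).

    # Illustrative table specs; DLT-META drives ingestion from onboarding metadata.
    table_specs = [
        {"name": "bronze.orders",    "source": "/raw/orders",    "format": "json"},
        {"name": "bronze.customers", "source": "/raw/customers", "format": "csv"},
    ]

    # One generic loop ingests every table described by the metadata.
    for spec in table_specs:
        (spark.read.format(spec["format"])
              .load(spec["source"])
              .write.mode("append")
              .saveAsTable(spec["name"]))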

Smolder

Smolder provides an Apache Spark™ SQL data source for loading EHR data from HL7v2 message formats. Additionally, Smolder provides helper functions that can be used on a Spark SQL DataFrame to parse HL7 message text, and to extract segments, fields, and subfields from a message.

GitHub Sources →
Learn more →
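
A minimal sketch of loading HL7v2 messages from PySpark, assuming the "hl7" data source name from the project README; the path is a placeholder.

    # Each HL7v2 message is parsed into a structured row.
    df = spark.read.format("hl7").load("/path/to/hl7/messages")
    df.printSchema()  # inspect the parsed segments and fields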

Geoscan

Apache Spark ML Estimator for density-based spatial clustering based on Hexagonal Hierarchical Spatial Indices.

GitHub Sources →
Learn more →
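
A hedged sketch of fitting Geoscan as a standard Spark ML Estimator. The parameter setters follow the project README as best remembered; treat the exact names as assumptions and check the repo.

    from geoscan import Geoscan

    # points_df is an existing DataFrame with latitude/longitude columns.
    geoscan = (
        Geoscan()
        .setLatitudeCol("latitude")
        .setLongitudeCol("longitude")
        .setEpsilon(200)   # neighborhood radius in meters
        .setMinPts(20)     # minimum points to form a dense cluster
    )
    model = geoscan.fit(points_df)
    clustered = model.transform(points_df)  # adds a cluster label column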

Migrate

A tool to help customers migrate artifacts between Databricks workspaces. It allows customers to export configurations and code artifacts as a backup or as part of a migration between different workspaces.

GitHub Sources →
Learn more: AWS | Azure

Data Generator

Generate relevant data quickly for your projects. The Databricks data generator can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses.

GitHub Sources →
Learn more →
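
A minimal sketch with dbldatagen, following the patterns in the project documentation; the column names and value ranges here are arbitrary.

    import dbldatagen as dg

    spec = (
        dg.DataGenerator(spark, name="users", rows=1_000_000, partitions=8)
        .withColumn("user_id", "long", uniqueValues=100_000)
        .withColumn("score", "double", minValue=0.0, maxValue=1.0, random=True)
        .withColumn("country", "string", values=["US", "JP", "DE"], random=True)
    )
    df = spec.build()  # a 1M-row synthetic DataFrame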

DeltaOMS

Centralized Delta transaction log collection for metadata and operational metrics analysis on your Lakehouse.

GitHub Sources →
Learn more →

Splunk Integration

Add-on for Splunk, an app that allows Splunk Enterprise and Splunk Cloud users to run queries and execute actions, such as running notebooks and jobs, in Databricks.

GitHub Sources →
Learn more →

DiscoverX

DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.

GitHub Sources →

brickster

{brickster} is the R toolkit for Databricks. It includes:

  • Wrappers for the Databricks APIs (e.g. db_cluster_list, db_volume_read)
  • Browsing workspace assets via the RStudio Connections Pane (open_workspace())
  • Access to the databricks-sql-connector via {reticulate} (docs)
  • An interactive Databricks REPL

GitHub Sources →
Documentation →
Blog →

DBX

This tool simplifies the job launch and deployment process across multiple environments. It also helps package your project and deliver it to your Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be actively used both inside CI/CD pipelines and as part of local tooling for fast prototyping.

GitHub Sources →
Documentation →
Blog →

Tempo

The purpose of this project is to provide an API for manipulating time series on top of Apache Spark™. Functionality includes featurization using lagged time values, rolling statistics (mean, avg, sum, count, etc.), AS OF joins, and downsampling and interpolation. It has been tested on terabyte-scale historical data.

GitHub Sources →
Documentation →
Webinar →
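
A minimal sketch of an AS OF join with Tempo. TSDF and asofJoin follow the project documentation; the DataFrames and column names are placeholders.

    from tempo import TSDF

    # trades_df and quotes_df are existing DataFrames with an event_ts column.
    trades = TSDF(trades_df, ts_col="event_ts", partition_cols=["symbol"])
    quotes = TSDF(quotes_df, ts_col="event_ts", partition_cols=["symbol"])

    # For each trade, attach the most recent quote at or before the trade time.
    joined = trades.asofJoin(quotes, right_prefix="quote")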

PyLint Plugin

This plugin extends PyLint with checks for common mistakes and issues in Python code, specifically in the Databricks environment.

GitHub Sources →
Documentation →
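
The plugin is enabled through PyLint's normal plugin mechanism. A hedged sketch via the Python API, assuming the databricks.labs.pylint.all module path from the README; verify it against the documentation.

    from pylint import lint

    # Equivalent to: pylint --load-plugins=databricks.labs.pylint.all my_module.py
    lint.Run(
        ["--load-plugins=databricks.labs.pylint.all", "my_module.py"],
        exit=False,  # keep control instead of exiting the interpreter
    )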

PyTester

PyTester is a powerful way to manage test setup and teardown in Python. This library provides a set of fixtures to help you write integration tests for Databricks.

GitHub Sources →
Documentation →
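
A hedged sketch of a PyTester-style integration test. The ws and make_schema fixture names follow the project documentation as remembered; treat them as assumptions and check the docs.

    # test_catalog.py -- run with pytest; fixtures are injected by PyTester.
    def test_created_schema_is_listed(ws, make_schema):
        schema = make_schema()  # created for this test, dropped on teardown
        names = [s.name for s in ws.schemas.list(catalog_name=schema.catalog_name)]
        assert schema.name in names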

Delta Sharing Java Connector

The Java connector follows the Delta Sharing protocol to read shared tables from a Delta Sharing Server. To reduce and limit egress costs on the Data Provider side, it implements a persistent cache that removes unnecessary reads.

GitHub Sources →

Documentation →

Overwatch

Analyze all of your jobs and clusters across all of your workspaces to quickly identify where you can make the biggest adjustments for performance gains and cost savings.

Learn more →

UCX

UCX is a toolkit for enabling Unity Catalog (UC) in your Databricks workspace. UCX provides commands and workflows for migrating tables and views to UC, can rewrite dashboards, jobs, and notebooks to use the migrated data assets in UC, and offers many more features.

GitHub Sources →

Documentation →

Blog →

Please note that none of the projects at https://github.com/databrickslabs are formally supported by Databricks under a service level agreement (SLA). They are provided as-is, with no guarantees of any kind. Please do not submit support tickets for issues arising from the use of these projects. Issues discovered through the use of these projects should be filed as GitHub Issues on the corresponding repo. They will be reviewed as time permits, but there is no formal SLA for support.