Databricks Labs

Databricks Labs are projects created by the field team to help customers get their use cases into production faster.

DQX

Simplified Data Quality checking at Scale for PySpark Workloads on streaming and standard DataFrames.

GitHub Sources →

Documentation →

Kasal

Kasal is an interactive, low-code way to build and deploy AI Agents on the Databricks platform.

GitHub Sources →

Documentation →

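Mosaic

Mosaic is a tool that simplifies the implementation of scalable geospatial data pipelines by binding together common open source geospatial libraries and Apache Spark™️. Mosaic also provides a set of examples and best practices for common geospatial use cases. It provides APIs for ST_ expressions and GRID_ expressions, supporting grid index systems such as H3 and British National Grid.

GitHub Sources →

Documentation →

Blog →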

Other Projects

Databricks MCP

A collection of MCP servers to help AI agents fetch enterprise data from Databricks and automate common developer actions on Databricks.

GitHub Sources →

Conversational Agent App

Application featuring a chat interface powered by Databricks Genie Conversation APIs, built specifically to run as a Databricks App.

GitHub Sources →

Knowledge Assistant Chatbot Application

Example Databricks Knowledge Assistant chatbot application.

GitHub Sources →

Feature Registry Application

The app provides a user-friendly interface for exploring existing features in Unity Catalog. Additionally, users can generate code for creating feature specs and training sets to train machine learning models and deploy features as Feature Serving Endpoints.

GitHub Sources →

Mosaic

Mosaic is a tool that simplifies the implementation of scalable geospatial data pipelines by binding together common open source geospatial libraries and Apache Spark™️. Mosaic also provides a set of examples and best practices for common geospatial use cases. It provides APIs for ST_ expressions and GRID_ expressions, supporting grid index systems such as H3 and British National Grid.

GitHub Sources →

Documentation →

Blog →

DLT-META

This framework makes it easy to ingest data using Delta Live Tables and metadata. With DLT-META, a single data engineer can easily manage thousands of tables. Several Databricks customers use DLT-META in production to process 1000+ tables.

GitHub Sources →
Learn more →

Smolder

Smolder provides an Apache Spark™ SQL data source for loading EHR data from HL7v2 message formats. Additionally, Smolder provides helper functions that can be used on a Spark SQL DataFrame to parse HL7 message text, and to extract segments, fields, and subfields from a message.
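To make the segment, field, and subfield structure concrete, here is a minimal plain-Python sketch of splitting an HL7v2 message. It does not use Smolder or Spark, and the function name `parse_hl7v2` and the sample message are hypothetical, chosen only for illustration. It assumes the default HL7v2 delimiters (carriage return between segments, `|` between fields, `^` between components) and ignores repetitions and escape sequences.

```python
from typing import Dict, List


def parse_hl7v2(message: str) -> List[Dict[str, object]]:
    """Split a raw HL7v2 message into segments, fields, and subfields.

    Illustrative only: assumes default delimiters and skips repetition
    (~) and escape (\\) handling.
    """
    segments = []
    # Segments are separated by carriage returns; tolerate newlines too.
    for line in message.replace("\n", "\r").split("\r"):
        if not line.strip():
            continue
        raw_fields = line.split("|")
        seg_id = raw_fields[0]
        fields = []
        for position, raw in enumerate(raw_fields[1:]):
            # In MSH, the first field after the ID holds the encoding
            # characters themselves, so it must not be split on "^".
            if seg_id == "MSH" and position == 0:
                fields.append([raw])
            else:
                fields.append(raw.split("^"))
        segments.append({"id": seg_id, "fields": fields})
    return segments


# Hypothetical two-segment message for demonstration.
sample = "MSH|^~\\&|SENDER|FAC\rPID|1||12345^^^HOSP||DOE^JANE"
parsed = parse_hl7v2(sample)
pid = next(s for s in parsed if s["id"] == "PID")
family_name, given_name = pid["fields"][4]
print(family_name, given_name)
```

In this sketch, `pid["fields"][4]` holds the components of the fifth PID field, which in HL7v2 carries the patient name. A production parser would also need to handle repetitions, subcomponents, and non-default delimiters.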

GitHub Sources →
Learn more →

Geoscan

An Apache Spark ML Estimator for density-based spatial clustering, based on hexagonal hierarchical spatial indices.

GitHub Sources →
Learn more →

Migrate

A tool to help customers migrate artifacts between Databricks workspaces. Customers can export configurations and code artifacts as a backup or as part of a migration to a different workspace.

GitHub Sources
Learn more: AWS | Azure

Data Generator

Generate relevant data quickly for your projects. The Databricks data generator can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses.

GitHub Sources →
Learn more →

DeltaOMS

A centralized Delta transaction log collection for analyzing metadata and operational metrics on the Lakehouse.

GitHub Sources →
Learn more →

Splunk Integration

Add-on for Splunk, an app that allows Splunk Enterprise and Splunk Cloud users to run queries and execute actions, such as running notebooks and jobs, in Databricks.

GitHub Sources →
Learn more →

DiscoverX

DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.

GitHub Sources →

brickster

{brickster} is the R toolkit for Databricks. It includes:

Wrappers for Databricks APIs (e.g. db_cluster_list, db_volume_read)
Browse workspace assets via RStudio Connections Pane (open_workspace())
Exposes the databricks-sql-connector via {reticulate} (docs)
Interactive Databricks REPL

GitHub Sources →
Documentation →
Blog →

DBX

This tool simplifies the job launch and deployment process across multiple environments. It also helps to package your project and deliver it to your Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be actively used both inside CI/CD pipelines and as a part of local tooling for fast prototyping.

GitHub Sources →
Documentation →
Blog →

Tempo

The purpose of this project is to provide an API for manipulating time series on top of Apache Spark™. Functionality includes featurization using lagged time values, rolling statistics (mean, sum, count, etc.), AS OF joins, and downsampling and interpolation. This has been tested on TB-scale historical data.
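For readers unfamiliar with AS OF joins, the idea can be sketched in plain Python, without Spark and without the Tempo API. For each row on the left side, the join attaches the most recent right-side row whose timestamp is less than or equal to the left timestamp. The function name `as_of_join` and the trade and quote data below are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple

Row = Tuple[int, Any]  # (timestamp, value)


def as_of_join(left: List[Row], right: List[Row]) -> List[Tuple[int, Any, Optional[Any]]]:
    """Attach to each left row the latest right value with ts <= left ts."""
    # Sort the right side by timestamp once so it can be binary-searched.
    right_sorted = sorted(right, key=lambda row: row[0])
    right_ts = [ts for ts, _ in right_sorted]
    joined = []
    for ts, value in left:
        # Index of the first right timestamp strictly greater than ts.
        i = bisect_right(right_ts, ts)
        match = right_sorted[i - 1][1] if i > 0 else None
        joined.append((ts, value, match))
    return joined


# Hypothetical data: trades joined to the latest quote at or before each trade.
trades = [(3, "buy"), (7, "sell"), (1, "buy")]
quotes = [(2, 100.0), (5, 101.5), (6, 102.0)]
for row in as_of_join(trades, quotes):
    print(row)
```

A left row with no earlier right-side row gets `None`, which mirrors a left outer join. A distributed implementation would additionally partition by a series key before matching timestamps.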

GitHub Sources →
Documentation →
Webinar →

PyLint Plugin

This plugin extends PyLint with checks for common mistakes and issues in Python code that are specific to the Databricks environment.

GitHub Sources →
Documentation →

PyTester

PyTester is a powerful way to manage test setup and teardown in Python. This library provides a set of fixtures to help you write integration tests for Databricks.

GitHub Sources →
Documentation →

Delta Sharing Java Connector

The Java connector follows the Delta Sharing protocol to read shared tables from a Delta Sharing Server. To reduce and limit egress costs on the Data Provider side, we implemented a persistent cache that removes any unnecessary reads.

GitHub Sources →

Documentation →

Overwatch

Analyze all of your jobs and clusters across all of your workspaces to quickly identify where you can make the biggest adjustments for performance gains and cost savings.

Learn more →

UCX

UCX is a toolkit for enabling Unity Catalog (UC) in your Databricks workspace. UCX provides commands and workflows for migrating tables and views to UC. It can also rewrite dashboards, jobs, and notebooks to use the migrated data assets in UC, among many other features.

GitHub Sources →

Documentation →

Blog →

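All projects in the https://github.com/databrickslabs account are provided for exploration only and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided as-is, and we make no guarantees of any kind. Please do not submit a support ticket for any issue arising from the use of these projects. Any issues discovered while using these projects should be filed as GitHub Issues on the repository. They will be reviewed as time permits, but there are no formal SLAs for support.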