DevOps Essentials for Data Engineering

This course explores software engineering best practices and DevOps principles, specifically designed for data engineers working with Databricks. Participants will build a strong foundation in key topics such as code quality, version control, documentation, and testing. The course emphasizes DevOps, covering core components, benefits, and the role of continuous integration and delivery (CI/CD) in optimizing data engineering workflows.

You will learn how to apply modularity principles in PySpark to create reusable components and structure code efficiently. Hands-on experience includes designing and implementing unit tests for PySpark functions using the pytest framework, followed by integration testing for Databricks data pipelines with Spark Declarative Pipeline and Jobs to ensure reliability.

The course also covers essential Git operations within Databricks, including using Databricks Git Folders to integrate continuous integration practices. Finally, you will take a high level look at various deployment methods for Databricks assets, such as REST API, CLI, SDK, and Databricks Asset Bundles (DABs), providing you with the knowledge of techniques to deploy and manage your pipelines.

By the end of the course, you will be proficient in software engineering and DevOps best practices, enabling you to build scalable, maintainable, and efficient data engineering solutions.

Note:

1. This is the fourth course in the 'Data Engineering with Databricks' series.

2. Databricks Academy is transitioning from video lectures to a more streamlined PDF format with slides and notes for all self-paced courses. Please note that demo videos will still be available in their original format. We would love to hear your thoughts on this change, so please share your feedback through the course survey at the end. Thank you for being a part of our learning community!

Languages Available: English | 日本語 | Português BR | 한국어

Skill Level

Associate

Duration

Prerequisites

- Proficient knowledge of the Databricks platform, including experience with Databricks Workspaces, Apache Spark, Delta Lake and the Medallion Architecture, Unity Catalog, Delta Live Tables, and Workflows. A basic understanding of Git version control is also required.
- Experience ingesting and transforming data, with proficiency in PySpark for data processing and DataFrame manipulations. Additionally, candidates should have experience writing intermediate level SQL queries for data analysis and transformation.
- Knowledge of Python programming, with proficiency in writing intermediate level Python code, including the ability to design and implement functions and classes. Users should also be skilled in creating, importing, and effectively utilizing Python packages.

Self-Paced

Custom-fit learning paths for data, analytics, and AI roles and career paths through on-demand videos

Customer registration Partner registration

See all our registration options

Registration options

Databricks has a delivery method for wherever you are on your learning journey

Self-Paced

Custom-fit learning paths for data, analytics, and AI roles and career paths through on-demand videos

Instructor-Led

Public and private courses taught by expert instructors across half-day to two-day courses

Blended Learning

Self-paced and weekly instructor-led sessions for every style of learner to optimize course completion and knowledge retention. Go to Subscriptions Catalog tab to purchase

Purchase now

Skills@Scale

Comprehensive training offering for large scale customers that includes learning elements for every style of learning. Inquire with your account executive for details

Upcoming Public Classes

Model Development at Scale

In this course, you will develop an in-depth understanding of how to design, implement, and govern scalable machine learning systems that operate effectively at enterprise scale. The curriculum is organized into three experiential modules: developing distributed ML workflows with frameworks such as Apache SparkML and Ray, transitioning local ML development to distributed compute using tools like Pandas on Spark, and operationalizing and governing production models with Databricks’ MLOps ecosystem.

Through hands-on projects, you will construct end-to-end distributed ML pipelines using the SparkML workflow, applying Transformers, Estimators, and the fit/transform paradigm for both classification and regression tasks. You will version, compare, and manage experiments using MLflow 3.0 to ensure reproducibility and governance, capturing lineage between data, features, and model artifacts. Additionally, you will apply scalable Hyperparameter Optimization frameworks to improve model performance at scale.

The course concludes by demonstrating complete lifecycle management, from experimentation to production deployment, using Unity Catalog and Model Serving. You will learn to operationalize trained models, monitor their performance, and implement strong governance over models, features, and Delta assets within the Databricks environment.

Paid & Subscription

Lab

Professional

Platform Administrator

Get Started with Data Governance on Databricks

In this course, you will explore how Unity Catalog enables secure, centralized data governance and fine-grained access control on Databricks. You will learn about table and volume types, catalog and schema configuration, group-based access management, and strategies for migrating existing access controls into Unity Catalog. The course also explains how to design and apply fine-grained controls such as row-level security, column masking, and attribute-based access control, how to combine these mechanisms across data and AI assets, and how to align them with broader governance requirements for compliant, scalable access management.

Languages Available: English | 日本語 | 한국어

Building Enterprise Applications with Databricks Apps

This course introduces Databricks Apps, a new feature that empowers developers to build and deploy secure data and AI applications directly on the Databricks platform. It is primarily designed for data scientists, data engineers, and developers who need to create custom applications that make data insights accessible to non-technical users within their organizations.The course begins with an overview of the main components of Databricks Apps, followed by the deployment of a simple app using available templates. Next, you will learn how to build and deploy a data-driven application that integrates with other platform features. The course then covers how to set up an IDE for Databricks Apps, and how to edit and deploy an app using an IDE. It concludes with how to implement security and best practices for an enterprise app.

Note: Databricks Academy is transitioning from video lectures to a more streamlined PDF format with slides and notes for all self-paced courses. Please note that demo videos will still be available in their original format. We would love to hear your thoughts on this change, so please share your feedback through the course survey at the end. Thank you for being a part of our learning community!

Paid & Subscription

Lab

Associate