Data Ingestion with Lakeflow Connect

This course provides a comprehensive introduction to Lakeflow Connect, a scalable and simplified solution for ingesting data into Databricks from a wide range of sources. You’ll begin by exploring the different types of Lakeflow Connect connectors (Standard and Managed) and learn various data ingestion techniques, including batch, incremental batch, and streaming ingestion. You'll also review the key benefits of using Delta table and the Medallion architecture

Next, you’ll develop practical skills for ingesting data from cloud object storage using Lakeflow Connect Standard Connectors. This includes working with methods such as CREATE TABLE AS SELECT (CTAS), COPY INTO, and Auto Loader, with an emphasis on the benefits and considerations of each approach. You’ll also learn how to append metadata columns to your bronze-level tables during ingestion into the Databricks Data Intelligence Platform. The course then covers how to handle records that don’t match your table schema using the rescued data column, along with strategies for managing and analyzing this data. You’ll also explore techniques for ingesting and flattening semi-structured JSON data.

Following this, you’ll explore how to perform enterprise-grade data ingestion using Lakeflow Connect Managed Connectors to bring in data from databases and Software-as-a-Service (SaaS) applications. The course also introduces Partner Connect as an option for integrating partner tools into your ingestion workloads.

Finally, the course wraps up with alternative ingestion strategies, including MERGE INTO operations and leveraging the Databricks Marketplace, equipping you with a strong foundation to support modern data engineering use cases.

Note: For SCORM lecture files, please ensure that you close the SCORM window after completing the content. Do not click the ‘Next Lesson’ button, as doing so may prevent the SCORM module from being marked as complete.

Skill Level

Associate

Duration

Prerequisites

⇾ Basic understanding of the Databricks Data Intelligence platform, including Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture and Unity Catalog.

⇾ Basic understanding of data ingestion workflows (batch, streaming, incremental) and general ETL principles

⇾ Experience working with various file formats (e.g., Parquet, CSV, JSON, TXT).

⇾ Proficiency in SQL and Python.

⇾ Familiarity with running code in Databricks notebooks.

Self-Paced

Custom-fit learning paths for data, analytics, and AI roles and career paths through on-demand videos

Customer registration Partner registration

See all our registration options

Registration options

Databricks has a delivery method for wherever you are on your learning journey

Self-Paced

Custom-fit learning paths for data, analytics, and AI roles and career paths through on-demand videos

Instructor-Led

Public and private courses taught by expert instructors across half-day to two-day courses

Blended Learning

Self-paced and weekly instructor-led sessions for every style of learner to optimize course completion and knowledge retention. Go to Subscriptions Catalog tab to purchase

Purchase now

Skills@Scale

Comprehensive training offering for large scale customers that includes learning elements for every style of learning. Inquire with your account executive for details

Upcoming Public Classes

Generative AI Engineer

Get Started with Databricks for Generative AI

This course offers a practical introduction to the Databricks Data Intelligence Platform, focusing on its key components and features for building and deploying generative AI systems. Participants will learn how Databricks facilitates the development of scalable generative AI solutions and explore tools such as AI Search, the Agent Framework, and MLflow's generative AI capabilities for model tracking and logging. This course includes hands-on experience in constructing and evaluating Retrieval-Augmented Generation (RAG) pipelines, deploying generative AI agents, and leveraging evaluation frameworks to optimize performance. By the end of the course, learners will be equipped with the skills to design, deploy, and monitor common generative AI applications on the Databricks Data Intelligence Platform.

Note: Databricks Academy is transitioning from video lectures to a more streamlined PDF format with slides and notes for all self-paced courses. Please note that demo videos will still be available in their original format. We would love to hear your thoughts on this change, so please share your feedback through the course survey at the end. Thank you for being a part of our learning community!

Languages Available: English | 日本語 | Português BR | 한국어

Get Started with Databricks for Data Engineering - Mandarin Chinese

本课程介绍在 Databricks Data Intelligence Platform 上执行基础数据工程工作流所需的核心技能。您将探索工作区、使用 Unity Catalog，并学习数据工程师在 Databricks 上日常使用的基础构建模块。

本课程采用实操方式。您首先熟悉工作区，然后针对每个主题完成配套的演示（Demo）和实验（Lab）笔记本。演示会由讲师或引导式笔记本带您了解某个概念；实验则要求您自行应用刚刚所学的内容。

课程内容涵盖：

• Databricks Data Intelligence Platform，以及 Databricks 工作区、Unity Catalog 和笔记本如何协同运作。

• 创建和管理 Delta Lake 表。

• 使用 INSERT、UPDATE 和 DELETE 修改数据。

• 探索 Delta 表的版本历史记录和时间旅行。

• 使用 Lakeflow Connect 选项（CTAS、上传 UI 和 COPY INTO）引入数据。

• 构建奖牌架构管道，通过 Bronze、Silver 和 Gold 层转换数据。

• 使用 Lakeflow Jobs 实现管道自动化。

•（附加内容）使用 Spark Declarative Pipelines 构建声明式管道。

Paid & Subscription

Lab

Onboarding

Machine Learning Practitioner

Get Started with Databricks for Machine Learning

In this course, you will develop the foundational skills needed to use the Databricks Data Intelligence Platform for executing machine learning workflows and supporting data science workloads. You will explore the platform from the perspective of a machine learning practitioner, covering topics such as building and managing features with Feature Engineering in Unity Catalog, end-to-end model lifecycle management with MLflow, and pipeline orchestration with Lakeflow Jobs. Additionally, you will learn about real-time model inference with Databricks Model Serving and experience Databricks' transparent, conversational approach to model development through Genie Code - Data Science Agent Mode, where you use natural language prompts to generate, run, and iteratively refine executable ML workflows directly in your notebook. The course includes instructor-led demonstrations, culminating in a comprehensive lab that reinforces the concepts covered throughout.

Languages Available: English | 日本語 | Português BR | 한국어

Paid & Subscription

Lab

Onboarding