Spark + AI Summit 2020 features a number of pre-conference training workshops that include a mix of instruction and hands-on exercises to help you improve your Apache Spark™ and Data Engineering skills.
Role: Business Leader
Duration: Half Day
Discover how Databricks helps your data teams stop working in silos, simplifies data preparation, enables an agile AI ecosystem, and keeps infrastructure from getting in the way. In this course, we’ll review foundational big data concepts, explore why many organizations struggle to achieve true artificial intelligence, and dive into how the components of the Unified Data Analytics platform can be used to overcome those challenges.
Prerequisites:
Role: Business Leader, Platform Administrator, SQL Analyst, Data Engineer, Data Scientist
Duration: Half Day
Learn what Delta Lake is and how it simplifies and optimizes data architecture and the engineering of data pipelines. This course dives into the core features of Delta Lake and how they bring reliability, performance, and lifecycle management to data lakes.
Prerequisites:
Role: Platform Administrator
Duration: Half Day
Learn administration and security best practices for managing your Databricks workspace. In this course, we’ll guide you through using the Admin Console to manage users and workspace storage, configure access control for your workspace, clusters, pools, and jobs, and apply cluster provisioning strategies and usage management features to maximize usability and cost-effectiveness in different scenarios. Then, we’ll cover data protection features and configure data access control with Databricks best practices. Lastly, we’ll describe the Databricks platform architecture and deployment models, as well as the network security and compliance features for each.
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
Learn the fundamentals of Spark programming in a case study driven course that explores the core components of the DataFrame API. You’ll read and write data to various sources, preprocess data by correcting schemas and parsing different data types, and apply a variety of DataFrame transformations and actions to answer business questions. This course is designed to provide the essential concepts and skills you’ll need to navigate the Spark documentation and start programming immediately. This class is taught in Python/Scala.
Prerequisites:
Role: SQL Analyst
Duration: Half Day
Learn how to leverage SQL on Databricks to easily discover insights on big data. The Databricks workspace provides a powerful data processing environment where data professionals can follow traditional data analysis workflows including exploring, visualizing, and preparing data for sharing with stakeholders. This course is designed to get you started using Databricks functionality to gain shareable insights on data. This class is taught in SQL only.
Prerequisites:
Role: Data Engineer
Duration: Half Day
Learn and implement tuning best practices while diagnosing and fixing common performance problems. You’ll complete guided coding challenges and refactor existing code to increase overall performance by applying the best practices you’ve learned. This class is taught in Python/Scala.
Prerequisites:
Role: Data Engineer
Duration: Half Day
Learn to build robust data pipelines using Apache Spark and Delta Lake on Databricks, performing ETL, data cleansing, and data aggregation. Delta Lake is designed to overcome many problems associated with traditional data lake pipelines.
Prerequisites:
Role: Data Engineer
Duration: Half Day
Learn how to use Structured Streaming to ingest data from files and publish-subscribe systems. You’ll learn the fundamentals of streaming systems, how to read, write, and display streaming data, and how Structured Streaming is used with Databricks Delta. You’ll then use a publish-subscribe system to stream data and visualize meaningful insights. This class is taught concurrently in Python and Scala.
Prerequisites:
Role: Data Scientist
Duration: Half Day
This course focuses on teaching distributed machine learning with Spark. Students will build and evaluate pipelines with MLlib, understand the differences between single node and distributed ML, and optimize hyperparameter tuning at scale. This class is taught concurrently in Python and Scala.
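The cross-validated hyperparameter tuning described above, which MLlib scales out across a cluster with tools like ParamGridBuilder and CrossValidator, can be sketched in plain Python on a toy problem. The 1-D data and the choice of a k-nearest-neighbors model below are invented purely for illustration, not taken from the course:

```python
# Toy grid search with leave-one-out cross-validation: conceptually what
# MLlib's CrossValidator automates across a Spark cluster.
# 1-D points labeled by the sign of x, with one mislabeled "noisy" point.
data = [(-4, 0), (-3, 0), (-2, 0), (-1, 0), (-0.5, 1),
        (1, 1), (2, 1), (3, 1), (4, 1)]

def knn_predict(train, x, k):
    # k nearest neighbors by distance; majority vote on their labels.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def loo_accuracy(points, k):
    # Leave-one-out CV: hold out each point, "train" on the rest.
    correct = 0
    for i, (x, y) in enumerate(points):
        train = points[:i] + points[i + 1:]
        correct += knn_predict(train, x, k) == y
    return correct / len(points)

# Hyperparameter grid over k; keep the k with the best CV score.
grid = [1, 3]
best_k = max(grid, key=lambda k: loo_accuracy(data, k))
```

A larger k smooths over the mislabeled point, so cross-validation prefers k=3 here; the same select-by-validation-score loop is what distributed tuning parallelizes at scale.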
Prerequisites:
Role: Data Scientist
Duration: Half Day
This course offers a thorough overview of how to scale training and deployment of neural networks with Apache Spark. We guide students through building deep learning models with TensorFlow, performing distributed inference with Spark UDFs via MLflow, and training a distributed model across a cluster using Horovod. This course is taught entirely in Python.
Prerequisites:
Role: Data Scientist
Duration: Half Day
In this course you will learn Reinforcement Learning theory and get hands-on practice. Upon completion, you will understand the differences between supervised, unsupervised, and reinforcement learning, as well as Markov Decision Processes (MDPs) and Dynamic Programming. You will be able to formulate a reinforcement learning problem and implement the policy evaluation, policy iteration, and value iteration algorithms in Python (using Dynamic Programming). This course is taught entirely in Python.
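The value iteration algorithm mentioned above can be sketched in plain Python on a hypothetical two-state MDP; the transition model and rewards here are invented for illustration:

```python
# Minimal value iteration on a hypothetical two-state MDP.
# P[state][action] -> list of (probability, next_state, reward) tuples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # discount factor

def value_iteration(P, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of expected return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(P, V, gamma):
    # Extract the policy that acts greedily with respect to V.
    return {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
        for s in P
    }

V = value_iteration(P, GAMMA)
pi = greedy_policy(P, V, GAMMA)
```

On this toy MDP the optimal policy is to reach state 1 and stay there, collecting the discounted reward stream 2/(1 − γ) = 20.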
Prerequisites:
Role: Data Scientist
Duration: Half Day
In this course you will learn model-free Reinforcement Learning theory and get hands-on practice. You will be able to formulate a reinforcement learning problem and implement model-free Reinforcement Learning algorithms. In particular, you will implement Monte Carlo, TD, and Sarsa algorithms for prediction and control tasks. This course is taught entirely in Python.
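As a rough illustration of model-free prediction, here is tabular TD(0) evaluating a fixed policy on a toy deterministic chain; the environment and constants are invented for illustration, not taken from the course:

```python
# TD(0) policy evaluation on a hypothetical 4-state chain (states 0..3).
# The fixed policy always moves right; reaching state 3 ends the episode
# with reward 1, and all other steps give reward 0.
GAMMA = 0.9      # discount factor
ALPHA = 0.1      # step size
EPISODES = 2000
TERMINAL = 3

def td0_prediction():
    V = [0.0] * (TERMINAL + 1)  # value estimates; terminal value stays 0
    for _ in range(EPISODES):
        s = 0
        while s != TERMINAL:
            s_next = s + 1                        # policy: always move right
            r = 1.0 if s_next == TERMINAL else 0.0
            # TD(0) update: nudge V[s] toward the bootstrapped target r + gamma*V[s'].
            V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

V = td0_prediction()
# True values under gamma = 0.9: V(2) = 1, V(1) = 0.9, V(0) = 0.81
```

Because each update bootstraps from the next state's current estimate rather than a full return, TD learns online from single transitions, which is the key contrast with Monte Carlo prediction.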
Prerequisites:
Role: Data Scientist, Data Engineer
Duration: Half Day
In this hands-on course, data scientists and data engineers learn the best practices for managing experiments, projects, models, and a production model registry using MLflow. By the end of this course, you will have built a pipeline to train, register, and deploy machine learning models using the environment they were trained with. This course is taught entirely in Python and pairs well with the Machine Learning Deployment course.
Prerequisites:
Role: Data Scientist, Data Engineer
Duration: Half Day
In this hands-on course, data scientists and data engineers learn best practices for deploying machine learning models in three paradigms: batch, streaming, and real-time using REST. It explores common production issues faced when deploying machine learning solutions and how to monitor these models once they are in production. By the end of this course, you will have built the infrastructure to deploy and monitor machine learning models in various deployment scenarios. This course is taught entirely in Python and pairs well with the MLflow course.
Prerequisites:
Role: Data Scientist
Duration: Half Day
In this course students will learn how to apply machine learning techniques in a distributed environment using SparkR and sparklyr. Students will learn about the Spark architecture and Spark DataFrame APIs, build ML models, and perform hyperparameter tuning and pipeline optimization. The class is a combination of lectures, demos, and hands-on labs. This course is taught entirely in R.
Prerequisites:
Role: Data Scientist
Duration: Half Day
This course will teach you the fundamentals of natural language processing (NLP) and how to do it at scale. You will solve classification, sentiment analysis, and text wrangling tasks by applying pre-trained word embeddings, generating term frequency-inverse document frequency (TF-IDF) vectors for your dataset, applying dimensionality reduction techniques, and more. This course is taught entirely in Python.
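The TF-IDF vectors mentioned above can be sketched in plain Python. The tiny corpus and the smoothed IDF weighting below are illustrative assumptions, not the course's dataset or Spark's exact implementation:

```python
import math
from collections import Counter

# Hypothetical three-document corpus for illustration only.
docs = [
    "spark makes big data simple",
    "spark streaming processes big data",
    "natural language processing at scale",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # Smoothed inverse document frequency: rare terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    # One sparse vector (dict) per document: term frequency times IDF.
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vectors

vecs = tfidf(docs)
```

Note how "simple" (which appears in one document) outweighs "spark" (which appears in two) within the first vector: IDF down-weights terms that occur across the whole corpus.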
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with financial data. You’ll learn how to deal with dirty data and how to get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented. This class is taught concurrently in Python and Scala.
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with retail data. You’ll learn how to deal with dirty data, and get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented. This class is taught concurrently in Python and Scala.
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with healthcare data. You’ll learn how to deal with dirty data, and get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class where you can test all the concepts presented. This class is taught concurrently in Python and Scala.
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
In this half-day course, students will learn how Databricks and Spark can help solve real-world problems they face when working with manufacturing data. Students will learn how to deal with dirty data and how to get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can test all the concepts presented.
Prerequisites:
Role: Data Engineer, Data Scientist
Duration: Half Day
In this half-day course, students will familiarize themselves with the format of the Databricks Certified Associate Developer for Apache Spark 2.4 exam and get tips for preparation. We will review which parts of the DataFrame API and Spark architecture the exam covers and the skills students need to prepare for it.
Prerequisites:
Role: SQL Analyst, Data Engineer, Data Scientist
Duration: 90 minutes, repeated 4x
This course covers the new features in Spark 3.0. It focuses on updates to performance, monitoring, usability, stability, extensibility, PySpark, and SparkR. Students will also learn about backward compatibility with Spark 2.x and the considerations required for upgrading to Spark 3.0.
Prerequisites: