Introduction to Data Analysis for Aspiring Data Scientists


Join us for a four-part learning series: Introduction to Data Analysis for Aspiring Data Scientists. This self-paced online workshop series is for anyone interested in learning about data analysis. No previous programming experience is required.

Each workshop page contains the session video recording, transcripts, speaker info, and a GitHub link to access the notebooks and resources. We suggest you start with Part One, Introduction to Python, and continue from there in order because each workshop builds upon the last.

If you’d like to follow along, please sign up for a free Databricks Community Edition account or download the Delta Lake library.

Introduction to Python

In this workshop, we will show you the simple steps needed to program in Python using a notebook environment on the free Databricks Community Edition. This workshop covers the foundational concepts you need to start coding in Python, with a focus on data analysis. No prior programming knowledge is required.

Data Analysis with Pandas

This workshop is on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you will learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple visualizations. Although no prep work is required, we do recommend basic Python knowledge. Watch Part One, Introduction to Python, to learn the basics.
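To give a flavor of the steps listed above, here is a minimal sketch of reading data, filling a missing value, adding a derived column, and computing summary statistics with pandas. It is not taken from the workshop notebooks; the sample data, the column names, and the load_and_clean helper are all hypothetical, and plotting is left out to keep the example short.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
CSV_TEXT = """city,temperature_c,humidity
Austin,31.0,40
Boston,22.5,55
Chicago,,60
Denver,27.0,30
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Read CSV text, fill missing temperatures, and add a Fahrenheit column."""
    df = pd.read_csv(io.StringIO(csv_text))

    # Basic cleaning: replace missing temperatures with the column mean.
    mean_temp = df["temperature_c"].mean()
    df["temperature_c"] = df["temperature_c"].fillna(mean_temp)

    # Basic transformation: derive a new column from an existing one.
    df["temperature_f"] = df["temperature_c"] * 9 / 5 + 32
    return df


df = load_and_clean(CSV_TEXT)

# Summary statistics (count, mean, std, min, quartiles, max) for one column.
summary = df["temperature_c"].describe()
print(summary)
```

With a real file, pd.read_csv would take the file path instead of an in-memory buffer.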

Introduction to ML: scikit-learn

scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners. This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them.
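As a rough illustration of building and evaluating a simple model, the sketch below trains a classifier on the iris dataset that ships with scikit-learn. The choice of dataset, model, split size, and random seed are assumptions made for this example and are not necessarily what the workshop uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple classifier; max_iter is raised so the solver can converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate: fraction of held-out rows whose class was predicted correctly.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Evaluating on a held-out test set, rather than on the training rows, gives a more honest estimate of how the model behaves on new data.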

Introduction to Apache Spark

This workshop covers the fundamentals of Apache Spark, one of the most popular big data processing engines. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. No prior knowledge of Spark is required, but Python experience is highly recommended.

Tech Talks: Diving Into Delta Lake

Dive into the internals of Delta Lake, a popular open-source technology that enables ACID transactions, time travel, schema enforcement, and more on top of your data lakes.