Yin Huai

Software Engineer, Databricks

Yin is a Staff Software Engineer at Databricks. His work focuses on designing and building the Databricks Runtime container environment and its associated testing and release infrastructure. Before joining Databricks, he was a PhD student at The Ohio State University, where he was advised by Xiaodong Zhang. Yin is also an Apache Spark PMC member.

Past sessions

Summit 2021 Managing Millions of Tests Using Databricks

May 26, 2021 05:00 PM PT

Databricks Runtime is the execution environment that powers millions of VMs running data engineering and machine learning workloads daily in Databricks. Inside Databricks, we run millions of tests per day to ensure the quality of different versions of Databricks Runtime. With that many tests executed daily, we continuously face the challenge of monitoring test results effectively and triaging problems. In this talk, I am going to share our experience of building an automated test monitoring and reporting system using Databricks. I will cover how we ingest data from different sources, such as CI systems and Bazel build metadata, into Delta, and how we analyze test results and report failures to their owners through Jira. I will also show you how this system empowers us to build different types of reports that effectively track the quality of changes made to Databricks Runtime.

Summit 2014 Easy JSON Data Manipulation in Spark

June 29, 2014 05:00 PM PT

In this talk, I will introduce the new JSON support in Spark. With the JSON support, users do not need to define a schema for a JSON dataset. Instead, Spark SQL automatically infers the schema from the data. Users can then write SQL queries against the JSON dataset as if it were a regular table, or seamlessly convert it to other formats (e.g., Parquet files). I will also talk about our ongoing efforts to let users easily work with data from different sources in different formats.
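
To make the idea of schema inference concrete, here is a minimal sketch in plain Python. It is not Spark SQL's implementation and does not use any Spark API; the function names (`infer_type`, `merge_types`, `infer_schema`), the type names, and the sample records are all assumptions made for illustration. It scans JSON records, assigns each field a type, and widens the type when records disagree.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a simple, illustrative type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be checked before int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge_types(a, b):
    """Pick one type that can hold values of both a and b (hypothetical rules)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback: conflicting types are kept as text


def infer_schema(lines):
    """Infer a field -> type mapping from an iterable of JSON object strings."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            found = infer_type(value)
            schema[field] = merge_types(schema[field], found) if field in schema else found
    return schema


# Hypothetical sample data: the second record adds a field and widens "age".
records = [
    '{"name": "a", "age": 30}',
    '{"name": "b", "age": 31.5, "city": "Columbus"}',
]
schema = infer_schema(records)
print(schema)
```

Fields missing from some records simply appear in the merged schema, which is what lets a user query the dataset without declaring a schema up front.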

Summit 2017 A Deep Dive into Spark SQL’s Catalyst Optimizer

June 5, 2017 05:00 PM PT

Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees. In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
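
As a rough illustration of what "manipulating trees" means, the sketch below rewrites a small expression tree with a constant-folding rule. It is written in plain Python and is not Catalyst's API (Catalyst itself is written in Scala); the node classes (`Literal`, `Attribute`, `Add`) and the helpers `transform_up` and `fold_constants` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Rewrite the children first, then apply the rule to the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return node  # leave every other node unchanged


# x + (1 + 2) is rewritten to x + 3
expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(expr, fold_constants)
print(optimized)
```

Keeping each rule as a small function that matches one pattern and leaves everything else untouched is what makes this style of optimizer easy to extend: a new optimization is just another rule passed to the same traversal.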

Learn more:

  • Deep Dive into Spark SQL’s Catalyst Optimizer
  • Cost Based Optimizer in Apache Spark 2.2
  • Catalyst: A Query Optimization Framework for Spark and Shark