Wenchen Fan

Software Engineer, Databricks

Wenchen Fan is a software engineer at Databricks working on Spark Core and Spark SQL. He focuses mainly on the Apache Spark open source community, leading the discussion and review of many features and fixes in Spark. He is a Spark committer and a Spark PMC member.

Past sessions

Summit 2021 Deep Dive into the New Features of Apache Spark 3.1

May 27, 2021 04:25 PM PT

Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives that are coming in the future. In this talk, we will share with the community many of the most important changes, with examples and demos.

The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and performance enhancements and new tuning tricks in the query compiler.
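
As a rough illustration of the ANSI compliance work (not taken from the session itself), the sketch below flips the real `spark.sql.ansi.enabled` flag and shows how an invalid cast changes behavior; the app name and values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object AnsiModeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ansi-mode-demo") // illustrative name
      .master("local[*]")
      .getOrCreate()

    // Legacy behavior: an invalid cast silently yields NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('not a number' AS INT) AS v").show()

    // ANSI mode: the same cast fails loudly at runtime instead.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try {
      spark.sql("SELECT CAST('not a number' AS INT) AS v").show()
    } catch {
      case e: Exception => println(s"ANSI mode rejected the cast: ${e.getMessage}")
    }

    spark.stop()
  }
}
```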

In this session watch:
Wenchen Fan, Software Engineer, Databricks
Xiao Li, Engineering Manager, Databricks

Summit 2020 Deep Dive into the New Features of Apache Spark 3.0

June 23, 2020 05:00 PM PT

Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other major initiatives that are coming in the future. In this talk, we will share with the community many of the most important changes, with examples and demos.

The following features are covered: accelerator-aware scheduling, adaptive query execution, dynamic partition pruning, join hints, new query explain, better ANSI compliance, observable metrics, a new UI for Structured Streaming, new UDAF and built-in functions, a new unified interface for Pandas UDFs, and various enhancements in the built-in data sources (e.g., Parquet, ORC, and JDBC).
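
To make a few of these concrete, here is a minimal hypothetical sketch (not from the talk) that enables the real Spark 3.0 flags for adaptive query execution and dynamic partition pruning, uses one of the new join hints, and prints the new formatted explain output; the table names and sizes are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object Spark3FeaturesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark3-features-demo") // illustrative name
      .master("local[*]")
      // Re-optimize the plan at runtime using shuffle statistics
      // (partition coalescing, join-strategy switching, skew handling).
      .config("spark.sql.adaptive.enabled", "true")
      // Prune fact-table partitions using runtime values from the
      // dimension side of a join.
      .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
      .getOrCreate()

    spark.range(1000000).createOrReplaceTempView("facts")
    spark.range(100).createOrReplaceTempView("dims")

    // SHUFFLE_HASH is one of the join hints added in 3.0; the
    // "formatted" explain mode is also a 3.0 addition.
    spark.sql(
      "SELECT /*+ SHUFFLE_HASH(dims) */ * FROM facts JOIN dims ON facts.id = dims.id"
    ).explain("formatted")

    spark.stop()
  }
}
```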

Summit 2018 Apache Spark Data Source V2

June 5, 2018 05:00 PM PT

As a general computing engine, Spark can process data from various data management and storage systems, including HDFS, Hive, Cassandra, and Kafka. For flexibility and high throughput, Spark defines the Data Source API, an abstraction of the storage layer. The Data Source API has two requirements:

1) Generality: support reading/writing most data management/storage systems.

2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.

Data Source API V2 is one of the most important features arriving with Spark 2.3. This talk dives into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
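
For orientation, here is a minimal read-only sketch against the Spark 2.3-era V2 interfaces (`DataSourceV2`, `ReadSupport`, `DataSourceReader`); the `RangeSource` class and its single-partition layout are illustrative, not the file-based source demonstrated in the talk:

```scala
import java.util.{Arrays => JArrays, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A toy source that emits the integers [0, 10) from a single partition.
class RangeSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new RangeReader
}

class RangeReader extends DataSourceReader {
  // Spark asks the source for its schema up front.
  override def readSchema(): StructType =
    StructType(StructField("value", IntegerType) :: Nil)

  // One factory per partition; each is serialized and shipped to an executor.
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    JArrays.asList[DataReaderFactory[Row]](new RangeReaderFactory(0, 10))
}

class RangeReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var current = start - 1
    override def next(): Boolean = { current += 1; current < end }
    override def get(): Row = Row(current)
    override def close(): Unit = ()
  }
}
```

A caller would load such a source with `spark.read.format(classOf[RangeSource].getName).load()`; Spark handles planning, scheduling, and row assembly around these small interfaces.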

Session hashtag: #DDSAIS12

Summit 2018 Deep Dive into Spark SQL with Advanced Performance Tuning

June 4, 2018 05:00 PM PT

Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate, and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk dives into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
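
As a flavor of that lifecycle, the hypothetical sketch below (not from the talk) sets two real tuning knobs and prints the parsed, analyzed, optimized, and physical plans for a small join; the data and values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object QueryLifecycleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-tuning-demo") // illustrative name
      .master("local[*]")
      // Two common knobs: shuffle parallelism, and the size threshold
      // under which a join side is broadcast instead of shuffled.
      .config("spark.sql.shuffle.partitions", "8")
      .config("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)
      .getOrCreate()
    import spark.implicits._

    val users  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    val orders = Seq((1, 9.99), (1, 5.00), (2, 3.50)).toDF("user_id", "amount")

    val totals = users.join(orders, $"id" === $"user_id")
      .groupBy($"name").sum("amount")

    // extended = true prints all four plan stages of the query lifecycle.
    totals.explain(true)
    totals.show()

    spark.stop()
  }
}
```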

Session hashtag: #Exp3SAIS

Summit 2017 A Developer’s View into Spark’s Memory Model

June 6, 2017 05:00 PM PT

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark's backend execution and push performance closer to the limits of modern hardware. In this talk, we'll take a deep dive into Apache Spark's unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on- and off-heap) to eliminate the overheads of the JVM object model and garbage collection.

Session hashtag: #SFdev25

Summit Europe 2017 A Developer’s View Into Spark’s Memory Model

October 24, 2017 05:00 PM PT

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on- and off-heap) to eliminate the overheads of the JVM object model and garbage collection.
Session hashtag: #EUdd2
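
For reference alongside both memory-model talks above, here is a minimal hypothetical configuration sketch using the real unified-memory and off-heap settings; the sizes are illustrative and workload-dependent:

```scala
import org.apache.spark.sql.SparkSession

object UnifiedMemoryDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; sizing depends on workload and hardware.
    val spark = SparkSession.builder()
      .appName("unified-memory-demo") // illustrative name
      .master("local[*]")
      // Fraction of (heap - 300MB reserve) shared by execution and
      // storage under the unified memory manager.
      .config("spark.memory.fraction", "0.6")
      // Share of that region protected for cached (storage) blocks.
      .config("spark.memory.storageFraction", "0.5")
      // Let Tungsten allocate execution memory off-heap, bypassing the
      // JVM object model and garbage collector.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()

    spark.range(1000000).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```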