Improving Apache Spark’s Reliability with DataSourceV2


DataSourceV2 is Spark’s new API for working with data from tables and streams, but “v2” also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the DataFrame read and write APIs. This talk covers the context for those additional changes and how “v2” will make Spark more reliable and predictable for building enterprise data pipelines. The talk will include:
* Problem areas where the current behavior is unpredictable or unreliable
* The new standard SQL write plans (and the related SPIP)
* The new table catalog API and a new Scala API for table DDL operations (and the related SPIP)
* Netflix’s use case that motivated these changes


See More Spark + AI Summit in San Francisco 2019 Videos

About Ryan Blue

Ryan Blue works on open source projects, including Spark, Avro, and Parquet, at Netflix.