Data + AI Summit 2022

Coral and Transport: Portable SQL and UDFs for the Interoperability of Spark and Other Engines

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Lakes, Data Warehouses, and Data Lakehouses

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 160

Duration

  • 35 min

Overview

Big data compute infrastructure has grown continually, not only to keep pace with the scale of data applications, but also to accommodate a growing number of compute engines that together deliver value. As one of the most widely adopted engines in modern data lakes, Spark has evolved significantly on that front to generically support open data formats and catalogs. In this talk, we present two open source projects, Coral and Transport, that enable deep SQL and UDF interoperability between Spark and other engines, such as Trino and Hive.

Coral is a SQL analysis, rewrite, and translation engine that enables compute engines to interoperate and to analyze different SQL dialects and plans by converting them to a common relational algebraic intermediate representation, called Coral IR. Coral IR decouples the input representation from the output representation yet preserves query semantics, even when that requires manipulating syntax, operators, or operands.
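
To make the idea of a shared intermediate representation concrete, here is a minimal toy sketch in Python. It is not Coral's API or its actual IR: the node classes (`Column`, `Call`, `Project`), the dialect names `hive_like` and `trino_like`, and the operator mapping are all hypothetical and chosen only for illustration. It shows one engine-neutral plan being rendered into two dialects that differ in identifier quoting and in how an operator is spelled.

```python
from dataclasses import dataclass

# Hypothetical toy IR nodes; these are NOT Coral's classes.
@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class Call:
    op: str       # engine-neutral operator name
    args: tuple   # child expressions

@dataclass(frozen=True)
class Project:
    exprs: tuple  # projected expressions
    table: str    # source table name

# Illustrative per-dialect conventions (assumed for this sketch).
DIALECTS = {
    "hive_like": {"quote": "`", "ops": {"coalesce": "NVL"}},
    "trino_like": {"quote": '"', "ops": {"coalesce": "COALESCE"}},
}

def render_expr(expr, dialect):
    """Render one IR expression using the target dialect's conventions."""
    d = DIALECTS[dialect]
    if isinstance(expr, Column):
        q = d["quote"]
        return f"{q}{expr.name}{q}"
    if isinstance(expr, Call):
        args = ", ".join(render_expr(a, dialect) for a in expr.args)
        return f"{d['ops'][expr.op]}({args})"
    raise TypeError(f"unsupported IR node: {expr!r}")

def render(plan, dialect):
    """Render a Project node as a SELECT statement in the target dialect."""
    q = DIALECTS[dialect]["quote"]
    cols = ", ".join(render_expr(e, dialect) for e in plan.exprs)
    return f"SELECT {cols} FROM {q}{plan.table}{q}"

# One plan, two output dialects: the meaning stays fixed, the syntax changes.
plan = Project(
    exprs=(Call("coalesce", (Column("nickname"), Column("name"))),),
    table="members",
)
hive_sql = render(plan, "hive_like")
trino_sql = render(plan, "trino_like")
```

The plan itself never mentions a dialect; only the renderers do. That separation is the property the abstract attributes to Coral IR, although a real translator also has to handle parsing, type inference, and semantic differences that this sketch ignores.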

Transport is a UDF framework that enables users to write UDFs against a single API but execute them as native UDFs of multiple engines, such as Spark, Trino, and Hive. Given a Transport UDF, the Transport plugin automatically generates its native Spark UDF as well as the native UDFs for the other engines. When executed, the UDFs directly leverage the internal tuple representation of the corresponding engine.
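
The "write once, run as a native UDF per engine" pattern can be sketched in a few lines of Python. This is only an analogy, not the Transport API: the UDF `domain_of`, the two wrapper factories, and the tuple and dict row layouts are hypothetical stand-ins for engine-specific representations, and the real wrappers are generated by the Transport plugin rather than written by hand.

```python
# Hypothetical engine-neutral UDF, written once against plain values.
def domain_of(email: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return email.split("@", 1)[1].lower()

# Stand-in "engine A": rows are positional tuples (assumed layout).
def wrap_for_tuple_engine(udf, col_index):
    def native(row: tuple):
        # Read the argument straight from the engine's own row object.
        return udf(row[col_index])
    return native

# Stand-in "engine B": rows are name-keyed dicts (assumed layout).
def wrap_for_dict_engine(udf, col_name):
    def native(row: dict):
        return udf(row[col_name])
    return native

# The same UDF logic, exposed in each engine's native calling convention.
tuple_udf = wrap_for_tuple_engine(domain_of, col_index=1)
dict_udf = wrap_for_dict_engine(domain_of, col_name="email")

tuple_result = tuple_udf(("u1", "walaa@Example.com"))
dict_result = dict_udf({"id": "u2", "email": "wenye@example.com"})
```

Each wrapper reads directly from its engine's row object instead of first converting the row to a common format, which loosely mirrors the abstract's point that the generated UDFs work on each engine's internal tuple representation.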

Further, we discuss how LinkedIn leverages Coral and Transport, and present a production use case for accessing views of other engines in Spark as well as enhancing Spark DataFrame and Dataset view schema. We discuss other potential applications such as automatic data governance and data obfuscation, query optimization, materialized view selection, incremental compute, and data source SQL and UDF communication.

Session Speakers

Walaa Eldin Moustafa

Senior Staff Software Engineer

LinkedIn

Wenye Zhang

Senior Software Engineer

LinkedIn
