Coral and Transport: Portable SQL and UDFs for the Interoperability of Spark and Other Engines
- Data Lakes, Data Warehouses, and Data Lakehouses
- Moscone South | Upper Mezzanine | 160
- 35 min
Big data compute infrastructure has grown steadily, not only to keep pace with the scale of data applications but also to accommodate a growing number of compute engines that deliver value collectively. As one of the most widely adopted engines in modern data lakes, Spark has evolved significantly on that front to support open data formats and catalogs generically. In this talk, we present two open source projects, Coral and Transport, that enable deep SQL and UDF interoperability between Spark and other engines, such as Trino and Hive.
Coral is a SQL analysis, rewrite, and translation engine that enables compute engines to interoperate and to analyze different SQL dialects and plans by converting them to a common relational-algebraic intermediate representation called Coral IR. Coral IR decouples the input representation from the output while preserving query semantics, even when that requires manipulating syntax, operators, or operands.
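To make the idea of a shared intermediate representation concrete, here is a minimal Python sketch. It is not Coral's API or Coral IR itself: the `Column`, `Literal`, `RegexMatch`, and `Query` classes and the `emit` function are hypothetical names invented for illustration. It assumes only two well-known dialect differences: Spark quotes identifiers with backticks and writes regex matching as `RLIKE`, while Trino quotes identifiers with double quotes and uses `regexp_like(...)`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class RegexMatch:
    """Dialect-neutral operator: does `operand` match `pattern`?"""
    operand: "Expr"
    pattern: "Expr"


Expr = Union[Column, Literal, RegexMatch]


@dataclass(frozen=True)
class Query:
    """Toy IR node for: SELECT <columns> FROM <table> WHERE <predicate>."""
    columns: tuple
    table: str
    predicate: Expr


def emit(node, dialect):
    """Render a toy IR node as SQL text for the given dialect."""
    if dialect not in ("spark", "trino"):
        raise ValueError(f"unsupported dialect: {dialect}")
    quote = "`" if dialect == "spark" else '"'

    if isinstance(node, Column):
        return f"{quote}{node.name}{quote}"
    if isinstance(node, Literal):
        # Escape single quotes by doubling them.
        return "'" + node.value.replace("'", "''") + "'"
    if isinstance(node, RegexMatch):
        operand = emit(node.operand, dialect)
        pattern = emit(node.pattern, dialect)
        # Same semantics, different operator syntax per dialect.
        if dialect == "spark":
            return f"{operand} RLIKE {pattern}"
        return f"regexp_like({operand}, {pattern})"
    if isinstance(node, Query):
        cols = ", ".join(emit(c, dialect) for c in node.columns)
        table = f"{quote}{node.table}{quote}"
        return f"SELECT {cols} FROM {table} WHERE {emit(node.predicate, dialect)}"
    raise TypeError(f"unknown IR node: {node!r}")


# One IR instance, two dialect-specific renderings.
ir = Query(
    columns=(Column("id"), Column("name")),
    table="members",
    predicate=RegexMatch(Column("name"), Literal("^co")),
)
print(emit(ir, "spark"))  # SELECT `id`, `name` FROM `members` WHERE `name` RLIKE '^co'
print(emit(ir, "trino"))  # SELECT "id", "name" FROM "members" WHERE regexp_like("name", '^co')
```

The point of the sketch is the decoupling: the query is described once as a tree, and each output dialect is only a rendering of that tree. A real translation engine also needs a parser on the input side and handles types, function semantics, and plan-level rewrites, none of which is modeled here.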
Transport is a UDF framework that enables users to write UDFs against a single API but execute them as native UDFs of multiple engines, such as Spark, Trino, and Hive. Given a Transport UDF, the Transport plugin automatically generates its native Spark UDF as well as native UDFs for other engines. When executed, the UDFs operate directly on the internal tuple representation of the corresponding engine.
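The write-once, run-natively pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Transport's API: the `Record` interface, the two wrapper classes, and the `for_tuple_engine` and `for_dict_engine` adapters are hypothetical, and the two "engines" are stand-ins that store a row as a positional tuple and as a dictionary respectively.

```python
from abc import ABC, abstractmethod


class Record(ABC):
    """Hypothetical engine-neutral view over one row."""

    @abstractmethod
    def get(self, field):
        ...


class TupleRecord(Record):
    """Wraps a stand-in engine whose native rows are positional tuples."""

    def __init__(self, schema, row):
        self._schema = schema
        self._row = row  # held by reference, not copied or converted

    def get(self, field):
        return self._row[self._schema.index(field)]


class DictRecord(Record):
    """Wraps a stand-in engine whose native rows are dictionaries."""

    def __init__(self, row):
        self._row = row  # held by reference, not copied or converted

    def get(self, field):
        return self._row[field]


def full_name(record: Record) -> str:
    """The UDF logic, written once against the neutral interface."""
    return f"{record.get('first')} {record.get('last')}"


def for_tuple_engine(udf, schema):
    """Produce a function that accepts the tuple engine's native rows."""
    return lambda row: udf(TupleRecord(schema, row))


def for_dict_engine(udf):
    """Produce a function that accepts the dict engine's native rows."""
    return lambda row: udf(DictRecord(row))


tuple_udf = for_tuple_engine(full_name, schema=("first", "last"))
dict_udf = for_dict_engine(full_name)

print(tuple_udf(("Ada", "Lovelace")))                      # Ada Lovelace
print(dict_udf({"first": "Ada", "last": "Lovelace"}))      # Ada Lovelace
```

The wrappers read fields from the native row in place instead of converting it to a common format first, which mirrors the idea of working directly on each engine's internal representation. In this sketch the adapters are written by hand; the abstract describes them as generated automatically.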
Further, we discuss how LinkedIn leverages Coral and Transport, and present a production use case for accessing views of other engines in Spark as well as enhancing Spark DataFrame and Dataset view schemas. We also discuss other potential applications, such as automatic data governance and data obfuscation, query optimization, materialized view selection, incremental compute, and data source SQL and UDF communication.