FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames
- Data Engineering
- Moscone South | Upper Mezzanine | 160
- 35 min
To cater to users coming from a SQL background, Pandas and Spark have SQL interfaces that allow SQL lovers to manipulate DataFrames in their language of choice. Still, these interfaces are often insufficient for end-to-end workflows. An interface that did cover the full workflow would let SQL users move seamlessly between working on databases and working on flat files, removing the need to learn a specific framework just to work with data in memory.
In this talk, we'll introduce FugueSQL, a language that allows SQL lovers to express end-to-end computation workflows in their preferred tool. FugueSQL has additional keywords such as LOAD and SAVE that support ETL operations on flat files such as Parquet, Avro, and CSV. Because SQL as an interface is agnostic to frameworks, FugueSQL also allows scaling from Pandas to Spark without any code change.
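As a rough illustration of the LOAD and SAVE keywords and of switching engines without touching the query, here is a minimal sketch. It assumes the `fugue` package with its SQL extra is installed and that `fugue.api.fugue_sql_flow` behaves as described in the Fugue documentation. The file paths, column names, and the `run_pipeline` helper are hypothetical and not taken from the talk.

```python
# Sketch only: paths, columns, and the helper name are hypothetical.
# The query text follows FugueSQL syntax as documented by the Fugue project.
QUERY = """
df = LOAD "/tmp/events.parquet"

counts = SELECT user_id, COUNT(*) AS n_events
           FROM df
          GROUP BY user_id

SAVE counts OVERWRITE "/tmp/event_counts.parquet"
"""


def run_pipeline(engine=None):
    """Run QUERY on a Fugue execution engine.

    engine=None uses Fugue's default local engine; engine="spark"
    asks Fugue to run the same query on Spark.
    """
    # Imported lazily so this module still loads where Fugue is absent.
    # Assumes: pip install "fugue[sql]"
    from fugue.api import fugue_sql_flow

    return fugue_sql_flow(QUERY).run(engine)
```

Under these assumptions, calling `run_pipeline()` would run locally and `run_pipeline("spark")` would run the unchanged query on Spark, provided the input file exists and the Spark backend is installed.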
To elevate FugueSQL to a first-class interface for working with data, enhancements such as variable assignment, Jinja templating, and Python interoperability have been added, so Python functions can be invoked inside predominantly SQL code. FugueSQL is also meant to take advantage of the Spark engine to handle big data. It therefore supports operations relevant to distributed computing, such as PARTITION, PERSIST, and BROADCAST. In this talk, we will show the functionality of FugueSQL for end-to-end Extract, Transform, Load (ETL) pipelines with the Spark engine on Databricks.