SESSION

Building a Multimodal Data Lakehouse with the Daft Distributed Python Dataframe


OVERVIEW

EXPERIENCE: In Person
TYPE: Lightning Talk
TRACK: Data Engineering and Streaming
TECHNOLOGIES: Developer Experience, ETL, Orchestration
SKILL LEVEL: Intermediate
DURATION: 20 min

Modern data workloads come in all shapes and sizes: numbers, strings, JSON, images, whole PDF textbooks, and more. To process this data we still rely on single-purpose utilities such as ffmpeg for video, jq for JSON, and PyTorch for tensors.

However, these tools were not built for large-scale ETL, so we often end up building bespoke data pipelines that orchestrate data movement around custom tooling. If only downloading images, resizing them, and running vision models were as simple as extracting a substring in SparkSQL!

Daft (https://www.getdaft.io) is a next-generation distributed query engine built on Python and Rust. It provides a familiar dataframe interface for easy and performant processing of multimodal data at scale. Join us as we demonstrate how to build a multimodal data lakehouse using Daft on your existing infrastructure (S3, Delta Lake, Databricks, and Spark).
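
As a taste of that dataframe interface, here is a minimal sketch of an image pipeline using Daft's expression API; the S3 path and column names are placeholders, not materials from the session:

import daft
from daft import col

# Read a table of image URLs from Parquet on S3
# (placeholder bucket and column names).
df = daft.read_parquet("s3://my-bucket/image_metadata.parquet")

# Download, decode, and resize the images. Each step is an ordinary
# dataframe expression, so Daft can distribute and pipeline the work.
df = (
    df.with_column("image_bytes", col("url").url.download())
      .with_column("image", col("image_bytes").image.decode())
      .with_column("thumbnail", col("image").image.resize(224, 224))
)

df.show(3)

From there, running a vision model over the thumbnails would be a matter of applying a Python UDF to the image column, with the results written back out to Parquet or Delta Lake.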

SESSION SPEAKERS

Jay Chia

Co-Founder, Eventual Computing