Thousands of data architects, engineers, and scientists met at Data + AI Summit in San Francisco to hear from industry luminaries like Fei-Fei Li and Yejin Choi, attend sessions on everything from building a custom LLM to preparing for Apache Spark™ 4, explore the latest in Databricks, and ultimately learn how to accelerate efforts to deploy data intelligence across their businesses.
Every day provided opportunities to improve existing skills, get introduced to something new, and gain the knowledge your business needs to thrive in the GenAI era. In fact, for many of the attendees, the challenge becomes making time for all the sessions they want to attend.
Whether you missed sessions in person or are just now attending virtually, the great news is that you can now watch all 500+ sessions (and the full keynote) on-demand! Below, I’m calling out some specific sessions for data architects, data engineers, and data scientists that I think are worth a watch!
Today, analytics and AI workloads are split across too many different environments. It becomes impossible for data architects to properly manage the underlying infrastructure. It’s one reason why so many companies are looking to consolidate. These sessions showcase why the Lakehouse is the unified platform enterprises need to unleash data intelligence across their businesses while ensuring the right security and governance throughout their data landscape.
Delta Lake Meets DuckDB via Delta Kernel
Speakers: Nick Lanham
Over the past few years, delta-rs has grown rapidly. And now, with delta-kernel-rs, it’s even easier for Rust and Python users to create connections. This session will cover how to bring Delta support to the open source analytical database DuckDB. It will discuss how the support works, the architecture of the integration, and lessons learned along the way.
Deep Dive into Delta Lake and UniForm on Databricks
Speakers: Joe Widen, Michelle Leon
This is a beginner’s guide to everything Delta Lake, a powerful open-source storage layer that brings reliability, performance, governance, and quality to existing data lakes. This session will provide an overview of Delta Lake, including how it’s built for both streaming and batch use cases, explain the power of Delta Lake and Unity Catalog together, and highlight innovative use cases of Delta Lake across different sectors. Attendees will also learn about Delta UniForm, a tool that makes it easy for developers to work across other lakehouse formats including Apache Iceberg and Apache Hudi.
Dependency Management in Spark Connect: Simple, Isolated, Powerful
Speakers: Hyukjin Kwon, Akhil Gudesa
Managing an application hosted in a distributed computing environment can be challenging. Ensuring that all nodes have the necessary environment to execute code and determining the actual location of the user's code are complex tasks, significantly more so when dynamic support is required. This session will cover how Spark Connect can simplify the management of a distributed computing environment. Through practical and comprehensive examples, attendees will learn how to create, package, utilize, and update custom isolated environments, ensuring flexible and seamless execution for both Python and Scala applications.
Fast, Cheap, and Easy Data Ingestion with AWS Lambda and Delta Lake
Speakers: R. Tyler Croy
Join R. Tyler Croy, one of the creators of Delta Rust, to learn how to work with Delta tables from AWS Lambda. Using the native Python or Rust libraries for Delta Lake, you'll learn to explore the transaction log, write updates, perform table maintenance, and even query Delta tables in milliseconds from AWS Lambda.
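To make "explore the transaction log" a little more concrete, here is a minimal sketch that uses only the Python standard library. It relies on the Delta Lake layout in which each commit is a newline-delimited JSON file under `_delta_log/`, and it replays `add` and `remove` actions to work out which data files are currently live. The names `active_files`, `write_commit`, and `demo`, along with the sample file names, are hypothetical and exist only for illustration. The sketch ignores checkpoints and all other action types, so in a real Lambda you would reach for the Delta Lake Python or Rust libraries the session covers.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data file paths."""
    log_dir = Path(table_path) / "_delta_log"
    live = set()
    # Commit files are named with zero-padded versions, so name order is version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


def write_commit(log_dir, version, actions):
    """Demo helper (hypothetical): write one commit file, one JSON action per line."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def demo():
    """Build a tiny two-commit log in a temp directory and replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}},
                                  {"add": {"path": "part-1.parquet"}}])
        write_commit(log_dir, 1, [{"remove": {"path": "part-0.parquet"}},
                                  {"add": {"path": "part-2.parquet"}}])
        return active_files(tmp)


print(demo())
```

The second commit removes one file and adds another, so the replay reports only the two files that a reader of the latest table version would scan.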
Let's Do Some Data Engineering With Rust and Delta Lake!
Speakers: R. Tyler Croy
The future of data engineering is looking increasingly Rust-y. By adopting the foundational crates of Delta Lake, DataFusion, and Arrow, developers can write high-performance and low-cost ingestion pipelines, transformation jobs, and data query applications. Don’t know Rust? No problem. You’ll review fundamental concepts of the language as they pertain to the data engineering domain with a co-creator of Delta Rust and leave with a basis to apply Rust to real-world data problems.
What's Wrong with the Medallion Architecture?
Speakers: Simon Whiteley
While enterprises are reaping the benefits of the lakehouse architecture, many have one regret: layering their zones. No one really knows what terms like “silver” vs. “gold” mean. The reality is that Medallion architecture may not always be the best option. Using real-world examples, this session will dive into when and how to use it.
In businesses today, speed is paramount. Leaders want access to information immediately. That’s putting more pressure on the individuals tasked with managing and optimizing streaming ETL pipelines. These sessions help data engineers deliver on the promise of real-time analytics and AI.
Delta Live Tables in Depth: Best Practices for Intelligent Data Pipelines
Speakers: Michael Armbrust, Paul Lappas
Learn how to master Delta Live Tables from one of the people who know it best. The original creator of Spark SQL, Structured Streaming, and Delta, Michael Armbrust will get attendees up to speed on what’s new with DLT and what’s coming. (Spoiler alert: Some BIG news.)
Effective Lakehouse Streaming with Delta Lake and Friends
Speakers: Scott Haines, Ashok Singamaneni
In this session, attendees discover the true power of the streaming lakehouse architecture, how to achieve success at scale, and, more importantly, why Delta Lake is the key to unlocking a consistent data foundation and empowering a "stress-free" data ecosystem.
Stranger Triumphs: Automating Spark Upgrades & Migrations at Netflix
Speakers: Holden Karau, Robert Merck
Apache Spark™ 4 is on the horizon. So what’s involved in upgrading to the latest and greatest Spark? Learn how Netflix automated large parts of its upgrade and how you can use the techniques for your data platform. In this session, you will learn how to: upgrade your Spark pipelines without crying and validate Spark pipelines even when you don't trust the tests.
Introducing the New Python Data Source API for Apache Spark™
Speakers: Allison Wang, Ryan Nienhuis
Traditionally, integrating custom data sources into Spark required understanding Scala, posing a challenge for the vast Python community. Our new API simplifies this process, allowing developers to implement custom data sources directly in Python without the complexities of existing APIs. This session will explore the motivations and the code behind how we’ve made reading and writing operations for Python developers much easier.
Incremental Change Data Capture: A Data-Informed Journey
Speakers: Christina Taylor
Learn how to iterate on incremental ingestion from SaaS applications, relational databases, and event streams into a centralized data lake, the role of change data capture (CDC), and how to ultimately streamline maintenance and improve reliability with Delta Lake. Attendees will walk away with a data-informed mentality to design architecture that promotes long-term stewardship and developer happiness.
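As a rough illustration of incremental change data capture, the sketch below applies a batch of change events to an in-memory table and keeps a high-water mark so that replaying the same batch has no effect. This is not the approach from the session: the event shape (`seq`, `op`, `id`, `row`) and the function name `apply_changes` are assumptions made for the example, and a plain dictionary stands in for the target table. With Delta Lake, this kind of upsert-and-delete logic is typically expressed as a MERGE.

```python
def apply_changes(target, changes, last_seq=0):
    """Apply CDC events newer than last_seq to target; return the new high-water mark."""
    # Skip anything already applied, then process the rest in sequence order.
    pending = sorted((c for c in changes if c["seq"] > last_seq),
                     key=lambda c: c["seq"])
    for change in pending:
        key = change["id"]
        if change["op"] == "delete":
            target.pop(key, None)
        elif change["op"] in ("insert", "update"):
            target[key] = change["row"]
        else:
            raise ValueError(f"unknown op: {change['op']}")
        last_seq = change["seq"]
    return last_seq


# Hypothetical batch, deliberately out of order to show that sequencing matters.
batch = [
    {"seq": 2, "op": "update", "id": 1, "row": {"name": "Ada L."}},
    {"seq": 1, "op": "insert", "id": 1, "row": {"name": "Ada"}},
    {"seq": 3, "op": "insert", "id": 2, "row": {"name": "Bob"}},
    {"seq": 4, "op": "delete", "id": 2},
]

table = {}
watermark = apply_changes(table, batch)
# Replaying the same batch with the stored watermark changes nothing.
replayed = apply_changes(table, batch, watermark)
print(table, watermark, replayed)
```

Storing the watermark alongside the table is one simple way to make reruns safe after a failed or duplicated delivery.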
What’s next for the upcoming Apache Spark™ 4.0
Speakers: Xiao Li, Wenchen Fan
The upcoming release of Apache Spark 4.0 delivers substantial enhancements that refine the functionality and augment the developer experience with the unified analytics engine. This is your chance to ask the experts what’s coming and how to prepare.
GenAI is inescapable. Every business is figuring out how to develop and deploy LLMs. For those actually making AI and ML a reality, these sessions help keep you up to date on the latest techniques for improving and accelerating your GenAI strategy.
Software 2.0: Shipping LLMs with New Knowledge
Speakers: Sharon Zhou
Increasingly, companies want to take existing LLMs and teach them new knowledge to differentiate the technology. This process goes beyond just prompting or retrieving: it also involves instruction-finetuning, content-finetuning, pretraining, and more. In this session, you'll learn about Lamini, an all-in-one LLM stack that makes LLMs less picky about the data they can learn from, making it easy for LLMs to take in billions of new documents.
Exploring MLOps and LLMOps: Architectures and Best Practices
Speakers: Joseph Bradley, Yinxi Zhang, Arpit Jasapara
This session offers a detailed look at the architectures involved in Machine Learning Operations (MLOps) and Large Language Model Operations (LLMOps). Attendees will learn about the technical specifics and practical applications of MLOps and LLMOps, including the key components and workflows that define these fields. And they’ll walk away with strategies for implementing effective MLOps and LLMOps in their own projects.
In the Trenches with DBRX: Building a State-of-the-Art Open-Source Model
Speakers: Jonathan Frankle, Abhinav Venigalla
Want the behind-the-scenes story on how we built DBRX, a cutting-edge, open-source foundation model trained in-house by Databricks? Hear from the people who built it about the tools, methods, and lessons learned during the development process. Attendees will get an inside look at what it takes to train a high-quality LLM, hear why we chose a Mixture-of-Experts architecture, and learn how they can use the same tools and techniques to build their own custom models.
Introduction to DBRX and other Databricks Foundation Models
Speakers: Margaret Qian, Hagay Lupesko
This session offers a comprehensive introduction to DBRX and other foundation models available on Databricks. Attendees will get practical guidance on how to leverage these models to enhance data analytics and machine learning projects. And they’ll leave with a clear understanding of how to effectively utilize Databricks' foundation models to drive innovation and efficiency in their data-driven initiatives.
Layered Intelligence: Generative AI Meets Classical Decision Sciences
Speakers: Danielle Heymann
The session will explore how Generative AI, especially LLMs, integrates into classical decision science methodologies. Attendees will learn how LLMs extend beyond chatbots to enhance optimization algorithms, statistical models, and graph analytics—breathing new life into decision sciences and advancing strategic analytics and decision-making. This layered approach brings a new edge to traditional methods, allowing for complex problem-solving, nuanced data interaction, and improved interpretability.
Building Production RAG Over Complex Documents
Speakers: Jerry Liu
RAG is a powerful technique that enables enterprises to further customize existing LLMs on their own data. However, building production RAG is very challenging, especially as users scale to larger and more complex data sources. RAG is only as good as your data, and developers must carefully consider how to parse, ingest, and retrieve their data to successfully build RAG over complex documents. This session provides an in-depth exploration of this entire process.
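To give a feel for the parse, ingest, and retrieve steps mentioned above, here is a toy retrieval sketch in plain Python. It is not the method from the session: it swaps in simple bag-of-words cosine similarity where a production system would normally use learned embeddings and a vector store, and it stops at retrieval without calling an LLM. All function names and the two sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Parse: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text, size=12):
    """Split a document into pieces of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs, size=12):
    """Ingest: store (source name, chunk text, term counts) for every chunk."""
    return [(name, piece, Counter(tokenize(piece)))
            for name, text in docs.items()
            for piece in chunk(text, size)]


def retrieve(index, query, k=1):
    """Retrieve: return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(name, piece) for name, piece, _ in ranked[:k]]


# Hypothetical sample documents.
docs = {
    "refunds.txt": "Refunds are issued within 14 days of a return request.",
    "shipping.txt": "Standard shipping takes 5 business days within the country.",
}
index = build_index(docs)
top = retrieve(index, "How long do refunds take?")
print(top)
```

Even in this toy version, retrieval quality depends on how documents are split and tokenized, which is the "RAG is only as good as your data" point in miniature.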
SEA-LION: Representing the Diverse Languages of Southeast Asia with LLMs
Speakers: Jeanne Choo, Ngee Chia Tai
Southeast Asia is one of the world's most culturally diverse regions, covering countries such as Singapore, Vietnam, Thailand, and Indonesia. People speak multiple languages and draw cultural influences from China, India and the West. Learn how, working with Databricks MosaicML, the Singapore government built SEA-LION, an open-sourced large language model trained on local languages such as Thai, Indonesian and Tamil.
State-Of-The-Art Retrieval Augmented Generation At Scale In Spark NLP
Speakers: David Talby, Veysel Kocaman
Get a crash course in scaling and building RAG LLM pipelines for production. Current systems struggle to efficiently handle the jump from proof-of-concept to production. This session will show how to address scaling issues with the open source Spark NLP library.
Check out all the Data + AI Summit sessions and keynotes here!
