Watch all keynotes from SPARK + AI SUMMIT 2020 North America.
Speakers include: Ali Ghodsi, Matei Zaharia, Brooke Wenig, Reynold Xin, Vish Subramanian, Dr. Phillip Atiba Goff, Prof. Jennifer Chayes, Nate Silver, Clemens Mewald, Lauren Richie, Sue Ann Hong, Rohan Kumar, Sarah Bird, Anurag Sehgal, Kim Hazelwood, Hany Farid, Adam Paszke, Amy Heineike.

Realizing the Vision of the Data Lakehouse

Ali Ghodsi, Co-founder & CEO, Original Creator of Apache Spark, Databricks

Data warehouses have a long history in decision support and business intelligence applications. But data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led organizations to build data lakes of raw data about a decade ago. But data lakes also lacked important capabilities. The need for a better solution has given rise to the data lakehouse, which implements data structures and data management features similar to those in a data warehouse, directly on the kind of low-cost storage used for data lakes.

This keynote by Databricks CEO, Ali Ghodsi, explains why the open source Delta Lake project takes the industry closer to realizing the full potential of the data lakehouse, including new capabilities within the Databricks Unified Data Analytics platform to significantly accelerate performance. In addition, Ali will announce new open source capabilities to collaboratively run SQL queries against your data lake, build live dashboards, and alert on important changes to make it easier for all data teams to analyze and understand their data.

Introducing Apache Spark 3.0: A Retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come

Matei Zaharia, Assistant Professor of Computer Science; Original Creator of Apache Spark & MLflow, Databricks

In this keynote from Matei Zaharia, the original creator of Apache Spark, we will highlight major community developments with the release of Apache Spark 3.0 to make Spark easier to use, faster, and compatible with more data sources and runtime environments. Apache Spark 3.0 continues the project’s original goal to make data processing more accessible through major improvements to the SQL and Python APIs and automatic tuning and optimization features to minimize manual configuration. This year is also the 10-year anniversary of Spark’s initial open source release, and we’ll reflect on how the project and its user base have grown, as well as how the ecosystem around Spark (e.g. Koalas, Delta Lake and visualization tools) is evolving to make large-scale data processing simpler and more powerful.

Delta Engine: High Performance Query Engine for Delta Lake

Reynold Xin, Co-founder & Chief Architect, Top Contributor & Original Creator of Apache Spark, Databricks

Redash on Databricks

Arik Fraimovich, Founder, Redash

How Starbucks is Achieving its 'Enterprise Data Mission' to Enable Data and ML at Scale and Provide World-class Customer Experiences

Vish Subramanian, Director of Data and Analytics Engineering, Starbucks

Starbucks makes sure that everything we do is through the lens of humanity – from our commitment to the highest quality coffee in the world to the way we engage with our customers and communities to do business responsibly. A key aspect of ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks that helps to make decisions powered by data at a tremendous scale. This includes everything from processing data at petabyte scale with governed processes to deploying platforms at the speed of business and enabling ML across the enterprise. This session will detail how Starbucks has built world-class Enterprise data platforms to drive world-class customer experiences.

Racism and Policing: The Path Forward

Speaker: Dr. Phillip Atiba Goff

Dr. Goff conducts work exploring the ways in which racial prejudice is not a necessary precondition for racial discrimination. That is, despite the normative view of racial discrimination—that it stems from prejudiced explicit or implicit attitudes—his research demonstrates that situational factors facilitate racially unequal outcomes.

Dr. Goff’s model of evidence-based approaches to justice has been supported by the National Science Foundation, Department of Justice, Russell Sage Foundation, W.K. Kellogg Foundation, Open Society Foundations, Open Society Institute-Baltimore, Atlantic Philanthropies, William T. Grant Foundation, the COPS Office, the Major Cities Chiefs Association, the NAACP LDF, NIMH, SPSSI, the Woodrow Wilson Foundation, the Ford Foundation, and the Mellon Foundation among others. Dr. Goff was a witness for the President’s Task Force on 21st Century Policing and has presented before Members of Congress and Congressional Panels, Senate Press Briefings, and White House Advisory Councils.

Rapid Response Research for COVID-19 and Other Challenges: Machine Learning and Data Science at Cal

Speaker: Prof. Jennifer Chayes

The Division of Computing, Data Science, and Society (CDSS) at UC Berkeley is advancing foundational research and educating the next generation of scientists and practitioners to leverage computing and data to take on pressing societal problems. In recent history, no societal challenge has been as far-reaching and critical as the COVID-19 pandemic. Solutions to this complex, global challenge will stress many aspects of computing and data science, from analysis of sparse, biased, and variable data; to simulation of large networks of human interaction; to sifting through biological and chemical data to find treatments and vaccines; to influencing both policy makers and public opinion more broadly.

In this talk, I will describe the overall vision of CDSS and how it is transforming education and research at UC Berkeley, building bridges across a diverse set of programs, and disrupting the traditional siloed university structure. The emergence of the COVID-19 pandemic has accelerated ramp-up of this new Division and the interdisciplinary research and collaboration it fosters. It also has highlighted the importance of delivering inclusive, rigorous data science education at scale, a hallmark of the Berkeley program. I will draw on examples from across campus of how computing and data are being used to address the pandemic, and how these challenges will stress the scale, performance, privacy, and resilience of the underlying data systems, driving a next generation of requirements for systems like Spark.

The Signal and the Noise: the Big Lessons from 20 Years of Data Analysis

Speaker: Nate Silver

In this technical keynote, Nate will highlight his biggest lessons from the past 20 years of data analysis and how they relate to his methodology for building the election model and to the challenges of forecasting.

Introducing the Next Generation Data Science Workspace

Speakers: Ali Ghodsi, Clemens Mewald and Lauren Richie

It is no longer a secret that data-driven insights and decision making are essential in any company’s strategy to keep up with today’s rapid pace of change and remain relevant. Although we take this realization for granted, we are still in the very early stages of enabling data teams to deliver on their promise. One of the reasons is that we haven’t equipped this profession with the modern toolkit it deserves.

Existing solutions leave data teams with impossible trade-offs. Giving Data Scientists the freedom to use any open source tools on their laptops doesn’t provide a clear path to production and governance. Simply hosting those same tools in the Cloud may solve some of the data privacy and security issues, but doesn’t improve productivity or collaboration. On the other hand, most robust and scalable production environments hinder innovation and experimentation by slowing Data Scientists down.

In this talk, we will unveil the next generation of the Databricks Data Science Workspace: An open and unified experience for modern data teams specifically designed to address these hard tradeoffs. We will introduce new features that leverage the open source tools you are familiar with to give you a laptop-like experience that provides the flexibility to experiment and the robustness to create reliable and reproducible production solutions.

Simplifying Model Development and Management with MLflow

Speakers: Matei Zaharia and Sue Ann Hong

As organizations continue to develop their machine learning (ML) practice, the need for robust and reliable platforms capable of handling the entire ML lifecycle is becoming crucial for successful outcomes. Building models is difficult enough to do once, but deploying them into production in a reproducible, agile, and predictable way is exponentially harder due to the dependencies on parameters, environments, and the ever changing nature of data and business needs.

Introduced by Databricks in 2018, MLflow is the most widely used open source platform for managing the full ML lifecycle. With over 2 million PyPI downloads a month and over 200 contributors, the growing support from the developer community demonstrates the need for an open source approach to standardize tools, processes, and frameworks involved throughout the ML lifecycle. MLflow significantly simplifies the complex process of standardizing MLOps and productionizing ML models. In this talk, we’ll cover what’s new in MLflow, including simplified experiment tracking, new innovations to the model format to improve portability, new features to manage and compare model schemas, and new capabilities for deploying models faster.

Responsible ML – Bringing Accountability to Data Science

Speakers: Rohan Kumar and Sarah Bird

Responsible ML is the most talked-about field in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar as he talks about how Microsoft brings cutting-edge research into the hands of customers to make them more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shaping the future of Responsible ML.

How Credit Suisse Is Leveraging Open Source Data and AI Platforms to Drive Digital Transformation, Innovation and Growth

Speaker: Anurag Sehgal

Despite the increasing embrace of big data and AI, most financial services companies still experience significant challenges around data types, privacy, and scale. Credit Suisse is overcoming these obstacles by standardizing on open, cloud-based platforms, including Azure Databricks, to increase the speed and scale of operations, and the democratization of ML across the organization. Now, Credit Suisse is leading the way by successfully employing data and analytics to drive digital transformation, delivering new products to market faster, and driving business growth and operational efficiency.

Deep Learning: It’s Not All About Recognizing Cats and Dogs

Speaker: Kim Hazelwood

Based on a recent blog post and paper, this talk focuses on the fact that recommendation systems tend to receive too little investment from the research community as a whole, and on why that’s problematic.

Creating, Weaponizing, and Detecting Deep Fakes

Speaker: Hany Farid

The past few years have seen a startling and troubling rise in the fake-news phenomenon, in which everyone from individuals to nation-sponsored entities can produce and distribute misinformation. The implications of fake news range from a misinformed public to an existential threat to democracy, and horrific violence. At the same time, recent and rapid advances in machine learning are making it easier than ever to create sophisticated and compelling fake images, videos, and audio recordings, making the fake-news phenomenon even more powerful and dangerous. I will provide an overview of the creation of these so-called deep-fakes, and I will describe emerging techniques for detecting them.

PyTorch: A Modern Machine Learning Research and Production Platform

Speaker: Adam Paszke

Over the past two years, PyTorch has become one of the most popular libraries used in machine learning research, with many groundbreaking advancements appearing alongside their PyTorch implementations immediately. Unfortunately, adoption within the industry has been rather slow compared to the research community, and so one of the overarching goals of current development is enabling easier transfer of ideas from academia to industry. This includes enabling easy model packaging and export, simple mobile deployments and Python-free execution – all while retaining the laser focus on great user experience. In this talk, I’ll cover the fundamental ideas behind the library, highlight recent advancements, present exciting upcoming features and talk about a few success stories to showcase the progress that has been made so far.

Science vs Covid, Lessons From

Speaker: Amy Heineike

The exponential growth of scientific research about the novel coronavirus is one of the truly inspiring and hope-filled stories of this crisis – but it’s also a story of overwhelming data volume. AI has a crucial role to play in making information accessible and putting it in context. We built to connect the research to the news and social conversations about it, and to discover trends and highlight commentary. What have we learned so far, and what comes next?