Tim Hunter is a senior AI specialist at ABN AMRO Bank. He was an early software engineer at Databricks, has contributed to the Apache Spark MLlib project, and has co-created the Koalas, GraphFrames, TensorFrames, and Deep Learning Pipelines libraries. He holds a Ph.D. in Machine Learning from UC Berkeley and has been building distributed Machine Learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.
Software engineering has evolved around best practices such as versioning code, managing dependencies, and working in feature branches. However, these best practices have not translated to data science. Data scientists who update a stage of their ML pipeline need to understand the cascading effects of their change, so that downstream dependencies do not end up with stale data and the entire pipeline is not unnecessarily rerun end-to-end. When data scientists collaborate, they should be able to reuse the intermediate results of their colleagues instead of computing everything from scratch.
This presentation shows how to treat data like code through the concept of Data-Driven Software (DDS). This concept, implemented as a lightweight and easy-to-use Python package, solves all the issues mentioned above for single-user and collaborative data pipelines, and it integrates fully with a lakehouse architecture such as Databricks. In effect, it allows data engineers and data scientists to go YOLO: you only load your data once, and you never recalculate existing pieces.
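To give a flavor of the idea (this is a minimal illustrative sketch, not the actual API of the DDS package), a DDS-style cache can key each pipeline stage on the code of the function and its inputs, so a stage is only recomputed when something it depends on actually changes. The cache directory and function names below are hypothetical:

```python
import hashlib
import inspect
import pickle
from pathlib import Path

CACHE_DIR = Path("/tmp/dds_cache")  # hypothetical local store

def data_function(func):
    """Illustrative stand-in for DDS-style caching: recompute only when
    the function's source code or its arguments change."""
    def wrapper(*args, **kwargs):
        # Fingerprint the code and the inputs together.
        key_material = (inspect.getsource(func), args, sorted(kwargs.items()))
        key = hashlib.sha256(pickle.dumps(key_material)).hexdigest()
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        blob = CACHE_DIR / key
        if blob.exists():
            # You only load your data once: reuse the stored result.
            return pickle.loads(blob.read_bytes())
        result = func(*args, **kwargs)
        blob.write_bytes(pickle.dumps(result))
        return result
    return wrapper

@data_function
def clean_transactions(raw_path: str):
    # Expensive transformation; reruns only if this code or raw_path changes.
    ...
```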
Through live demonstrations leveraging DDS, you will see how data science teams can:
In this tutorial we will present Koalas, a new open-source project that we announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, making the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas' new functionalities since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
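As a taste of what the tutorial demonstrates, the short sketch below uses the Koalas package with pandas-style syntax running on Spark; the file path and column names are placeholders:

```python
import databricks.koalas as ks

# Read a (potentially very large) CSV into a Koalas DataFrame backed by Spark.
df = ks.read_csv("/data/events.csv")  # placeholder path

# Familiar pandas-style operations, executed in a distributed fashion by Spark.
df["amount_eur"] = df["amount_usd"] * 0.9            # hypothetical columns
summary = df.groupby("country")["amount_eur"].sum()
print(summary.sort_values(ascending=False).head(10))
```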
What you will learn:
Prerequisites:
October 15, 2019 05:00 PM PT
In this talk, we present Koalas, a new open-source project that aims to bridge the gap between big data and small data for data scientists, and to simplify Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists use to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
When data scientists work with very large data sets today, they either have to migrate to PySpark to leverage Spark, or downsample their data so that they can use pandas. This presentation gives a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas
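As a hedged illustration of the interoperability discussed above, the sketch below moves a data set between the pandas, Koalas, and Spark dataframe representations; the data and column names are placeholders:

```python
import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start from a small pandas DataFrame (placeholder data).
pdf = pd.DataFrame({"city": ["Amsterdam", "Utrecht"], "temp": [18.0, 17.5]})

# pandas -> Koalas: the data now lives in Spark, but keeps the pandas API.
kdf = ks.from_pandas(pdf)

# Koalas -> Spark and back, when a Spark-specific API is needed.
sdf = kdf.to_spark()
kdf2 = sdf.to_koalas()

# Koalas -> pandas: collects the (aggregated) result back to the driver.
result = kdf2.groupby("city")["temp"].mean().to_pandas()
print(result)
```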
April 23, 2019 05:00 PM PT
Deep Learning is now the standard in object detection, but it is not easy to analyze large collections of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code, thanks to recent developments in Apache Spark. Building on Spark's unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high-level information such as houses or buildings. This is made possible by Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale with the power of Spark.
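As an illustration of the Deep Learning Pipelines part of the workflow (a sketch assuming Spark 2.3+ and the sparkdl package; the tile paths and labels are placeholders, and the Magellan geospatial queries are not shown), transfer learning on image tiles can look roughly like this:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer

# Load labeled satellite image tiles (placeholder paths and labels).
houses = ImageSchema.readImages("/data/tiles/houses").withColumn("label", lit(1))
other = ImageSchema.readImages("/data/tiles/other").withColumn("label", lit(0))
train = houses.union(other)

# Transfer learning: a pre-trained network extracts features, and a simple
# classifier learns "house vs. not house" on top of them.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=10, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train)
```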
October 2, 2018 05:00 PM PT
Data is the key ingredient for building high-quality, production AI applications. It comes in during the training phase, where more and higher-quality training data enables better models, and during the production phase, where understanding the model's behavior and detecting changes in the predictions and input data are critical to maintaining the application. However, so far most data management and machine learning tools have been largely separate.
In this presentation, we'll talk about several efforts from Databricks, in Apache Spark as well as in other open source projects, to unify data and AI in order to make it significantly simpler to build production AI applications.
Session hashtag: #SAISAI2
October 3, 2018 05:00 PM PT
Deep Learning is now the standard in object detection, but it is not easy to analyze large collections of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code, thanks to recent developments in Apache Spark. Building on Spark's unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high-level information such as houses or buildings. This is made possible by Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale with the power of Spark.
Session hashtag: #SAISDL1
2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order-of-magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.
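To make the "high-level API" point concrete, here is the canonical Structured Streaming word count, written with the same DataFrame operations used for batch queries (the socket source, host, and port are placeholders for any streaming source):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Treat an unbounded stream of text lines as a continuously growing table.
lines = (spark.readStream
              .format("socket")            # placeholder source for the example
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# The same DataFrame operations used in batch queries.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously update the result as new data arrives.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```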
October 24, 2017 05:00 PM PT
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams, and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted with the need for rapid iteration. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole-program checks, auto-caching, and aggressive computation parallelization and reuse.
Session hashtag: #EUdev1
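The experimental framework itself is not shown here, but the toy sketch below (plain Python, not the framework's API) illustrates the two ideas the abstract relies on: building the whole pipeline lazily as a graph before running anything, and reusing cached results keyed by the structure of the computation.

```python
import hashlib

class Node:
    """A lazy computation node: records what to compute, not the result."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, tuple(deps)

    def signature(self):
        # The whole upstream pipeline determines the cache key.
        parts = [self.name] + [d.signature() for d in self.deps]
        return hashlib.sha256("|".join(parts).encode()).hexdigest()

_cache = {}

def evaluate(node):
    """Run the pipeline, reusing any subgraph that was computed before."""
    key = node.signature()
    if key not in _cache:
        inputs = [evaluate(d) for d in node.deps]
        _cache[key] = node.fn(*inputs)
    return _cache[key]

# A tiny three-step pipeline (placeholder transformations).
load = Node("load", lambda: list(range(10)))
clean = Node("clean", lambda xs: [x for x in xs if x % 2 == 0], [load])
total = Node("total", sum, [clean])

print(evaluate(total))  # 20; rerunning reuses all cached intermediate results
```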