Niall Turbitt is a Senior Data Scientist on the Machine Learning Practice team at Databricks. Working with Databricks customers, he builds and deploys machine learning solutions, as well as delivers training classes focused on machine learning with Spark. He received his MS in Statistics from University College Dublin and has previous experience building scalable data science solutions across a range of domains, from e-commerce to supply chain and logistics.
Deploying machine learning models has become a relatively frictionless process. However, properly deploying a model with a robust testing and monitoring framework is a vastly more complex task. There is no one-size-fits-all solution for productionizing ML models; it often requires custom implementations utilizing multiple libraries and tools. There is, however, a set of core statistical tests and metrics one should have in place to detect phenomena such as data and concept drift, preventing models from becoming unknowingly stale and detrimental to the business.
Drawing on our experience working with Databricks customers, we take a deep dive into how to test your ML models in production using open source tools such as MLflow, SciPy, and statsmodels. You will come away from this talk armed with knowledge of the key tenets for testing both model and data validity in production, along with a generalizable demo that uses MLflow to assist with the reproducibility of this process.
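To make this concrete, here is a minimal sketch of one such statistical check: a two-sample Kolmogorov-Smirnov test from SciPy flagging drift in a single numeric feature, with the result logged to MLflow. The feature data, run name, and 0.05 threshold are illustrative assumptions, not material from the talk itself.

```python
# Minimal sketch: detecting data drift on one numeric feature with a
# two-sample Kolmogorov-Smirnov test (SciPy), logging results to MLflow.
# The data, run name, and threshold below are illustrative assumptions.
import numpy as np
import mlflow
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # training-time feature values
incoming = rng.normal(loc=0.3, scale=1.0, size=1_000)   # production feature values

# KS test compares the two empirical distributions; a small p-value
# suggests the production distribution has drifted from training.
statistic, p_value = stats.ks_2samp(reference, incoming)

with mlflow.start_run(run_name="drift_check"):
    mlflow.log_metric("ks_statistic", statistic)
    mlflow.log_metric("ks_p_value", p_value)
    mlflow.log_param("drift_detected", p_value < 0.05)  # hypothetical alert threshold

print(f"KS statistic={statistic:.3f}, p-value={p_value:.3f}")
```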
Spark has become synonymous with big data processing; however, the majority of data scientists still build models using single-machine libraries. This talk will explore the multitude of ways Spark can be used to scale machine learning applications. In particular, we will guide you through distributed solutions for training and inference, distributed hyperparameter search, deployment issues, and new features for machine learning in Apache Spark 3.0. Niall Turbitt and Holly Smith combine their years of experience working with Spark to summarize best practices for scaling ML solutions (one such pattern is sketched below).
Speakers: Holly Smith and Niall Turbitt
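As a taste of one of these patterns, here is a minimal sketch of distributed batch inference: a single-machine scikit-learn model applied across a Spark DataFrame via a pandas UDF. The model, columns, and toy data are illustrative assumptions, not code from the talk.

```python
# Minimal sketch: distributed batch inference with a pandas UDF.
# The scikit-learn model, column names, and toy data are illustrative.
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Train a small single-machine model on the driver (toy data).
X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, 1.5, 2.5, 3.5]})
y = pd.Series([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

@pandas_udf("double")
def predict_udf(x1: pd.Series, x2: pd.Series) -> pd.Series:
    # The model is pickled into the closure and shipped to executors;
    # each executor scores a batch of rows at a time.
    features = pd.DataFrame({"x1": x1, "x2": x2})
    return pd.Series(model.predict(features))

df = spark.createDataFrame(X)
df.withColumn("prediction", predict_udf("x1", "x2")).show()
```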
In this hands-on tutorial, we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open source Python package that implements the pandas API on top of Apache Spark, making the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas' new functionalities since its initial release, including support for Apache Spark 3.0, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
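To give a flavor of the API, here is a minimal sketch of the same groupby written with pandas and with Koalas; only the import and constructor change, while the pandas syntax carries over unchanged. The column names and data are illustrative assumptions.

```python
# Minimal sketch: the same groupby in pandas and in Koalas.
# Column names and data are illustrative.
import pandas as pd
import databricks.koalas as ks

data = {"group": ["a", "a", "b"], "value": [1, 2, 3]}

# Single-machine pandas
pdf = pd.DataFrame(data)
print(pdf.groupby("group")["value"].sum())

# Distributed Koalas: same API, executed on Apache Spark
kdf = ks.DataFrame(data)
print(kdf.groupby("group")["value"].sum())
```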
What you will learn: