Distributed Models Over Distributed Data with MLflow, PySpark, and Pandas


Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?

In this talk I will show that while more data often improves deep learning models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, image, and video, more data does not significantly improve high-bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single-node models can still outperform distributed models via transfer learning.

Data scientists face several pain points: running many models in parallel, automating the experimental setup, and getting others within an organization (especially analysts) to use their models. Databricks addresses these problems using pandas UDFs, the ML Runtime, and MLflow.
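The "many models in parallel" pattern described above groups the data by a key and trains one model per group. A minimal sketch of that pattern, in plain Python so it runs without a cluster (in Spark this would be a grouped pandas UDF, e.g. `df.groupBy("key").applyInPandas(...)`, with MLflow logging one run per model; the `fit_ols` helper and store names here are hypothetical illustrations):

```python
# Hypothetical sketch of the "one model per group" pattern.
# In Spark, each group would be handed to a pandas UDF on a separate task;
# here a plain dict comprehension stands in so the pattern is visible.

def fit_ols(points):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    var = sum((x - mx) ** 2 for x, _ in points)
    slope = cov / var
    return slope, my - slope * mx

# One small dataset per store (illustrative data).
data = {
    "store_a": [(1, 2), (2, 4), (3, 6)],
    "store_b": [(1, 5), (2, 8), (3, 11)],
}

# Train one model per group -- the step MLflow would record as one run each.
models = {key: fit_ols(pts) for key, pts in data.items()}
print(models["store_a"])  # (2.0, 0.0)
```

The same structure scales on Spark: the grouped pandas UDF receives each group as a pandas DataFrame, and the driver never needs to hold all the data at once.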

See More Spark + AI Summit Europe 2019 Videos

About Thunder Shiviah


Thunder Shiviah is a Databricks Solutions Architect and former McKinsey Machine Learning Engineer focused on productionizing machine learning at scale.