On July 9th, our team hosted a live webinar, "Scalable End-to-End Deep Learning using TensorFlow™ and Databricks", with Brooke Wenig, Data Science Solutions Consultant at Databricks, and Sid Murching, Software Engineer at Databricks.
In this webinar, we walked you through how to use TensorFlow™ and Horovod (an open-source library from Uber to simplify distributed model training) on the Databricks Unified Analytics Platform to build a more effective recommendation system at scale.
In particular, we covered some of the newly introduced capabilities to simplify distributed deep learning:
- The new Databricks Runtime for ML, which ships with pre-installed libraries such as Keras, TensorFlow, Horovod, and XGBoost so that data scientists can get started with distributed machine learning more quickly
- The newly released Databricks HorovodEstimator API for distributed, multi-GPU training of deep learning models against data in Apache Spark™
- How to make predictions at scale with deep learning pipelines
If you missed the webinar, you can view it now. Also, we demonstrated the following notebook:
Toward the end, we held a Q&A; all the questions and their answers are below. You can also give us your feedback on this webinar by filling out this short survey, so that we can continue to improve your experience!
Q: Can you perform hyper-parameter tuning of the model in TensorFlow in a distributed fashion?
There are a number of ways you could do this, including but not limited to:
- Train multiple single-node models in parallel on your cluster (e.g. one model per machine), or
- Launch multiple distributed training jobs and do a grid search across those jobs.
In terms of what the HorovodEstimator supports, we are exploring adding functionality for it to work with MLlib’s native tuning APIs so that you can kick off hyperparameter tuning across multiple distributed training jobs.
Alternatively, Deep Learning Pipelines (an OSS library from Databricks) allows for parallelized hyperparameter tuning of single-node Keras models on image data.
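The first option above (many single-node models trained in parallel, one per hyperparameter combination) can be sketched in plain Python. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical stand-in with a made-up loss surface, not a real TensorFlow training run, and a local thread pool stands in for the machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for training one single-node model.

    Returns a validation loss. A real version would fit a TensorFlow or
    Keras model with these hyperparameters and evaluate it.
    """
    lr, batch_size = params["learning_rate"], params["batch_size"]
    # Made-up loss surface with its minimum at lr=0.01, batch_size=64.
    return (lr - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


def grid_search(grid, max_workers=4):
    """Evaluate every combination in the grid in parallel; return the best."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    # Each candidate is independent, so the trials can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(train_and_score, candidates))
    best_loss, best_params = min(zip(losses, candidates),
                                 key=lambda pair: pair[0])
    return best_params, best_loss, len(candidates)


grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_params, best_loss, n_trials = grid_search(grid)
print(n_trials, "trials, best:", best_params)
```

Because the trials do not communicate with each other, the same pattern carries over to a cluster by sending each candidate to a different worker instead of a different thread.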
Q: What does ALS stand for?
ALS stands for Alternating Least Squares. It is a matrix factorization technique used for collaborative filtering. You can learn more about its implementation in Apache Spark here.
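To make the "alternating" part concrete, here is a minimal pure-Python sketch for the simplest case, where each user and each item gets a single latent factor. It is not Spark's implementation; the function names, the regularization value, and the toy ratings are all assumptions chosen for illustration. With the item factors held fixed, each user factor has a closed-form regularized least-squares solution, and vice versa, so the algorithm alternates between the two.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=20):
    """Fit ratings[(u, i)] ~ user[u] * item[i] by alternating least squares."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(n_iters):
        # Hold item factors fixed; solve a 1-D ridge regression per user.
        for u in range(n_users):
            num = sum(r * item[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item[i] ** 2 for (uu, i) in ratings if uu == u)
            user[u] = num / den
        # Hold user factors fixed; solve a 1-D ridge regression per item.
        for i in range(n_items):
            num = sum(r * user[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user[u] ** 2 for (u, ii) in ratings if ii == i)
            item[i] = num / den
    return user, item


def squared_error(ratings, user, item):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items())


# Toy data: 3 users x 2 items, with the rating for (user 2, item 1) unobserved.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
error_before = squared_error(ratings, [1.0] * 3, [1.0] * 2)
user, item = als_rank1(ratings, n_users=3, n_items=2)
error_after = squared_error(ratings, user, item)
predicted_missing = user[2] * item[1]  # estimate for the unobserved rating
print(error_before, error_after, predicted_missing)
```

Only observed ratings enter the sums, which is what lets the fitted factors produce an estimate for the pair that was never rated.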
Q: In your demo, how many servers did you use? How many cores?
Here is our config for this demo:
- Databricks Runtime Version: 4.1 ML Beta
- Python Version: 3
- Instance type: i3.xlarge (1 driver + 2 workers)
Q: Was the data replicated on each local machine? Does each worker node use all the data while training?
A unique subset of the training data is copied to the local disk of each machine, so each worker node trains on a unique subset of the data. During a single training step, each worker processes a batch of training data, computing gradients that are then averaged using Horovod’s ring-allreduce functionality and applied to the model.
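The gradient-averaging step can be illustrated with a small single-process simulation of ring-allreduce. This is a sketch of the general algorithm, not Horovod's code: the function name and the toy gradients are assumptions, and real implementations exchange chunks over the network rather than between Python lists. Each worker's gradient is split into one chunk per worker; a scatter-reduce phase sums the chunks around the ring, and an allgather phase circulates the finished sums so that every worker ends up with the same average.

```python
def ring_allreduce_mean(worker_grads):
    """Simulate ring-allreduce; return each worker's copy of the mean gradient."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split the gradient vector into n contiguous chunks.
    bounds = [(k * length) // n for k in range(n + 1)]
    bufs = [list(g) for g in worker_grads]

    def chunk(k):
        k %= n
        return range(bounds[k], bounds[k + 1])

    # Phase 1 (scatter-reduce): in each step, worker r sends one chunk to
    # worker r+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing data so all sends in a step happen "at once".
        sent = [[bufs[r][j] for j in chunk(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for value, j in zip(sent[r], chunk(r - step)):
                bufs[dst][j] += value

    # Worker r now holds the complete sum for chunk r+1.
    # Phase 2 (allgather): circulate the completed chunks around the ring.
    for step in range(n - 1):
        sent = [[bufs[r][j] for j in chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for value, j in zip(sent[r], chunk(r + 1 - step)):
                bufs[dst][j] = value

    # Divide the summed gradients by the number of workers.
    return [[x / n for x in buf] for buf in bufs]


# Toy gradients from three workers, each computed on a different batch.
grads = [[1.0, 2.0, 3.0, 4.0],
         [3.0, 4.0, 5.0, 6.0],
         [5.0, 6.0, 7.0, 8.0]]
averaged = ring_allreduce_mean(grads)
print(averaged[0])
```

In each step a worker sends and receives only one chunk, which is why the amount of data each worker transfers stays roughly constant as more workers are added.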
Q: How do you compare dist-keras and Horovod?
Dist-keras is another open source framework for training a Keras model in a distributed manner on a Spark DataFrame, and it is a great solution for training Keras models. We worked with both. One of the things we liked about Horovod is that it comes with benchmarks from Uber and is used in production there. We decided to build the HorovodEstimator around Horovod because it is used by a large company in industry and should be well supported moving forward.
Q: How long does single machine training take vs. distributed training?
This is a good question. In general, many factors influence performance when comparing single-machine and distributed training. Distributed training incurs overhead from communicating gradient updates between machines, but it also allows for speedups by processing batches of training data in parallel across machines. With HorovodEstimator there is also a fixed startup cost of writing the input Spark DataFrame in TFRecord format to the local disks of the cluster nodes. Our recommendation is to try single-machine training first and explore distributed training only if single-machine training does not scale sufficiently.
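The trade-off can be made concrete with a back-of-the-envelope cost model. Everything in this sketch is an assumption for illustration: the function names, the simple additive model, and all of the timings are invented and are not measurements from the demo.

```python
def single_machine_seconds(n_batches, step_seconds):
    """One machine processes every batch sequentially."""
    return n_batches * step_seconds


def distributed_seconds(n_batches, step_seconds, n_workers,
                        comm_seconds, startup_seconds):
    """Batches are split across workers; each step also pays for communication."""
    batches_per_worker = -(-n_batches // n_workers)  # ceiling division
    return startup_seconds + batches_per_worker * (step_seconds + comm_seconds)


# Invented numbers: 0.5 s of compute per batch, 0.1 s of gradient exchange
# per step, and a 60 s fixed startup cost for the distributed job.
for n_batches in (100, 10_000):
    single = single_machine_seconds(n_batches, step_seconds=0.5)
    dist = distributed_seconds(n_batches, step_seconds=0.5, n_workers=2,
                               comm_seconds=0.1, startup_seconds=60)
    print(n_batches, "batches: single", single, "s, distributed", dist, "s")
```

Under these made-up numbers the small job is faster on a single machine because the startup cost dominates, while the large job benefits from distribution, which matches the advice to start with one machine and scale out only when needed.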
Q: Hey, thanks for the talk. That was a great notebook scenario, but in a team of multiple data scientists working on different experiments on slightly different datasets, do you have functionality to track that? What other collaborative features do you support?
This is a great question! We recently announced a new open source project called MLflow for exactly that. With MLflow, practitioners can package and reuse models across frameworks, track and share experiments locally or in the cloud, and deploy models virtually anywhere. In addition, Databricks provides shared notebooks for Python or R development that also support change tracking and version history through GitHub.
Q: I don’t have much data and I’m using an LSTM model and TensorFlow. Would you still recommend Databricks and Apache Spark?
Apache Spark is great for any type of distributed training or ETL, and it unifies data processing with machine learning, but it still takes effort to install, configure, and maintain. Databricks gives you a flexible way to run jobs on Apache Spark, from single-node to multi-node use cases. You can easily add more nodes to your clusters with the click of a button. Databricks unifies all analytics in one place so that you can also prepare clean data sets for training your models. Databricks Runtime comes with TensorFlow pre-installed and configured, along with Keras, Horovod, XGBoost, scikit-learn, and more, giving you maximum choice and flexibility to build and train ML models. Last but not least, Databricks notebooks provide a collaborative environment to manage the entire lifecycle in one place, from data prep to model training and serving.
Q: Can you train a binary model with HorovodEstimator?
Yes. The HorovodEstimator API simply provides a way to run any TensorFlow code in a distributed setting against large-scale data, and that includes binary classification models.
Q: Does the cluster support multi-tenancy?
Yes, you can have multiple notebooks attached to your cluster, or multiple users on your cluster. You can also manage cluster permissions at various levels on Databricks for maximum flexibility and security.
Q: Could we share the scripts and slides of the presenters after the webinar?
Q: What is the difference between Apache Spark and Databricks?
The Databricks Unified Analytics Platform provides a hosted version of Apache Spark on both AWS and Azure, and much more. We provide built-in notebooks and APIs to manage and automate analytics in one place, as well as additional optimizations to Apache Spark, including Databricks Delta Lake, which is up to 100x faster than Apache Spark on Parquet, Databricks Runtime for ML, and the HorovodEstimator. We also significantly simplify DevOps with auto-configuration and auto-scaling of clusters, and provide enterprise-grade security features and compliance with a HIPAA and SOC 2 Type 2 certified platform.