On-Demand Webinar and FAQ: Apache Spark MLlib 2.x: How to Productionize your Machine Learning Models
by Richard Garris and Jules Damji
March 28, 2017 in Engineering Blog
On March 9th, we hosted a live webinar—Apache Spark MLlib 2.x: How to Productionize your Machine Learning Models—to address the following questions:
- How do you deploy machine learning models to a production environment?
- How do you embed what you've learned into customer-facing data applications?
- What are the best practices from Databricks on how customers productionize machine learning models?
To address the above concerns, we did a deep dive with actual customer case studies and showed live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
If you missed the webinar, you can view it on-demand here, and the slides and notebook are accessible as attachments to the webinar.
Toward the end, we held a Q&A; all the attendee questions are listed below, with links to the forum threads where you can view the answers.
- I thought that Machine Learning (ML) is an upgrade from MLlib. Is MLlib 2.x more up to date than ML?
- PipelineModel instances all have Dataset objects as input and output, and creating a Dataset requires having a SparkSession active. (Right?) If I have a mainframe deployment environment where I just want to give it a single record and get a single record back, what are my options? (See the single-record scoring sketch after this list.)
- Most of the models in MLlib support PMML export. What was the motivation for developing a proprietary real-time scoring model export?
- I assume this is a proprietary format for exporting the model. Why not use an open standard like PMML?
- Are there any MLlib standard implementations of clustering algorithms other than k-means?
- It seems like one might want to apply familiar DevOps CI/CD techniques to ML and model development. How do you see that flow working, and what products (e.g., a Jenkins-like product, a build tool) would help in a CI/CD scenario?
- What common APIs have you seen used for scoring an Apache Spark ALS model in real time? (See the ALS sketch after this list.)
- 1) Why does Spark model scoring (e.g., with a decision tree) make it hard or impossible to get a probability, yet easy to get a prediction (which is not very useful)? (See the probability sketch after this list.) 2) How can you export a model in a readable form (e.g., PMML) or generate code?
- Is this Databricks library compatible with the Dataset-based Spark ML (as opposed to MLlib)?
- Can you give some prediction of when RandomForest will become available in dbml-local?
- It seems like we should be able to build an AWS Lambda microservice that picks up the trained model from S3 and uses the dbml-local library to make predictions?
- Can you give an example use case where a customer needed to train a model on Apache Spark, but deploy on an external system? How common is that?
- We've been waiting for dbml-local for a long time! Great addition! When do you expect all classifiers in Spark to be available in dbml-local?
- Are all the Spark MLlib models available to use with dbml-local?
- Is there support to export the model as a POJO?
- Is there any plan to implement k-medoids in Apache Spark?
- 1. Are you planning to provide dbml outside of Databricks? 2. How does dbml relate to TensorFrames?
- Can you comment on the DataFrame-based model API and MLlib for production?
- Do you support other data formats such as netCDF?
- Will scoring consider ML pipeline activities like feature extraction?
- Is there any plan to publish REST APIs in Apache Spark itself to submit Spark jobs?
- Will the exportModel functionality be available in other languages like Python or R?
- In the decision tree visualization, can the real feature names (instead of feature 1, 2, 3, etc.) be displayed? (See the feature-name sketch after this list.)
- As of now, is there support only for the logistic regression model to be used outside of Apache Spark?
- What were the pros and cons of the three different schemes that you presented to productionize ML models? Why specifically demonstrate the third option?
- Do you plan to support other ML libraries in addition to MLlib?
- Don't you think PMML is the standard exchange format for predictive models?
- Is the dbml library available to the community?
- Could you please clarify what this model scoring option (private beta?) is? Is it available to all paying customers? If we do not use it, I'd like to know what we have to do to achieve the same thing with Databricks, Apache Spark, and MLlib.
- For raw input, how are the features computed that are passed to the model you showed in Eclipse?
- How do you compare the quality and efficiency of Spark ML 2.1 and scikit-learn?
- Where can you obtain the dbml-local JAR? Is it only available to Databricks customers?
- Today, some productionized machine learning models are updated each day. Do you have a solution for obtaining the optimized model parameters? (Random search with cross-validation works, but it takes a certain amount of time on a large, distributed configuration.)
- What if we build a modeling technology of our own (creating a modeling class based on Scala or Python libraries of our own)? How would we ensure this could be deployed using the same approach you've shown?
- I have found spark.ml gradient-boosted trees to be slower than other packages, such as H2O Sparkling Water. Is there any focus on increasing the performance of the gradient-boosted trees, or on better incorporating another package such as H2O Sparkling Water or XGBoost4J?
- Can you talk about how you would replicate the Spark training pipeline (string indexing, vector assembling, etc.) in this application? (See the pipeline sketch after this list.)
- How would an ensemble of models run using this new scoring approach? By creating and saving an ensemble pipeline?
- What is the best way to deploy a predictive model or recommendation engine if the scoring environment is an iOS app?
- dbml-local looks very much like the open-source MLeap project. Isn't it better for Databricks to contribute to that?
- Can Spark MLlib (or dbml-local) somehow read scikit-learn's model file?
- From a performance perspective, have you done any comparison between spark.ml and scikit-learn (same algorithm and parameters)? And is there a list of algorithms that run really well on Apache Spark?
- When using the 'local model' option, calculating 100 features has overhead. Is precomputing always necessary?
- Where can I find more information on dbml-local? I can't find it on your GitHub and am not getting many results when searching on Google.
- Can you talk about the option to save as PMML? (See the PMML sketch after this list.)
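A few of the questions above lend themselves to short code sketches. The sketches below are illustrative only: the paths, column names, and data are hypothetical, not excerpts from the webinar. First, single-record scoring with a saved PipelineModel; note that a SparkSession (even a local, single-JVM one) still has to be active to create the one-row Dataset:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// A SparkSession is still required to create a Dataset, but it can run
// in local mode inside a single JVM.
val spark = SparkSession.builder()
  .master("local")
  .appName("single-record-scoring")
  .getOrCreate()
import spark.implicits._

// Hypothetical model path and schema: load a fitted pipeline, wrap one
// record in a one-row DataFrame, and transform it.
val model = PipelineModel.load("/models/churn-pipeline")
val oneRecord = Seq((41.0, "premium", 12.5)).toDF("age", "plan", "usage")
model.transform(oneRecord).select("prediction").show()
```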
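For the ALS question, one common pattern (sketched here under the assumption that the model was trained with userCol "userId" and itemCol "movieId") is to score candidate (user, item) pairs in batch with ALSModel.transform and serve the results from a low-latency store:

```scala
import org.apache.spark.ml.recommendation.ALSModel

// Reuses the SparkSession and implicits from the previous sketch.
// The column names must match those the model was trained with.
val als = ALSModel.load("/models/als")  // hypothetical path
val candidates = Seq((1, 10), (1, 11), (1, 12)).toDF("userId", "movieId")
als.transform(candidates).show()  // appends a "prediction" rating column
```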
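On probabilities: the DataFrame-based spark.ml classifiers do expose class probabilities alongside the hard prediction. A minimal sketch, assuming train and test DataFrames with "features" and "label" columns already exist:

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// `train` and `test` are assumed to be in scope.
val dtModel = dt.fit(train)
dtModel.transform(test)
  .select("prediction", "probability", "rawPrediction")
  .show(5, truncate = false)
```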
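On feature names in the tree: the plain toDebugString output prints feature indices, but one workaround (assuming the pipeline's VectorAssembler is in scope) is to substitute the assembler's input column names for the indices:

```scala
// `assembler` and `dtModel` are assumed to be in scope from the pipeline.
// The trailing space avoids matching "feature 1" inside "feature 10".
val featureNames = assembler.getInputCols
val readable = featureNames.zipWithIndex.foldLeft(dtModel.toDebugString) {
  case (text, (name, i)) => text.replace(s"feature $i ", s"$name ")
}
println(readable)
```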
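On replicating the training pipeline at scoring time: the simplest safeguard is to put the feature transformers in the Pipeline itself, so the fitted PipelineModel replays string indexing and vector assembly on raw input. A sketch with assumed column names:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("plan").setOutputCol("planIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("planIdx", "age", "usage"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label")

// The saved PipelineModel carries the feature engineering with it,
// so scoring-time input can be raw columns rather than assembled vectors.
val fitted = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(train)
fitted.write.overwrite().save("/models/churn-pipeline")
```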
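Finally, on PMML: in Spark 2.x, PMML export lives in the older RDD-based spark.mllib API, where a handful of models (k-means and the linear models) mix in PMMLExportable; the DataFrame-based pipelines do not export PMML out of the box. A minimal sketch:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Get the SparkContext from the SparkSession used in the first sketch.
val sc = spark.sparkContext
val points = sc.parallelize(
  Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val kmeansModel = KMeans.train(points, k = 2, maxIterations = 10)
println(kmeansModel.toPMML())           // the PMML document as an XML string
kmeansModel.toPMML("/tmp/kmeans.pmml")  // or write it to a local file
```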
If you'd like to try Databricks, you can sign up for a free trial here.
Try Databricks for free