Operationalize Apache Spark Analytics

Apache Spark is a unified analytics engine for large-scale, distributed data processing, and Spark MLlib (Machine Learning library) is a scalable Spark implementation of common machine learning (ML) functionality, as well as associated tests and data generators. Data scientists are introducing such models at scale in many organizations. But because it is a new open source technology, deployment by IT takes time. The result is that, of the many models trained, many are parked in project repositories and only a few are actually deployed. For these reasons, businesses are looking for a solution to orchestrate and manage their models so that they can speed up model deployment without losing model governance. Using SAS Model Manager and SAS Workflow Manager on the SAS Viya platform, attendees will learn how we provide a model life cycle that governs and orchestrates Spark MLlib models by integrating the Apache Spark REST API service (Apache Livy) with the SAS Workflow REST API services. With our work, we provide a business process management (BPM) solution to build, register, compare, test, approve, publish, monitor, and, if needed, retrain those models in an automated and controlled manner. In the end, this automated architecture is a build-once, use-many BPM solution that reduces manual intervention and accelerates customers' ability to operationalize their Spark MLlib models while keeping governance of their analytical environment.

Video Transcript

– Hi, everyone, and thanks for joining our session. I’m Ivan Nardini, a Senior Customer Advisor for ModelOps and Decisioning at SAS. Today, with Artem Glazkov, we are going to show you how we operationalize Spark analytics with SAS.

Operationalize Apache Spark Analytics

This is the agenda. I’ll show you the two main ModelOps options SAS has to manage the entire model life cycle for Spark models in our analytical platform, SAS Viya. The first one is based on PMML, which is one of the main standards for statistical and data mining models. The other one is based on Apache Livy, which is a service to submit Spark jobs to a Hadoop cluster via REST API. The session is based on an opportunity I had with an Italian customer last year. In particular, we had a follow-up related to the Livy solution, which is why I’ll let Artem give you a demonstration of it at the end. But first, why are we here today? Since I joined SAS one year ago, working on ModelOps and decisioning, I have noticed that customers can easily prototype machine learning models, but only a few of them are able to operationalize those models. Indeed, building a machine learning system may be difficult, and in the end maintaining it is expensive, because there is hidden complexity to deal with. Our customer, for instance, was not able to monitor model performance decay; models do not adapt by themselves to real-world changes.

ModelOps Challenges

For example, if the data changes, models will break, and you will not know unless you have a system to monitor their performance. That’s why tracking model performance is one of the real challenges for ModelOps. Another concern was collaboration. They developed models in an environment that lacked collaboration, which meant that models were not ready to drive decisions. The last challenge was retraining. As with performance monitoring, figuring out the right model to retrain at the right moment is another bottleneck of machine learning systems. To address these challenges, SAS has two integrated components in its platform.

How we meet ModelOps challenges

SAS Model Manager and SAS Workflow Manager. Essentially, they provide a model repository, so one place to store all the models, integrated with two built-in scoring engines, CAS and MAS, for both batch and real-time scoring, plus the possibility to integrate external engines. This kind of integration allows us to automate all repetitive model management tasks for different types of models, both open source and SAS models. And because you get a single environment, you can assess these models with built-in and customized model quality reports. Powered with these capabilities, customers can realize additional value. Indeed, once a model is deployed, it can receive alerting triggers for retraining, and the model can be redeployed at the right moment. That generates the additional value.

That said, before showing how we achieve these results, we have a question for you: how do you operationalize your models? Please just type your answer in the chat.

At SAS, we operationalize Spark models with collaboration. Indeed, we take advantage of open source technologies like PMML and Apache Livy, and enhance them with ModelOps capabilities, thanks to SAS Model Manager and SAS Workflow Manager.

So let’s start by talking about the PMML option.

As you probably know, PMML is the main standard for deploying machine learning models. PMML enables model development on one system using one application and then deployment of the same model on another system using another application, because it uses an XML file to represent the model. SAS has supported this standard since its birth. Nowadays it has a developer community that keeps it up to date with new modeling technology. There are a lot of side projects around PMML. One of them is JPMML-SparkML, a Java library that essentially allows you to convert Apache Spark ML pipelines to PMML format. And because our customer uses Spark models, we just used the Python wrapper of this library, which is the pyspark2pmml package. Thanks to this library, you can fit your pipeline in your Spark session and then use the PMMLBuilder class to create the PMML file. Now, luckily for us, SAS Model Manager has the technology to deal with PMML: it allows us to convert PySpark models into SAS code and provides a SAS technology stack to accelerate scoring in-database. In particular, once you train the model and generate its PMML version, as you can see in the picture, you can register the model in SAS Model Manager using either the user interface or the REST API. And once you version your model, you can deploy, monitor, and retrain it in-database, thanks to SAS Workflow Manager and the SAS In-Database technology. Just as a reminder, Workflow Manager is optional in this solution. It depends on how much you want to automate the entire process.
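The fit-then-export step described here can be sketched in a few lines. This is a minimal illustration, not the customer's actual code: it assumes pyspark and pyspark2pmml are installed and that the JPMML-SparkML jar is on the Spark classpath. The function name and the output path are hypothetical.

```python
def export_pipeline_to_pmml(spark, training_df, pipeline, pmml_path):
    """Fit a Spark ML pipeline and write the fitted model out as PMML.

    spark       -- an active SparkSession (JPMML-SparkML jar on its classpath)
    training_df -- the DataFrame used to fit the pipeline
    pipeline    -- an unfitted pyspark.ml.Pipeline
    pmml_path   -- hypothetical output location, e.g. "churn_model.pmml"
    """
    # Imported lazily so this module can be loaded where Spark is not installed.
    from pyspark2pmml import PMMLBuilder

    # Fit the pipeline in the Spark session, as described in the talk.
    pipeline_model = pipeline.fit(training_df)

    # PMMLBuilder takes the SparkContext, the training data (for its schema),
    # and the fitted PipelineModel, then writes the PMML document.
    builder = PMMLBuilder(spark.sparkContext, training_df, pipeline_model)
    return builder.buildFile(pmml_path)
```

The resulting file is what would then be registered in SAS Model Manager through the user interface or the REST API.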

And because our customer asked us to automate its model life cycle, we built this workflow process. Artem will give you more details about each component, but the idea here is essentially automation. You can build this kind of process and automate all the steps of your model life cycle.

PMML approach Pros and Cons

So what about the pros and cons of this approach? The main point here is that you get the SAS In-Database technology to accelerate your scoring process. But on the other side, you have to know that PMML does not support all the algorithms coming from the Spark framework. That’s why we decided to think about a plan B.

And our plan B was the Apache Livy solution.

So what is Apache Livy? Apache Livy is a service that enables easy submission of Spark jobs, or any kind of Spark code, from a client to the Hadoop Spark environment. So the idea here was: why not use one of the components on the platform as the client, in particular SAS Workflow Manager?

So, as before, we train our models, and we register the parquet version of our Spark machine learning model together with the scoring code. Then we use the SAS Workflow capabilities, in particular the job execution and REST API services, to submit the scoring REST API call, get back the scored data, and generate the performance monitoring reports. Based on these reports, we decide whether we need to retrain the model or just wait until the next scoring run.
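To make the scoring call concrete, here is a rough sketch of what a client could send to Livy's batch endpoint (`POST /batches`) and how it could read the state back (`GET /batches/{id}/state`). The server URL, script path, and arguments are made up for illustration, and the talk does not show the actual request SAS Workflow Manager sends.

```python
import json
import urllib.request


def build_batch_request(livy_url, script_path, args):
    """Build (but do not send) a Livy POST /batches request.

    livy_url, script_path and args are illustrative values, not from the talk.
    """
    payload = {"file": script_path, "args": [str(a) for a in args]}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batch_state_url(livy_url, batch_id):
    """URL that reports the current state of a submitted batch."""
    return "{}/batches/{}/state".format(livy_url.rstrip("/"), batch_id)


def send_json(request_or_url):
    """Send a request (or GET a URL) and decode Livy's JSON reply."""
    with urllib.request.urlopen(request_or_url) as response:
        return json.load(response)


# Example of building a scoring request (nothing is sent here):
scoring_request = build_batch_request(
    "http://livy.example.com:8998",    # hypothetical Livy server
    "hdfs:///models/churn/score.py",   # hypothetical scoring script
    ["project-123", "score"],
)
```

Calling `send_json(scoring_request)` would submit the job and return the batch description, whose `id` can then be passed to `batch_state_url` to check progress.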

This is the high-level view of what we did. As you can see, compared to the previous solution, the main changes relate to SAS Workflow Manager, because in this solution it is the main component: it is our client that sends the REST API calls to the Apache Hadoop Spark environment. And we lose the In-Database technology, because we manage the Spark models without converting them to SAS code.

So, as before, this is the process we built for this scenario. As you can see, we automate the scoring and performance monitoring tasks, and once the report is generated, the user reviews it and decides whether the model needs to be retrained or not.

Finally, about the pros and cons of this second solution. Essentially, the main point here is that you get a native integration: you don’t have to convert the score code or manipulate it. But on the other side, you have to spend some time on Livy server configuration. So, now that you know how we manage Spark models, I would invite Artem to show you how we automate the model life cycle of a Spark churn model with SAS Model Manager, SAS Workflow Manager, and the Livy server.

– [Artem] Yeah, thank you, Ivan. Now I’m going to show the technical side of this topic. I will cover one of the options, where we govern the Spark ML model using SAS Model Manager on Viya and the Apache Livy REST service. So here you can see the graphical user interface of the Model Manager repository. Here we have all the models registered in our organization. Models can come here from any development environment we want to use; it could be SAS, Spark, or TensorFlow.

Models are grouped into projects. We can think of a project as the scope for solving a certain business problem. A project can hold one or more models, so we can run a competition between algorithms, versions, and frameworks in order to choose a champion that we will put into production.

For our demo case, I have created a Spark ML demo project.

As you can see here, we have not got any models registered to this project. So let’s do a model registration.

I click import.

At this point I have already trained a model using Spark MLlib and packed the result, including score code and metadata, into zip format. All I need to do is pick the location of this zip archive and click the import button.
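As a rough illustration of that packaging step, the sketch below zips a saved model directory together with a small metadata file. The file name `metadata.json` and the metadata fields are hypothetical; the talk does not detail the exact layout SAS Model Manager expects inside the archive.

```python
import json
import zipfile
from pathlib import Path


def package_model(model_dir, archive_path, metadata):
    """Zip every file under model_dir plus a metadata.json entry.

    The archive layout is an assumption made for illustration only.
    """
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(model_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in model_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                archive.write(path, path.relative_to(model_dir).as_posix())
        # Store descriptive metadata alongside the model files.
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return archive_path
```

A call such as `package_model("churn_model", "churn_model.zip", {"algorithm": "GBT"})` would produce the archive to pick in the import dialog.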

Now the model has been successfully registered to the project. It’s worth mentioning that, in addition to the GUI approach, I can do registration, publishing, and other governance tasks with a couple of lines of code using REST API calls to the SAS platform. Now let’s examine the content of our model.

Here is what our model looks like. We can see all the information required to properly run it in production and reproduce it. We can have the train code, the score code, and other supporting files. On the Variables tab we can see the specification of the model’s input data and of its output as well.

On the Properties tab, we will find additional information, like who created or modified the model and when, what kind of algorithm is used, and so on. This information is helpful to get a clear understanding of how and when the model should be used. Now, let’s return to the project tab.

Since we have already trained and registered our Spark ML model, we can start the automated governance process. To do this, we move to the Workflow tab of the project. Here, we will launch the process.

We choose the process type that we want to run for our project and enter some parameters of the workflow: for example, when we want to start using our model, and how often we want to run batch scoring or performance assessment.

Now let’s talk about workflows in detail.

Workflows are managed by a tool named SAS Workflow Manager. It is integrated and delivered in a bundle with SAS Model Manager, and it provides functionality to automate each model life cycle task. In Workflow Manager you can see two tabs: Definitions, for designing and modifying workflows, and Instances, for workflow administration and monitoring. Let’s now dive into the workflow we have just started for our Spark model project.

Here we have a graphical representation of the process we just launched. The key point is that with such an automated process, we can orchestrate both running scripts and interacting with users, through service tasks and user tasks respectively. User tasks allow us to interact directly with the roles involved in the model management process. We can communicate with developers, model validators, model stewards, and so on. And service tasks allow us to run any kind of code. We can run training and retraining processes, publishing, batch scoring, send notifications, and so on. To create such workflow definitions, we just need to drag and drop objects located on the left panel of this diagram, fill in their properties, and connect them to each other in the desired logic. So in our case, we are here at the starting point, and then we move to the timer object. When it expires according to our inputs, we run a batch scoring job. Using this kind of job is an easy way to parameterize scripts for repeatable model governance actions. In this service task we run a job named Run Livy script with several arguments, such as project ID and task name.

Now let’s look at what this job definition looks like.

Here is our job definition, Run Livy script. The parameters, task name and project ID, which we get from the workflow, are sent directly to the Python script and the Livy service. As a result, our training and scoring jobs are executed on the Spark cluster side.
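One way such a parameterized job could forward the two workflow parameters is sketched below. The argument names, the task names, and the script locations are all hypothetical; the real job definition is only shown on screen in the demo.

```python
import argparse

# Hypothetical mapping from workflow task names to scripts on the cluster.
TASK_SCRIPTS = {
    "score": "hdfs:///models/{project_id}/score.py",
    "retrain": "hdfs:///models/{project_id}/train.py",
}


def parse_job_arguments(argv):
    """Read the parameters the workflow passes to the job."""
    parser = argparse.ArgumentParser(description="Run a Livy script for a workflow task")
    parser.add_argument("--project-id", required=True)
    parser.add_argument("--task-name", required=True, choices=sorted(TASK_SCRIPTS))
    return parser.parse_args(argv)


def livy_payload_for(job_args):
    """Translate the job parameters into a Livy batch payload."""
    script = TASK_SCRIPTS[job_args.task_name].format(project_id=job_args.project_id)
    # The project ID and task name are passed on to the Spark script as arguments.
    return {"file": script, "args": [job_args.project_id, job_args.task_name]}


# Example: what a retraining task for project "p42" would send to Livy.
example_payload = livy_payload_for(
    parse_job_arguments(["--project-id", "p42", "--task-name", "retrain"])
)
```

Keeping the task-to-script mapping in one table means the same job definition can serve scoring and retraining, which matches the idea of a single reusable job driven by workflow parameters.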

After batch scoring, we create a model performance report. That will be a health check for the model. To run these kinds of reports, we use the out-of-the-box functionality of the Model Manager REST API, just sending the system the ID of the report we want generated.

Then we notify the model steward and ask him to review these reports and decide, based on that, whether the model should be retrained or not.

If the model steward does not respond within one day, we resend the notification.

So we are acting in the model steward role, responsible for model performance at the production stage. We are asked to check the health state of the model. These kinds of questions are becoming most important in the current circumstances, due to rapid changes in the economy and customer behavior. To answer these questions, the model steward should look at Model Manager’s out-of-the-box model performance reports.

To create these reports, we just need to open the wizard and provide parameters such as which model should be monitored and the location of the dataset the model faces at the production stage. As a result, we get this set of reports that help us track the degradation of the model’s predictive power. For instance, we can look at the Gini index dynamics over four quarters of production history, and we can see that it decays below 0.5.
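For readers who want to reproduce this kind of check outside the reports, the Gini index can be derived from the area under the ROC curve as Gini = 2 × AUC − 1. The sketch below computes it from labels and scores and flags quarters that fall under a threshold. Treating 0.5 as the review threshold is an assumption based on the level mentioned in the demo, and the quarterly values are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini index of a scoring model: 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


def quarters_needing_review(gini_by_quarter, threshold=0.5):
    """Return the quarters whose Gini fell below the (assumed) threshold."""
    return [quarter for quarter, value in gini_by_quarter.items() if value < threshold]


# Illustrative values only, not taken from the demo.
history = {"Q1": 0.62, "Q2": 0.58, "Q3": 0.51, "Q4": 0.44}
flagged = quarters_needing_review(history)
```

The pairwise AUC is quadratic in the number of rows, so it is only suitable for small samples; a rank-based computation would be the usual choice on production volumes.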

So based on these reports, we can answer in the questionnaire form that the model quality is not good and that we want to retrain.

After we choose the retraining option, the model is trained on the Spark cluster side, and after that a new version of the model is registered in the Model Manager repository. Let’s now switch back to the Model Manager view of our project and verify that the new version of the model was created.

So that’s it: we have covered how we can govern one particular model. But typically an organization has plenty of models, and for an analytical team lead it is good to have a big picture of the state of all the models the team is working with. To do that, on the same platform as SAS Model Manager, we created a ModelOps dashboard to address the most important questions our organization faces while maintaining models at enterprise level. For example, using this kind of report, we can monitor model performance across all business units. By reviewing this graph, we can figure out which models are performing well and which models should be retrained.

And here we can track the time spent on each phase of the model life cycle and control the SLA for each step, as well as the overall time to market of our analytical projects.

On this tab, we can see the workloads of our models placed in production and examine the adequacy of our models and the underlying hardware.

Since all models are data-driven, variable shift is one of the key indicators to track. For example, here we can see that almost half of the variables consumed by our models are at risk, meaning that their distribution has changed significantly. We can see exactly which variables drifted the most. It is important to track these changes in input data, understand them, and work out how models can be adapted to them.
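The talk does not say which statistic the dashboard uses to decide that a variable is at risk. One common choice for this kind of check is the Population Stability Index (PSI), sketched here under that assumption. The 0.25 cut-off is a widely used rule of thumb rather than a value from the demo, and the bin counts are invented.

```python
import math


def population_stability_index(expected_counts, actual_counts, floor=1e-6):
    """PSI between a baseline and a current histogram over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares so empty bins do not produce log(0) or division by zero.
        e_share = max(expected / expected_total, floor)
        a_share = max(actual / actual_total, floor)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi


def variables_at_risk(baseline, current, threshold=0.25):
    """Names of variables whose PSI exceeds the (rule-of-thumb) threshold."""
    return sorted(
        name
        for name in baseline
        if population_stability_index(baseline[name], current[name]) > threshold
    )


# Invented bin counts for two input variables.
baseline_bins = {"age": [50, 50], "tenure": [30, 70]}
current_bins = {"age": [90, 10], "tenure": [31, 69]}
drifted = variables_at_risk(baseline_bins, current_bins)
```

Here `age` shifts from an even split to 90/10 and is flagged, while `tenure` barely moves and is not.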

So as a result of implementing such an automated model governance approach, you can be sure that your Spark models get to production in a proper way and stay up to date with respect to the rapidly changing world. Thank you for your attention.

About Ivan Nardini

SAS Institute srl

Ivan Nardini is a Customer Advisor specializing in ModelOps and Decisioning. He has been involved in operationalizing analytics using different technologies (both SAS native and open source) in a variety of industries. His focus is on providing solutions to operationalize analytics and optimize business decisioning processes. To reach this goal, he works with software technologies and cloud.

About Artem Glazkov

SAS Institute

Artem is a Senior Consultant in the Advanced Analytics Platform Practice at SAS Russia. He works closely with banks, insurers, and retailers. His primary interest is conquering ‘the last mile’ of analytics. He wants companies to be sure that the life cycle of their models is efficiently maintained, meaning that all kinds of ML models, regardless of the development framework, are properly stored, validated, deployed, and monitored.