Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS SageMaker for Enterprise AI Scenarios

Transformer-based pre-trained language models such as BERT, XLNet, RoBERTa and ALBERT significantly advance the state of the art in NLP and open doors for solving practical business problems with high-performance transfer learning. However, operationalizing these models with production-quality continuous integration/delivery (CI/CD) end-to-end pipelines that cover the full machine learning life cycle stages of train, test, deploy and serve, while managing associated data and code repositories, is still a challenging task. In this presentation, we will demonstrate how we use MLflow and AWS SageMaker to productionize deep transformer-based NLP models for guided sales engagement scenarios at Outreach, the leading sales engagement platform.

We will share our experiences and lessons learned in the following areas:

- A publishing/consuming framework to effectively manage and coordinate data, models and artifacts (e.g., vocabulary file) at different machine learning stages
- A new MLflow model flavor that supports deep transformer models for logging and loading the models at different stages
- A design pattern to decouple model logic from deployment configurations and model customizations for a production scenario using MLproject entry points: train, test, wrap, deploy
- A CI/CD pipeline that provides continuous integration and delivery of models into a SageMaker endpoint to serve production usage

We hope our experiences will be of great interest to a broad business community who are actively working on enterprise AI scenarios and digital transformation.

Speakers: Yong Liu and Andrew Brooks


– [Yong] Welcome to our session. My name is Yong Liu, and together with my colleague Andrew Brooks, we are very excited to share our experience developing continuous delivery of deep transformer-based NLP models using MLflow and AWS SageMaker for enterprise AI scenarios. Here is our presentation outline, with four sections. Let’s get started with the first section: some introduction and background. You may or may not have heard of Outreach, the company we work for. Outreach is the number one sales engagement platform, powering more than 4,000 customers and growing. These customers include many well-known startups and multinational companies. So, what is a sales engagement platform? A sales engagement platform encodes and automates sales activities such as emails, phone calls and meetings into workflows. For example, the diagram on the left shows a workflow where you send an email and place a phone call on day one, schedule a LinkedIn message on day three, and then send another email and make another phone call on day five. With such automation, sales reps’ performance and efficiency can be improved dramatically, up to 10 times, by doing more effective one-on-one personalized outreach. In addition to automation, we are also adding intelligence to the sales engagement platform. That’s where machine learning, NLP and AI come into play. So how do machine learning and NLP help? By using our product, sales reps generate a lot of data such as emails, call transcripts and engagement logs. We then leverage machine learning and NLP to learn continuously from this data and combine it with knowledge to provide predictions, recommendations and guidance for the continued success of reps. This becomes the flywheel shown on the left. The reason for continuous learning is that the sales process changes for various reasons.
We have already observed that COVID-19 changed how sales reps use content, and that’s why it is important to enable continuous learning and guidance. One particular use case we will highlight in this talk is guided engagement, which we will discuss later. However, before we can really enjoy the benefits of machine learning, NLP and AI, we have some barriers to overcome. Now I’ll hand off to Andrew to continue the discussion.

– [Andrew] So now, to set the stage and motivate what we’ve implemented and why, we’ll discuss some of the implementation challenges we faced here at Outreach. While we ground this discussion in the context of our experience, we expect many of these challenges to be shared across other enterprise machine learning teams. So first, challenge one: the dev-prod divide. If you’ve ever felt like a model developer throwing a model over the wall to an Ops team, and that’s the last you’ve seen of it, or an Ops team catching a black-box model with unclear requirements and interfaces, this cartoon might speak to you. When model developers are isolated, when they can’t see or use code from the prod environment, they can’t test on live data, the data streams that actually feed the model in the product or application. And why is this a problem? This often leads to misspecified model pipelines, which produce bad predictions, which in turn produce complaints from users or even paying customers. And that pain is compounded even more when model developers can’t reproduce those reported bugs or issues. Remediation could involve manually diffing a dev and a prod pipeline to understand the root cause. We’ve been there. It’s costly, it’s inefficient, it’s not fun and it’s not necessary. Lastly, it’s also wasteful when production-grade machine learning tooling developed in prod cannot be used for future model development. This is the scenario where a V1 model has been trained, productionized and shipped, but V2 model training is restricted to iterations of the notebook and ad hoc code used to develop the V1 model. Challenge two: dev-prod differences. These are the scenarios where differences between dev and prod pipelines are inevitable or sometimes desirable, the good differences. One common difference is that the data sources for model training and for model scoring in production are often different.
Data used for training is typically from a persistent data store: analytics data used internally within an organization to build models and reports, without the danger of directly modifying customer-facing data or putting load on production applications. Prod data for scoring is often streamed, not static. It’s data that is customer-facing and might have been off limits to model developers during training. Often these data sources require different pre-processing pipelines. The second difference is the inclusion of product-specific or business logic that is desired for prod model scoring but not during training. For example, the scoring pipeline in prod might want to suppress predictions where the model is not confident, but no such filters are desirable or even exist for the training pipeline. So, challenge three: arbitrary uniqueness. Without a framework codifying common design patterns, components have a tendency to be individually great and powerful but collectively suboptimal and even counterproductive when connected to other components in a pipeline or system. This is the scenario where the whole is not greater than the sum of its parts, despite the individual uniqueness and greatness of those parts. This is probably occurring when deploying each new model feels like a special case and you are reinventing the wheel for components that mostly exist already. Not only does this involve a lot of extra development, but it often produces pipelines that are not self-documenting. For example, if the gates and deploy mechanisms of the pipeline are not consistently defined, it’s unclear how to even run the pipeline. The ability to reuse across projects and models and to integrate within a bigger system is limited. Naturally, pipeline maintenance and extension is painful and inefficient, even more so when onboarding new engineers or developers. So, the last challenge, challenge four: provenance. Specifically, provenance from models to source code and source data.
Why do we need this? If we don’t know what’s running in prod, we can’t reproduce issues and bugs reported by users, as we discussed in challenge one. A second negative effect is that model pipeline changes might make teams grimace with fear rather than excitement about shipping improvements. This is often the case when we’re not confident that a mechanism exists to consistently determine exactly what’s running in prod, how it got there, how to reproduce it or how to promote a new model to replace it. Lack of provenance can also compromise historical and temporal analyses that use model predictions. If released models aren’t versioned, this could compromise benchmarking or historical analysis, where an apparent real-world behavior change is actually caused by an undocumented model change. So given these pain points and challenges, we’ll discuss how we overcame some of them in the context of a real use case at Outreach. Afterwards, we’ll also share some of the challenges we continue to face and thoughts for addressing those in future work. The use case we’ll walk through is the Outreach guided engagement feature. It’s an inbox-based intelligent experience powered by an intent classification model under the hood. When sales reps receive replies from their prospects, existing or future customers, Outreach predicts and displays the intent of that prospect’s email: perhaps positive, where the prospect is willing to meet, or in this case objection, where the prospect already has a solution. Based on the predicted intent, relevant content is recommended to the sales rep. For simplicity, our talk will focus on just the intent prediction component, text classification, not the content recommendation component. As we discuss our use case and pain points, we’ll reference where we are in the full machine learning model life cycle. In this talk we’ll focus on the middle four stages, starting with model dev.
This is where we run many experiments offline to quickly iterate and develop the model logic of the winning model that we want to ship. In pre-prod, we mature and package that winning model logic into software that gets published for use in production. For our use case, this publishes a Docker image and trained model artifacts. The last two stages we’ll talk about, model staging and model prod, are where the model is hosted, exposing an endpoint for Outreach, our product application, to call. So, starting with the model dev phase for our use case: this is where most of our development was in Databricks or Jupyter notebooks and code repositories used only for developing model logic and running offline experiments. Even though we didn’t intend to ship this code, we did leverage MLflow tracking to tie experiments to results. This provenance prevented unnecessarily repeating experiments multiple times and provided context and baselines for the winning model. Our model development often includes many modeling frameworks and techniques, each with different APIs. For this particular model we explored SVMs using sklearn, fastText and Flair. Ultimately we chose the Hugging Face transformers library for its unified API to state-of-the-art deep transformer architectures and pre-trained language models. State of the art is a quickly moving target in this domain, so a project with an active community quickly closing the gap between published research and implementation was important to us. While strong in momentum, the Hugging Face transformers library is a relatively young project and not yet a native MLflow flavor. We avoided arbitrary uniqueness by extending MLflow and writing our own MLflow flavor that lets us plug into the rest of the MLflow framework. So what does that mean?
That means we wrote a tiny wrapper class, shown on the left side, that maps the Hugging Face transformers library, which itself wraps a multitude of powerful architectures and models, to the standard MLflow models API. And what does this get us? Among other things, this buys us a standard mechanism for model serialization, both saving and loading. We also wrote a transformer classifier class that’s scikit-learn pipeline compatible, so that we can chain our transformer model with pre- and post-processing steps. And why do we need that? We need this when we have scenarios like in challenge two, where train and prod pipelines need to be different because the data sources are different or there’s different business logic desired in production but not in training. One example in our scenario is filtering email auto-replies from production scoring but not during model training. And here we have an example of a saved transformer model in the MLflow tracking server, with its associated artifacts shown on the left. The code snippet on the right shows just a couple of lines to log or load the model. This bypasses the arbitrary uniqueness involved with manually dealing with Python’s pickle or other serialization protocols, which can be finicky and pass the burden on to the consumers of the model. In pre-prod, where we’re intentionally rewriting code and refactoring it from the dev notebooks into software that will actually run in production, we adopted MLflow’s MLproject pattern. MLproject is a fairly lightweight layer that centralizes entry points and standardizes their configuration and environment definition and management. Again, a cheap way to avoid the pain associated with arbitrary uniqueness by providing a self-documenting framework for the pipeline.
From a workflow perspective, we’ve found that the flexibility of MLproject to run remote code and execute on remote clusters also accelerated our development and tightened up some of our testing and code reviews, such as reproducing model results. By referencing the code to run by GitHub release tags or commit tags, shown in red, we’re able to buy ourselves improved provenance, tying the source code directly to the model artifacts and results in MLflow tracking. From a workflow perspective, this also prevents the hassle of manually cloning and running local code. Referencing the remote execution environments, shown in green, allowed us to develop code outside of Databricks notebooks in our IDEs of choice, while also leveraging the power of the Databricks runtime for execution on powerful GPU-based clusters.
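
The wrapper idea described above can be illustrated with a small, dependency-free sketch. It is not the speakers’ flavor and it does not call MLflow or the transformers library: `KeywordIntentBackend`, `IntentModelWrapper`, the file names and the flavor name are all hypothetical stand-ins. In MLflow itself, the comparable roles are played by a flavor module’s `save_model` and `load_model` functions and by the `MLmodel` metadata file in the model directory.

```python
import json
import tempfile
from pathlib import Path


class KeywordIntentBackend:
    """Toy stand-in for a fine-tuned transformer classifier (hypothetical)."""

    def __init__(self, keywords):
        # Maps an intent label to a list of trigger phrases.
        self.keywords = keywords

    def classify(self, text):
        lowered = text.lower()
        for label, phrases in self.keywords.items():
            if any(phrase in lowered for phrase in phrases):
                return label
        return "other"


class IntentModelWrapper:
    """Uniform save / load / predict surface, so consumers never touch backend details."""

    FLAVOR_NAME = "toy_transformer"  # hypothetical flavor name

    def __init__(self, backend):
        self.backend = backend

    def predict(self, texts):
        return [self.backend.classify(text) for text in texts]

    def save(self, model_dir):
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)
        # Artifact file: here a keyword table; in practice weights, vocabulary, config.
        (model_dir / "backend.json").write_text(json.dumps(self.backend.keywords))
        # Metadata file recording which flavor wrote this directory.
        meta = {"flavor": self.FLAVOR_NAME, "artifact": "backend.json"}
        (model_dir / "model_meta.json").write_text(json.dumps(meta))

    @classmethod
    def load(cls, model_dir):
        model_dir = Path(model_dir)
        meta = json.loads((model_dir / "model_meta.json").read_text())
        if meta["flavor"] != cls.FLAVOR_NAME:
            raise ValueError(f"unexpected flavor: {meta['flavor']}")
        keywords = json.loads((model_dir / meta["artifact"]).read_text())
        return cls(KeywordIntentBackend(keywords))


backend = KeywordIntentBackend({"positive": ["meet", "demo"], "objection": ["already have"]})
with tempfile.TemporaryDirectory() as tmp:
    IntentModelWrapper(backend).save(tmp)
    restored = IntentModelWrapper.load(tmp)
    predictions = restored.predict(["Happy to meet next week", "We already have a solution"])
```

The consumer only calls `load` and `predict`; how the backend is serialized stays inside the wrapper, which is the property the talk attributes to a standard model API.
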
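
As a rough sketch of running an entry point pinned to a release tag on a remote backend, the helper below assembles an `mlflow run` command line. The flags used (`-e`, `--version`, `--backend`, `--backend-config`, `-P`) are standard `mlflow run` options; the repository URL, tag, parameter and cluster spec file are hypothetical, and the helper itself is only an illustration, not code from the talk.

```python
import shlex


def build_mlflow_run_command(project_uri, entry_point, version, params=None,
                             backend=None, backend_config=None):
    """Return the argv list for an `mlflow run` invocation pinned to a Git reference."""
    argv = ["mlflow", "run", project_uri, "-e", entry_point, "--version", version]
    if backend:
        argv += ["--backend", backend]
    if backend_config:
        argv += ["--backend-config", backend_config]
    # Sort parameters so the generated command is deterministic.
    for key, value in sorted((params or {}).items()):
        argv += ["-P", f"{key}={value}"]
    return argv


command = build_mlflow_run_command(
    project_uri="https://github.com/example-org/intent-model",  # hypothetical repo
    entry_point="train",            # one of the entry points named in the talk
    version="v1.2.0",               # hypothetical release tag
    params={"epochs": 3},           # hypothetical parameter
    backend="databricks",
    backend_config="cluster.json",  # hypothetical cluster spec file
)
print(shlex.join(command))
```

Pinning `--version` to a tag or commit is what ties a tracked run back to an exact state of the source code.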

– [Yong] Now suppose we have a production-grade trained NLP model for intent classification of emails. Does that mean we can now deploy to production? Not so soon, actually. That’s because there are unavoidable differences in logic in the deployment environment, which we discussed in the challenges part. That’s why we created three progressively evolved models for final deployment in a host environment; in our case, that’s SageMaker. First, we created a fine-tuned transformer classifier. Then we wrapped the same classifier with pre-processing and post-processing steps, which we call pre-score and post-score filters, and this entire wrapped pipeline becomes an sklearn pipeline. This is the pipeline shown in the middle of the diagram. The reason we want some pre-score filters is that we want additional logic, such as whether to parse the email to get only the current reply message body or the full email thread. Or, if the email is too big, we may decide not to score it at all. Or maybe we want to use some caching mechanism: when the email content is exactly the same, we can return the cached prediction result. For the post-processing part, the post-score filters, we could also add additional model metadata to the response, so that we can track provenance from the caller side. Note that there is no model logic change in the classifier itself, but having this second model pipeline is much more flexible. Lastly, in the production environment, we don’t want the model to access our private GitHub, because accessing a private GitHub requires either a GitHub token or an SSH key, which are security concerns in an enterprise production environment. So we created a third model, which packages our private Python dependencies into private wheels and then burns them into the Docker image, so that at deployment time the model can reference them without accessing the private GitHub.
So now we are ready to fully automate the deployment through CI/CD tools. For the CI part, the continuous integration part, we use CircleCI. The CircleCI pipeline not only does unit testing and style enforcement but also runs the entire chain of train, test, wrap and deploy all the way to a SageMaker endpoint in staging, using a subset of the training data. This safeguards any code check-in. Note that several steps of CircleCI reuse the same MLproject entry points we discussed earlier. This also allows us to break the dev-prod divide, because we could use the same CI pipeline to run experimental code or model changes without reinventing the wheel. Now for the CD part, that’s continuous delivery and rollback, we use Concourse at Outreach. This pipeline has two well-defined human gates. First, to trigger full model building, a designated person needs to kick off the entire pipeline. Once it passes a regression test, which makes sure that we are not getting a worse model than the previous one, there is a second human gate where a person needs to promote the model to a production endpoint. Here, in the last step, we show that we can deploy to AWS SageMaker in the US East and US West regions. In the CI/CD automation, we not only log the model but also register the model using the MLflow model registry. From the model registry, we can clearly see which version of the model is in production and which is in staging. And if you are curious about more details of the model, you can just click the version link and find the provenance information of the model. So now that we have gone through the four life cycle stages of the implementation, how well did we address the four challenges we talked about at the beginning? We feel like we did pretty well on all four stages in terms of provenance checking, except for the model dev stage. We also did well in overcoming the other three challenges.
In particular, we felt like we did very well in bridging the dev-prod divide and avoiding arbitrary uniqueness during model pre-prod, where we wrote the production code and published the transformer flavor model, making the model code and the deployment process reusable and repeatable. However, one area we are not fully satisfied with is model staging: we did not really test with production streaming traffic, which could have been done through some A/B testing mechanism before we promoted the model to production. That’s something we will address in the future. So in conclusion, we highlighted four typical enterprise AI implementation challenges and how we solved them with MLflow, SageMaker and CI/CD tools. Our intent classification model for guided engagement has been deployed in production and is in operation using this framework. As next steps, in addition to what we mentioned about testing the model in staging using an A/B testing kind of mechanism, we are also addressing the following things. First, incorporating the model-in-production feedback loop into the annotation and model dev cycle. Second, we are further improving the annotation pipeline to have seamless human-in-the-loop active learning and model validation. Finally, we would like to thank everyone in the Data Science Group at Outreach who has contributed to and supported this project. If you are interested in knowing more details about our experience and about the Outreach platform, please contact us at the email addresses shown on the screen. Thank you very much.
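
The pre-score and post-score filters described above can be sketched in plain Python. This is only an illustration under stated assumptions: the talk uses an sklearn pipeline around a fine-tuned transformer, whereas `toy_classifier`, `ScoringPipeline`, the size limit and the version string below are hypothetical stand-ins.

```python
import hashlib


def toy_classifier(text):
    """Stand-in for the fine-tuned transformer classifier (hypothetical)."""
    return "positive" if "meet" in text.lower() else "other"


class ScoringPipeline:
    """Wraps an unchanged classifier with pre-score and post-score filters."""

    def __init__(self, classifier, max_chars=5000, model_version="3"):
        self.classifier = classifier
        self.max_chars = max_chars          # hypothetical size limit
        self.model_version = model_version  # hypothetical registry version
        self._cache = {}
        self.classifier_calls = 0

    def score(self, email_body):
        # Pre-score filter 1: do not score oversized emails at all.
        if len(email_body) > self.max_chars:
            return self._respond(None, skipped=True)
        # Pre-score filter 2: return the cached result for identical content.
        key = hashlib.sha256(email_body.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.classifier_calls += 1
            self._cache[key] = self.classifier(email_body)
        return self._respond(self._cache[key], skipped=False)

    def _respond(self, intent, skipped):
        # Post-score filter: attach model metadata so callers can track provenance.
        return {"intent": intent, "skipped": skipped, "model_version": self.model_version}


pipeline = ScoringPipeline(toy_classifier, max_chars=50)
first = pipeline.score("Happy to meet on Tuesday")
second = pipeline.score("Happy to meet on Tuesday")  # served from the cache
oversized = pipeline.score("x" * 51)                 # skipped by the size filter
```

The classifier function is never modified; all deployment-specific behavior lives in the wrapper, which is the decoupling the talk describes.
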
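
The regression test and the second human gate can be pictured with the following minimal sketch. The talk does not say which metrics or thresholds are compared, so the metric names, values and the "higher is better" assumption are hypothetical; `"Staging"` and `"Production"` mirror the stage names used by the MLflow model registry.

```python
def passes_regression_gate(candidate, production, tolerance=0.0):
    """True when the candidate is no worse than production on every production metric.

    Assumes every metric is "higher is better" (an assumption of this sketch).
    """
    missing = set(production) - set(candidate)
    if missing:
        raise ValueError(f"candidate is missing metrics: {sorted(missing)}")
    return all(candidate[name] >= production[name] - tolerance for name in production)


def next_stage(current_stage, gate_passed, human_approved):
    """Promote from Staging to Production only if both the test and the human gate pass."""
    if current_stage == "Staging" and gate_passed and human_approved:
        return "Production"
    return current_stage


production_metrics = {"f1": 0.90, "accuracy": 0.92}  # hypothetical values
better_candidate = {"f1": 0.91, "accuracy": 0.92}
worse_candidate = {"f1": 0.85, "accuracy": 0.95}

gate_ok = passes_regression_gate(better_candidate, production_metrics)
gate_bad = passes_regression_gate(worse_candidate, production_metrics)
stage_without_approval = next_stage("Staging", gate_ok, human_approved=False)
stage_with_approval = next_stage("Staging", gate_ok, human_approved=True)
```

Keeping the automated check and the human approval as separate inputs matches the pipeline described above, where passing the regression test is necessary but a person still makes the promotion decision.
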

About Yong Liu

Yong Liu is a Principal Data Scientist at Outreach, working on machine learning, NLP and data science solutions to solve problems arising from the sales engagement platform. Previously, he was with Maana Inc. and Microsoft. Prior to joining Microsoft, he was a Principal Investigator and Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), where he led R&D projects funded by the National Science Foundation and Microsoft Research. Yong holds a PhD from the University of Illinois at Urbana-Champaign.

About Andrew Brooks

Andrew is a Senior Data Scientist at Outreach, where he focuses on developing and deploying NLP systems to provide intelligence and automation to sales workflows. Previously, Andrew was a Data Scientist at Capital One, working on speech recognition and NLP, and at Elder Research, consulting in domains spanning government, fraud, housing, tech and film. Before discovering machine learning, Andrew was an aspiring economist at the Federal Reserve Board, forecasting macro trends in emerging markets. Andrew holds an MS in Mathematics and Statistics from Georgetown University and a BS and BA in Economics and International Studies from American University.