Scaling Production Machine Learning Pipelines with Databricks


Conde Nast is a global leader in the media production space, housing iconic brands such as The New Yorker, Wired, Vanity Fair, and Epicurious, among many others. Along with our content production, Conde Nast invests heavily in companion products to improve and enhance our audience’s experience. One such product is Spire, Conde Nast’s service for user segmentation and targeted advertising for over a hundred million users. Spire consists of thousands of models, many of which require individual scheduling and optimization. From data preparation to model training to inference, we’ve built abstractions around the data flow, monitoring, orchestration, and other internal operations. In this talk, we explore the complexities of building large-scale machine learning pipelines within Spire and discuss some of the solutions we’ve discovered using Databricks, MLflow, and Apache Spark. The key focus is on production-grade engineering patterns, the inner workings of the required components, and the lessons learned throughout their development.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi, I am Max Cantor. I’m a software engineer of machine learning at Conde Nast. Conde Nast is a global mass media company with many brands both in print media and on the web, many of which you may be familiar with, such as “Vogue,” “Wired,” and “The New Yorker.”

Digital Scale at Condé Nast

As a global mass media company, we serve 28 markets and have an audience of over 240 million. We have over a hundred million monthly unique visitors, hundreds of millions of monthly page views, and over 100,000 monthly new or revised pieces of content, so we have a lot of data. And while Conde Nast is a company that’s been around for a long time, our machine learning infrastructure has not. So, in thinking about where to start with building a machine learning infrastructure, we wanted to think about the key issues in the machine learning ops, or MLOps, landscape and look at what several of the more established tech companies are currently doing.

Landscape of MLOps

So in terms of the landscape, we first identify datasets. For datasets, you need to be thinking about the file systems and how the data are partitioned. You need to think about accessibility, so things like standardizing your schemas, how frequently the data need to be updated, and how they’re documented. You need to be thinking about the software development kits, so things like standardized libraries for your team or across your company for your various projects. You need to think about the data abstractions and object-relational mappings, command-line interfaces for developers and power users, and data cleaning, so if you have things like nulls or, as a media company, concerns like bots as opposed to actual users.

The next key issue is modeling. These are things like the libraries that you’re going to use for your model algorithms, so things like Keras, Spark MLlib, scikit-learn, and PyTorch; the deployment and configuration of your model architecture; and how to build a unified framework across all of your different projects for different kinds of models serving different business goals. There’s scheduling, so thinking about building the feature sets and training the models, and then also how to do that with big data: dealing with parallelization and thinking about the logistics for batch versus streaming jobs.

And from there, the next key issue is inference. This is how to package and deploy your trained models, especially when they get very large, and thinking about delivering your results to your varied consumers. These can be things like recommendation systems that actually go to end-users, but they could also be metrics that get sent to a business analytics team, or metrics that get sent to a social media team to determine the virality of various articles and other content. This could include how to version and store your model estimates and parameters and things of that nature. And along those lines, our final key point: tracking. This includes thinking about things like feature drift and inference decay, especially as a media company. The features that are predictive of audience behavior now might not be predictive a month from now, and a model trained on data that is current might not be relevant to test on data a month from now. So along those lines, we also need to be tracking model evaluations, so things like ROC AUC, cross-validation, and A/B testing.
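A drift check like the one described above can be sketched with the Population Stability Index, a common heuristic for comparing a feature's training-time distribution against a more recent window. Everything here (the thresholds, the data) is illustrative, not Spire's actual monitoring code:

```python
import math
from collections import Counter

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    A common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is moderate
    drift, and > 0.25 suggests the model may need retraining.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in sample)
        # Smooth empty bins slightly to avoid log(0).
        return [(counts.get(b, 0) + 1e-6) / len(sample) for b in range(bins)]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# A feature from the training window vs. a shifted distribution a month later:
train_scores = [0.1 * i for i in range(100)]
recent_scores = [0.1 * i + 3.0 for i in range(100)]
print(psi(train_scores, train_scores) < 0.1)    # stable against itself
print(psi(train_scores, recent_scores) > 0.25)  # flags the drift
```

In production this kind of check would run per feature on each scoring batch, alongside the evaluation metrics mentioned above.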

MLOps Across the Industry

And we see across the industry that major companies are doing these things. For instance, Uber has Michelangelo and Ludwig. Michelangelo is a platform that does MLOps at scale and deals with various points of data partitioning, model pruning, and sharding of their deployments, all to make their big data more efficient. Ludwig is a data and model abstraction framework.

Facebook has FBLearner, which has experiment management through a custom typed UI.

Airbnb has Bighead, which deals with automated notebook-to-production pipelines. They have Redspot and Deep Thought, which serve a similar role to how we use Databricks in taking notebooks on remote servers and productionalizing them. Netflix has Metaflow, which uses a dataflow DAG paradigm for their workflow, with robust support for error handling integrated within that. And finally, many companies are beginning to forgo their own internal architectures altogether and use things like TensorFlow Extended (TFX) and Kubeflow, or Spell, or Weights & Biases. TFX and Kubeflow, in particular, are something that Spotify has adopted. So with all of that said, thinking about the key issues and how MLOps is being implemented across the landscape, Jamie will now discuss MLOps at Conde Nast.

– [Jamie] Thanks, Max. I’m Jamie; I’m another software engineer here at Conde Nast, working on Conde’s machine learning platforms. Like Max mentioned, there is a fairly wide array of other companies devoting a lot of resources to stabilizing machine learning infrastructure for their own workflows, and we’re no different in media. We have a pretty substantial array of different things that we want to be focusing on and using machine learning to increase value.

There are things like recommendation and personalization, just to improve the experiences of users reading different pieces of content across brands. We want to provide better targeted advertising to people and bucket users into the correct segments based on what we identify as their interests. Then we want to be able to use our own content to extract data for topic modeling, feature generation, and other sorts of analysis. And for virality prediction, like Max mentioned, we want to be able to serve content to users at a higher frequency if we’re able to assess that it’s something a large number of users are gonna want to read.

But there are certain individual peculiarities that come with working outside of tech proper. There are just things about the media industry that don’t totally match up with how we, as engineers, want to be handling problems. There are things from upper-level management that are handled in a different way than most engineers would be comfortable with, in terms of prioritizing deadlines and speed, which may force you to sacrifice integrity on the software engineering side. Conde Nast is a really historic, old company; it’s been around for quite a while. The tech team is fairly new, however, and so we often find ourselves in the middle of long-founded things that lead to a certain chaos in trying to unify them in the way you’d want for a more cohesive software platform. Then there are just staffing and technical constraints we have to deal with. We aren’t staffed to the degree that Google or Facebook might be, and have to work within a very peculiar set of constraints in order to build the projects we want.

But in spite of that, we have been devoting a lot of resources over the past few years to creating an end-to-end machine learning platform. So this is introducing one of our flagship machine learning products. It’s called Spire.

Spire Workflow Stages

The idea with Spire was to provide a tool that could facilitate every step in the machine learning life cycle for a particular model, making it something that helps data scientists iterate faster and derive insights from experiments more quickly, as well as relatively non-technical users who might want to run experiments independently or target certain user groups.

And so, the internal architecture relies on two runtimes that both adhere to the dataflow paradigm that, as Max mentioned, has been more generally adopted across machine learning infrastructures.

So for each thing we would want to model, at a more granular level of the workflow, we want to break it out into something that builds datasets, does model training, and performs inference and collects evaluation metrics about the performance of that model. We have abstractions in a core Python code base that is tied, via SQLAlchemy, to an artifact repository that allows us to track all of the different possible configurations for every stage in the life cycle that I mentioned.
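The stage breakdown described above (build datasets, train, infer, evaluate) can be sketched as a dataflow of configurable stages. This is a hypothetical simplification to show the shape, not Spire's actual code; all names are invented:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Stage:
    """One step in a model's life cycle: takes the shared context, returns updates."""
    name: str
    run: Callable[[dict], dict]

@dataclass
class ModelLifecycle:
    """A fixed sequence of stages; the config is recorded so runs are reproducible."""
    config: Dict[str, object]
    stages: List[Stage] = field(default_factory=list)

    def execute(self) -> dict:
        context = {"config": dict(self.config)}
        for stage in self.stages:
            context.update(stage.run(context))
        return context

lifecycle = ModelLifecycle(
    config={"model_id": "demo", "threshold": 0.5},
    stages=[
        Stage("build_dataset", lambda ctx: {"rows": [0.2, 0.7, 0.9]}),
        Stage("train", lambda ctx: {"cutoff": ctx["config"]["threshold"]}),
        Stage("inference", lambda ctx: {"preds": [r > ctx["cutoff"] for r in ctx["rows"]]}),
        Stage("evaluate", lambda ctx: {"positive_rate": sum(ctx["preds"]) / len(ctx["preds"])}),
    ],
)
result = lifecycle.execute()
print(result["positive_rate"])  # two of the three rows exceed the threshold
```

Tracking each stage's configuration in a database (as Spire does via SQLAlchemy) is what makes thousands of such runs auditable and repeatable.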

And then, for each individual model like that, at a higher level we replicate the same dataflow paradigm and encapsulate every individual model’s life cycle within its scope. We run the Spire core code base as Airflow jobs hosted by Databricks, and we use Astronomer, which is a managed Airflow service, to schedule and execute all of the Spark jobs.

And so, just to go into slightly more detail about each of these projects that fall under the scope of Spire more generally: Aleph is our feature computation engine. It processes features at the user level, handling around 50 million events per day, and contains all of the transformers to take data from the click streams, transform it into features that we know to be substantially predictive, and then write them out. So the architecture looks like this, generally. We currently have the capability to read from two different streams, both for legacy reasons and because that doubles as additional support: Kafka is our standard stream that we’re reading click events from, but in the event that there are issues with downtime there, we can fall back to Kinesis. And then the internals of Aleph are all transformers, an entirely Scala code base, to form the features that we want to serve downstream modeling tasks with, and these are all written out to Delta Lake, which provides us some really helpful features like snapshots of data and much faster querying for what are often very large datasets.
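As a rough illustration of the kind of clickstream-to-feature transformation described above: in Aleph this is a Scala/Spark job reading from Kafka and writing to Delta Lake, but a plain-Python stand-in with made-up event fields shows the shape of the aggregation:

```python
from collections import defaultdict

# Hypothetical raw click events; field names are invented for illustration.
events = [
    {"user": "u1", "brand": "wired", "event": "pageview"},
    {"user": "u1", "brand": "wired", "event": "pageview"},
    {"user": "u1", "brand": "vogue", "event": "pageview"},
    {"user": "u2", "brand": "vogue", "event": "pageview"},
]

def user_features(events):
    """Aggregate raw click events into per-user feature vectors."""
    counts = defaultdict(lambda: defaultdict(int))
    for e in events:
        counts[e["user"]][f"views_{e['brand']}"] += 1
    return {user: dict(feats) for user, feats in counts.items()}

features = user_features(events)
print(features["u1"])  # per-brand pageview counts for user u1
```

At Spire's scale the same aggregation runs as distributed Spark transformers, with the results snapshotted in Delta Lake for downstream model training.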

The next step is Kalos, which is our unified model interface library.

The goal with Kalos was to provide data scientists with the option of using any number of different libraries to facilitate a machine learning project. So, you know, you could distribute training with MLlib, or do something in scikit-learn or PyTorch or Keras, and Kalos would then, within the scope of Spire as a framework, abstract all the peculiarities of whatever model architecture the data scientist wanted to use into a unified interface that would be able to execute at scale and in production, regardless of the back end.
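A unified interface like the one Kalos provides might be sketched as a small abstract contract that every backend implements. The backend and registry names below are invented for illustration; real entries would wrap MLlib or scikit-learn estimators:

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """The contract every backend implements, so the pipeline can train
    and predict without knowing which library is underneath."""

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...

class ThresholdModel(Model):
    """Toy backend: learns the mean of positive examples as a cutoff."""

    def fit(self, X, y):
        positives = [x for x, label in zip(X, y) if label]
        self.cutoff = sum(positives) / len(positives)
        return self

    def predict(self, X):
        return [x >= self.cutoff for x in X]

# Hypothetical registry; real keys might be "sklearn" or "mllib".
BACKENDS = {"threshold": ThresholdModel}

def build_model(name: str, **params) -> Model:
    return BACKENDS[name](**params)

model = build_model("threshold").fit([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(model.predict([2.5, 3.6]))  # learned cutoff is 3.5
```

Because configuration selects the backend by name, the same pipeline definition can swap model libraries without code changes, which is the point of the unified interface.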

So right now, the primary libraries supported are Spark MLlib and scikit-learn. For really essential parts of the data science workflow, like versioning, serializing, and storing models, we’ve been using MLflow to track experiments, save them for later, and load them up in production environments for batch inference. And then we also have a segment of the library that performs hyperparameter tuning with Hyperopt, which can also be plugged into the configuration of the model architecture.
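The idea behind Hyperopt-style tuning (sample hyperparameters from a search space, score each candidate, keep the best) can be sketched without the library itself; the real pipeline would call `hyperopt.fmin` and log each trial to MLflow, and everything below is a dependency-free stand-in with invented parameter names:

```python
import random

random.seed(0)  # seeded so the sketch is reproducible

def objective(params):
    # Stand-in for a cross-validated loss; minimum is at depth=5, lr=0.1.
    return (params["depth"] - 5) ** 2 + (params["lr"] - 0.1) ** 2

# Each entry draws one candidate value, like hp.randint / hp.uniform in Hyperopt.
space = {
    "depth": lambda: random.randint(1, 10),
    "lr": lambda: random.uniform(0.01, 1.0),
}

def tune(objective, space, max_evals=200):
    """Random search: draw max_evals candidates, return the lowest-loss one."""
    trials = [{name: draw() for name, draw in space.items()} for _ in range(max_evals)]
    return min(trials, key=objective)

best = tune(objective, space)
print(best["depth"])  # converges near the optimum with enough trials
```

Hyperopt improves on plain random search with adaptive strategies such as the Tree-structured Parzen Estimator, but the objective/space/trials structure is the same.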

And so now I’m going to hand it back to Max, who’s gonna talk a little bit about some of the things that we thought about as we’ve been developing the platform.

– Thank you, Jamie. So, one thing I wanna follow up on, now that we’ve talked about the architecture, is the transition from a set of data science workflows or projects to what I think is really core to MLOps, which is ML as a software product. With that in mind, there are a set of things we’ve done to try to formalize and turn our architecture into something more like a product, one of those being standardized execution environments.

Execution Environments

So, as Jamie mentioned, we use Astronomer, which is a managed Airflow provider, so it provides things like Kubernetes clusters for hardware resource monitoring. It uses Docker images for different deployment environments, it provides a unified workspace, and it provides various deployment tools. And then our pipeline itself, Spire, runs on AWS cloud infrastructure managed through Databricks. Databricks also manages various aspects of the environment: it spins up the ephemeral AWS clusters, and it handles this through Databricks jobs. We also use Databricks interactive environments for development and for testing. And finally, we have various local development tools, including a command-line interface and API wrappers, and we’ve also started using features such as Databricks Connect so that, even when we’re developing locally, we can leverage our standardized environments and our cloud computing resources.

So the command-line interface was built using Click, which is a command-line interface package for Python.

Spire Command-Line Interface

And it’s built on top of the main code base as a separable sub-package, and it allows for easy querying of the database, creation and modification of models, and even execution of models, and it can do this across our various standardized environments. We see this as a tool both for developers and for what we call power users: people who aren’t necessarily privy to all of the inner workings of the architecture, who don’t necessarily understand our data abstractions, the object-relational mapping, or SQLAlchemy, but who still regularly need to interface with Spire.
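The actual Spire CLI is built with Click, but the general shape of such a tool (subcommands for querying and executing models against a chosen environment) can be sketched with the standard library's argparse; all command and flag names here are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A CLI skeleton: a global environment flag plus per-task subcommands."""
    parser = argparse.ArgumentParser(prog="spire")
    parser.add_argument("--env", default="dev", choices=["dev", "staging", "prod"],
                        help="which standardized environment to target")
    sub = parser.add_subparsers(dest="command", required=True)

    list_cmd = sub.add_parser("list", help="query models in the database")
    list_cmd.add_argument("--brand", help="filter models by brand")

    run_cmd = sub.add_parser("run", help="execute a model")
    run_cmd.add_argument("model_id")

    return parser

args = build_parser().parse_args(["--env", "staging", "run", "model-123"])
print(args.env, args.command, args.model_id)
```

In Click the same structure falls out of `@click.group()` and `@click.command()` decorators, which is what makes it pleasant to layer over an existing code base as a separable sub-package.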

API Wrappers

And then we also have API wrappers. These wrappers bridge the Airflow operators and the Spark jobs by way of the Databricks API. This streamlines our development process and allows us to test new features and enhancements across our various environments on Databricks before productionizing them.
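A wrapper along these lines might look like the following sketch, which builds (but does not send) a one-time run request against the Databricks Jobs API's `runs/submit` endpoint. The cluster settings, paths, and class name are illustrative, not Spire's actual wrapper:

```python
import json
from urllib import request

class DatabricksClient:
    """Thin hypothetical wrapper over the Databricks Jobs API."""

    def __init__(self, host: str, token: str):
        self.host = host.rstrip("/")
        self.token = token

    def submit_run_payload(self, notebook_path: str, cluster: dict) -> dict:
        """Build the JSON body for a one-time notebook run."""
        return {
            "run_name": f"spire:{notebook_path}",
            "new_cluster": cluster,
            "notebook_task": {"notebook_path": notebook_path},
        }

    def submit_run(self, notebook_path: str, cluster: dict) -> request.Request:
        """Prepare the authenticated POST request (caller decides when to send)."""
        body = json.dumps(self.submit_run_payload(notebook_path, cluster)).encode()
        return request.Request(
            f"{self.host}/api/2.0/jobs/runs/submit",
            data=body,
            headers={"Authorization": f"Bearer {self.token}"},
            method="POST",
        )

client = DatabricksClient("https://example.cloud.databricks.com", "dapi-example-token")
req = client.submit_run("/Spire/train_model",
                        {"spark_version": "7.3.x-scala2.12", "num_workers": 2})
print(req.full_url)
```

An Airflow operator can then call such a wrapper from within a DAG task, which is the bridge between the scheduler and the Spark jobs described above.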

And so, now Jamie will elaborate more on our process and also on where we’ll be heading in the future.

– All right, thanks, Max.


Release Management


So thanks, Max.

One of the things that we’ve started spending a lot of time thinking about over the course of developing all of the data science related software products here is that this is a very rapidly evolving space. There are a lot of things that might be standard in more traditional software development but still remain unresolved, or at least harder to match with data science workflows. Everyone in data science is generally familiar with phenomena like some very important production process that exists in some rogue notebook somewhere, and only that developer knows where it is, or similarly an untracked, anonymous EC2 instance that is also very integral to the business. These are things that, in more traditional software engineering, are taken for granted, but we’ve made substantial efforts to bring them to our data science processes and encouraged data scientists to follow certain patterns, which removes the burden of confusion and things being all over the place without a central mode for developers to work within. Things like this include versioning: we wanna make sure that, as substantial releases are made, or substantial changes to the architecture or new features are added for a project, we version them, cut new releases, and make sure that the binaries are available throughout a series of environments where data scientists and developers have access to them. GitHub repositories are another thing that sounds like a non-factor but is not a (laughs) non-factor: we make sure that things make their way into Git, where we have version tracking, and not just into notebooks somewhere.

And then, the final box that we wanna check for stability and peace of mind is a CI/CD system. This is where we’ve integrated a lot of our testing, insofar as we can for data science libraries, and various quality checks, to allow, you know, the most stability that we can for a software product process that demands a little bit more freedom than regular procedure.

So these are things that we’ve been working to solve, but there are, as with any large software product, a number of unresolved questions or things that we wanna do in the future that I think bear mentioning. The first is model reduction. As I mentioned earlier, we’re running thousands of models in production. While this has been a real testament to the capabilities of Databricks and our own engineering team’s ability to scale things successfully, we need to ask ourselves whether we really want to do this, what the real utility of having this many models in production is, and whether we could re-evaluate and leverage the flexibility of our infrastructure to reduce the number of models we have in production while still solving the same categorization problems. The next, which is partially wrapped into the first point, is enhancing the model architectures that we’re using throughout Spire. We have a fairly cookie-cutter arrangement for the architecture that we’re using for machine learning problems, but we want to provide better support and extend tools to users who want to use PyTorch or Keras, since we’re pretty Spark ML and scikit-learn reliant right now.

Then, also on the note of flexibility, one of the more locked-in areas that we have is how we build feature sets. We have certain ways to transform data that we know are substantially predictive, but we want to give people the option of finding their own ways to transform data, mix it, and derive from different sources features for models that can then be fairly easily moved into production systems.

And then, the last of these is whether or not we really want to have all three of these projects, Spire, Kalos, and Aleph, existing as independent projects in the first place. Might it be easier to manage things and provide the same utility if we merged these all into a single entity and considered everything that happens in the machine learning life cycle, from start to finish, as just part of one library? And then, even further, Max alluded to this: there are a few very large and successful tech shops that see no need to build internal ML infrastructure tools themselves and can solve every problem they need to solve via tools like Kubeflow or TFX, building additional infrastructure on top of those libraries as required as their needs deviate from the core platform.

So, these are all things that we wanna think about in the future. But for now, we have a very small but devoted and talented team of core engineers who, for a company that isn’t pressed for resources but may not have as many as a larger tech company, have built a really productionalized, scalable machine learning platform. So we wanna thank them, as well as Databricks, for providing us tools like MLflow and Delta Lake that have really transformed the process and eliminated a lot of its complexity. And so with that said, thank you, everybody.

Thank you!


About Max Cantor

Condé Nast

Max Cantor is a Software Engineer of Machine Learning at Condé Nast. He designs and maintains machine learning platforms that scale to thousands of models and terabytes of data in a production environment. He is an Insight Data Engineering Fellow, received a Master’s degree in Cognitive Psychology and Cognitive Neuroscience from the University of Colorado Boulder, and graduated with Honors from the University of Michigan with a Bachelor’s degree in Psychology.

About James Evers

Conde Nast

I'm a Software Engineer based in NYC.