Condé Nast is a global leader in the media production space housing iconic brands such as The New Yorker, Wired, Vanity Fair, and Epicurious, among many others. Along with our content production, Condé Nast invests heavily in companion products to improve and enhance our audience’s experience. One such product solution is Spire, Condé Nast’s service for user segmentation and targeted advertising for over a hundred million users.
While Spire started as a set of Databricks notebooks, we later used DBFS to deploy Spire distributions as Python wheels, and more recently we have packaged the entire production environment into Docker images deployed onto our Databricks clusters. In this talk, we will walk through the process of evolving our Python distributions and production environment into Docker images, and discuss where this has streamlined our deployment workflow, where there were growing pains, and how to deal with them.
Harin Sanghirun: Hello everyone, and thank you for joining us today. My name is Harin Sanghirun and here with me is Max Cantor. We are from the data science group at Condé Nast, and today we’ll be giving a talk about bringing your own container: using Docker images in production and how it helps us streamline our processes. First, some introduction about Condé Nast. Condé Nast is a global leader in media, featuring many iconic brands, such as The New Yorker, WIRED, Vanity Fair, Epicurious and many more. Some examples of the machine learning tasks we work on here at Condé Nast include content recommendation and audience segmentation.
Specifically, today we will be talking about our audience segmentation platform, called Spire. Spire is a platform for user segmentation and ad targeting, which analyzes over a hundred million users on a daily basis. Last year, we also gave a talk about how we scale the production machine learning pipeline in Spire to over 1,700 models that help score over a hundred million users every day. This slide also has the link to that talk, so if you’re interested, please feel free to go watch it after this one.
As an overview of today’s presentation, we will be breaking it into three parts. The first part is a high-level overview of the Spire architecture and how using Docker containers streamlines that process. Next we’ll go into detail on how we actually implemented Docker in Spire, and last we’ll talk about what we learned from the process itself. Getting into the high-level overview of how Docker streamlines Spire deployment: from a thousand-foot view, Spire is essentially just a platform that ingests data from our first, second and third party sources.
It performs machine learning tasks such as pre-processing, training, scoring and model lifecycle management. For example, training each model on a weekly basis and running the latest model or the best model on users on a daily basis. Then once we have the scores, we upload the data into a Delta Lake that gets read by downstream processes. Most of these processes fall into two categories. One is ad targeting and the second one is content recommendation. From the user’s side, this shows up either as more relevant ads or as recommendations of content that might be relevant to that user.
Digging down a little deeper, Spire contains three main components. The first component is called Kalos, which is our modeling library. Kalos handles things like model interface standardization, serialization, versioning, tracking and hyperparameter tuning. Kalos is also fully integrated with MLflow and Spark MLlib. Next is our more standard software application, which is the Spire Core Library. This component handles the database management, scheduling and orchestration. The last component is the Data Science Common Library, which is just a utility library that we share among the members of our team.
Here you can see the diagram of how we deploy each component of Spire to production without a container. This is what we were using previously. Each component of Spire has its own deployment pipeline, and this pipeline mainly involves two things. One is quality control, which is done through automated testing and code review. When the automated testing passes and the code gets approved, we have an automatic deployment system which creates a Python wheel and then uploads it to DBFS. As you can see, each component of Spire has its own separate wheel and it is versioned according to the standard. Then a downstream task which uses Spire will install these wheels individually, each sub-component individually, to perform some task.
These wheels can be installed either at the cluster level or through %pip magic in a notebook. As you can see, the arrows here show the dependency of each task on each wheel, and this can get messy quite quickly, since each task has the flexibility to choose its own version of Spire and its own versions of the other components in Spire. It’s also difficult to be confident that all of the versions work together perfectly. Next is pretty much the same use case but with Docker. As you can see, the pipeline now looks much cleaner. The pipeline with Docker first starts with a Dockerfile. We use the Dockerfile to specify the Docker image, in which we install each of the components straight from the source, so no more wheels.
Then once we build this image, it goes through automated testing to make sure that all of the components work together correctly. Once that passes, it automatically gets uploaded to a container registry. In this case, we are using GitHub Container Registry. From there, a Spire downstream task just has to reference one version of the image to select which version it wants to use. This means that it doesn’t have to specify three different things. This single image is also tested to make sure that the three different components work well together.
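As a rough illustration of the "one reference instead of three" idea, the sketch below models each published image as a single tag that stands for a tested combination of component versions. Only the Spire version 3.3.1 appears in this talk; the Kalos and Data Science Common version numbers, the `ImageRelease` class, and the `ghcr.io/example-org/spire` repository path are hypothetical placeholders, not the team's actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRelease:
    """One container image and the component versions baked into it."""
    tag: str
    spire: str
    kalos: str
    ds_common: str


# Hypothetical catalogue of tested combinations (placeholder versions).
RELEASES = {
    "3.3.1": ImageRelease(tag="3.3.1", spire="3.3.1", kalos="1.4.0", ds_common="0.9.2"),
    "3.3.0": ImageRelease(tag="3.3.0", spire="3.3.0", kalos="1.3.2", ds_common="0.9.1"),
}


def image_reference(tag: str, repository: str = "ghcr.io/example-org/spire") -> str:
    """Return the single image reference a downstream task needs to pin."""
    if tag not in RELEASES:
        raise KeyError(f"{tag!r} is not a tested image release")
    return f"{repository}:{tag}"


if __name__ == "__main__":
    print(image_reference("3.3.1"))
    print(RELEASES["3.3.1"])
```

A downstream task pins only the tag; the component versions behind it are fixed when the image is built and tested.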
To recap the benefits of using containers and how we use them: first, we pre-package all of our dependencies into a container image, where each image represents a tested combination of the packages that is also linked to a specific Databricks runtime and a specific version of Spire. One obvious benefit you can see from the diagram is that we have fewer pipelines to manage. Another benefit is that engineers like me and Max have explicit and upfront control over the dependency versions, the version of each of the sub-components.
This means that we can select the appropriate sub-component versions for the version of Spire that we want to run, and then package that as an image. When end users use Spire, they just select from one of the already tested combinations that we have designed. Next, before we go into the implementation details, I’ll give you some introduction to how containers work on Databricks. Using containers on Databricks consists of four main steps. The first step is to choose a base image. In the majority of cases, the user will be using the standard base image, which is the image that you get when you create the default Databricks cluster for a notebook job. If you have more specific requirements, there’s also a wide variety of images that you can choose from.
For example, the minimal image has just the requirements needed to start a cluster, but you won’t be able to run a notebook on it. This might be suitable for something like a JAR task. Or if you have a machine learning model that requires accelerated training with GPUs, then you might start out with a GPU image. In this slide, I have also linked to the repository that stores the definitions of these images. If you want to learn more, feel free to dig into the Dockerfiles there. Next, after you have chosen your base image, you just have to add the dependencies that you want to be packaged in that image. In our case, this consists of running pip install on the source code of the three components: Kalos, Spire and Data Science Common.
Additionally, if you have Ubuntu binaries that you want to rely on, you can install those as well, since the base image is based on Ubuntu. After you have defined your image and built it, the next step is to push it up to a Docker registry. There are two recommended registries here. One is AWS ECR and the second is Azure Container Registry, but any registry that supports basic authentication works as well. In our case, we have chosen GitHub Container Registry, with authentication through basic authentication, because [inaudible] has turned out to work the best for us.
After you have chosen your base image, decided what to include in that image and pushed it up to a container registry, you can now start a cluster with that image. To do that, if you are using the UI, you go to the create cluster UI, and there will be a check box saying that you want to start the cluster with your own Docker container. Once you have clicked that check box, there will be an option for you to put in the URL of your container and the authentication that you would like to use, and then you’re done.
You have a cluster preloaded with everything you need. No more fiddling with libraries at the cluster level or at the notebook level. Next, Max will give you the details of how we implemented Docker within Spire and the lessons that we came out with. Max…?
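The same cluster setup that Harin describes for the UI can also be written as a cluster specification for the Databricks Clusters API, where the container is given under a `docker_image` field with a `url` and optional `basic_auth` credentials. The sketch below only builds that payload; the cluster name, image URL, runtime string, node type and worker count are placeholder assumptions, and submitting the payload to a workspace is left out.

```python
import json


def build_cluster_spec(image_url, username, password,
                       spark_version="7.3.x-scala2.12",
                       node_type_id="i3.xlarge",
                       num_workers=2):
    """Build a cluster spec that starts a cluster from a custom container image.

    All default values are illustrative placeholders, not the settings
    used in the talk.
    """
    return {
        "cluster_name": "spire-container-example",  # hypothetical name
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "docker_image": {
            "url": image_url,
            # Basic authentication, as used with a registry such as ghcr.
            "basic_auth": {"username": username, "password": password},
        },
    }


if __name__ == "__main__":
    spec = build_cluster_spec(
        image_url="ghcr.io/example-org/spire:stable",  # hypothetical image
        username="<registry-user>",
        password="<registry-token>",
    )
    print(json.dumps(spec, indent=2))
```

The runtime version in `spark_version` has to be one that the image was built for, which is the compatibility point Max raises later in the talk.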
Max Cantor: Thanks Harin. Yeah, so now that we’ve discussed the basics of Spire and of containerization on Databricks, I’ll talk about some of the specifics of our use case. We use Databricks minimal as our base image, using Ubuntu 18.04. In our case, we actually created our image before DBR 7.x functionality had been included in the Databricks base images, so we created custom DBR 7.x functionality, building on top of Databricks minimal. In addition to that base image, we also include the Spire package itself, as well as the aforementioned sub-packages like Kalos and Data Science Common. Then we also include all of the dependencies for all of these packages via their requirements.txt files.
Then as for [inaudible] dataset, we host these on GitHub Packages, ghcr. We have two primary packages: the production package, which is Spire, and the development package, which is Spire dev. However, in each case we can host multiple images per package. In the case of production, for instance, we’ll have a tag for latest, one for stable, stable usually being what we actually have in deployment, and then any number of older versions as well. Likewise, we can push manually built development packages. These could be features that we’re not ready to deploy yet, but still want to be able to use in some out-of-band fashion. They can also just be for the purposes of development. These are tied to the GitHub release tags, or the version number for the package. It’s a very streamlined process and it also integrates really well with our CI/CD pipeline.
We use GitHub Actions for CI/CD. In our case, when you push a commit, such as to a PR for instance, it will go through our pytest suite, where it will run the tests on Ubuntu as well as Mac. In the case of Mac, it’s going to run the pytest suite from a clean environment, as if it were local development. For Ubuntu, however, it’s going to use Docker Compose to actually run the tests through the Docker container itself. Then in the case of release tags, it automates the build and deploy process of the Docker image containing Spire and all the dependencies and sub-modules and so on, which can then be passed to a Databricks job or to our Airflow deployments in production.
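The tagging scheme Max describes (a version tag tied to the release, plus moving `latest` and `stable` tags) could be computed along these lines. This is a minimal sketch under assumptions: the repository path is hypothetical, and the rules that `latest` moves with every release while `stable` moves only on request are guesses, not stated in the talk.

```python
def image_tags(release_tag, repository="ghcr.io/example-org/spire", mark_stable=False):
    """Return the image references to push for a given Git release tag.

    Assumes release tags look like "v3.3.1" or "3.3.1".
    """
    # Drop a single leading "v" so the image tag matches the package version.
    version = release_tag[1:] if release_tag.startswith("v") else release_tag
    tags = [f"{repository}:{version}", f"{repository}:latest"]
    if mark_stable:
        # Only releases promoted to deployment also move the "stable" tag.
        tags.append(f"{repository}:stable")
    return tags


if __name__ == "__main__":
    for ref in image_tags("v3.3.1", mark_stable=True):
        print(ref)
```

A development build would follow the same shape with a different repository argument, for example a hypothetical `ghcr.io/example-org/spire-dev`.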
Now I’ll discuss the various pros and cons, the things that we’ve learned over the course of this development. One thing I’d like to state up front is that while we do have some cons, the pros far outweigh the cons. This has been a really advantageous feature for us. In many cases, these cons are things that we anticipate we’ll be able to solve over time, especially since we have good communication with Databricks and they’re often working with us to add new features. I do want to state that up front, but I still think it’s important that we discuss some of the learning curves in addition to all the benefits.
First the pros. The most obvious one, of course, it being containerization, is just the degree to which it automates and simplifies our control over the module itself and all of the dependencies. Things like Kalos, things like Common, our requirements.txt files for all of our pip packages: all of these things are much more streamlined, as [inaudible] should. It also has, as we’ve said, fluid integration with our existing deployment pipeline. I mentioned the GitHub Actions CI/CD, the pytest integration, the multiple OS support and the release tagging, but on top of all that, it even gives us a test database integration as well. We have integration tests within our pytest suite, which use a Postgres database, and we’re able to create a Docker volume, which we can spin up and tear down.
That allows us to create a clean working environment from which to test the actual database connections and the data flow through them. It’s very streamlined. Additionally, we can not only use those images for our Airflow deployments in production, but we can also have Databricks jobs, which through their graphical user interface can very easily be pointed to different images. These are for processes that, maybe for business or logistical reasons, we’re not necessarily ready to enter into our stable code base or into our production deployment, but that we still want to be able to run out of band, sooner rather than later. In that regard, we can have Databricks jobs and our main deployments running concurrently in a manner that has as high [inaudible] as possible, where maybe the only major difference is just some feature that we need for that Databricks job.
There’s also ease of debugging. For instance, when you create your container and have your image locally, you can run those pytest tests through Docker Compose and then use pdb set_trace calls and just trace through the test framework. Again, with that Docker volume, you can do that even for the integration tests, and do it literally just as easily as if it were in your local environment, per se. On top of that, you can even SSH into the container itself and see specifically what is inside that container, outside of just the pytest development context. It really is, I can’t stress enough, just as easy, if not easier, than local development. There’s none of that [inaudible] you sometimes get when you add these new tools into your development process.
Now I’ll move on to some of the cons, again with the caveat that the pros far outweigh the cons, but I’ll go through these as well. One is DBR version compatibility. For instance, right now, DBR 8.x is not supported, only 6.x and 7.x. When we created our image, it was for 7.x, but this was before 7.x was supported, so we had to create our own custom base image. In general, it’s likely the case that as you’re developing over time, whether it’s to manage various sub-modules or other dependencies, you’ll need to do your own custom work on that image.
There’s also some pip package management involved and matching between the runtime specifications and what you have in your image. For instance, there was a point where we were mostly on 7.x but still had a few tasks that required 6.x clusters. Some of the differences between 6.x and 7.x such as for instance, the different versions of Spark then require different versions of other packages. We had to have separate requirements and separate images for these processes. Ultimately, it’s still much easier that way than having to do all of these things ad hoc, but it is something that you need to keep in mind.
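One way to keep separate requirements and separate images per runtime, as described above, is a small lookup keyed on the DBR major version. This is only a sketch: the requirements file names and image tag suffixes are hypothetical, and the supported set (6.x and 7.x) is simply what the talk states at the time.

```python
# Hypothetical per-runtime build targets; file names and suffixes are placeholders.
BUILD_TARGETS = {
    6: {"requirements": "requirements-dbr6.txt", "image_suffix": "dbr6"},
    7: {"requirements": "requirements-dbr7.txt", "image_suffix": "dbr7"},
}


def build_target(runtime_version):
    """Pick the requirements file and image suffix for a runtime such as "7.3"."""
    # The major version is the part before the first dot, e.g. "7" in "7.x".
    major = int(runtime_version.split(".")[0])
    if major not in BUILD_TARGETS:
        raise ValueError(f"No image is defined for DBR {runtime_version}")
    return BUILD_TARGETS[major]


if __name__ == "__main__":
    print(build_target("6.4"))
    print(build_target("7.x"))
```

Failing loudly on an unlisted runtime keeps a job from silently starting with an image built for a different Spark version.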
Also, these image sizes can get very large. If you’re used to using Docker images for deployment processes, you might have your images be somewhere in the range of about 100 to 500 megabytes. Including the Databricks runtime is going to make the image much larger: as you can see here, the Databricks runtime standard image is 1.84 gigabytes. In our case, when you include Spire and the sub-modules and dependencies and so on, it ends up being over two gigabytes per image. It is worth keeping in mind that these are rather large. On top of that, it does require prior Docker experience to customize the image and to be aware of those local memory constraints. There are also cases where, again, if you’re doing development with these images, Docker sometimes caches certain things. It can cache different dependencies. It can cache environment variables.
It’s not always clear where and why and how. Oftentimes you need to be pruning your containers and images and all of this. Every time you have to rebuild the image or push the image or pull the image, there’s a lot of data involved. It can slow down your deployment pipeline and your deployment process. It is worth keeping in mind that if you’re trying to do very rapid-fire deployments, this is going to slow you down a little bit. That being said, I think that the advantages you gain from how this streamlines the overall development process far outweigh those slowdowns, but it is worth keeping them in mind.
The next one won’t be the case if you’re using Azure Container Registry or AWS ECR, but in our case it made the most sense to use ghcr. One side effect of that, and of the fact that it uses basic auth, is that in Databricks the credentials are stored in plain text, as you can see here with the username and password. Hopefully there’s no one on your Databricks cloud that shouldn’t have access to this anyway, but it is a thing to be mindful of that this is in plain text. That could be a potential security issue.
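The plain-text storage inside Databricks is outside the user's control in the setup described here, but the same credentials can at least be kept out of source code and logs on the way in. The sketch below reads them from environment variables and masks the password before a cluster spec is printed; the variable names `GHCR_USERNAME` and `GHCR_TOKEN` and both helper functions are hypothetical.

```python
import copy
import os


def credentials_from_env(user_var="GHCR_USERNAME", token_var="GHCR_TOKEN"):
    """Read registry credentials from the environment instead of hard-coding them."""
    return {"username": os.environ[user_var], "password": os.environ[token_var]}


def redact_spec(spec):
    """Return a copy of a cluster spec that is safe to print or log."""
    redacted = copy.deepcopy(spec)
    auth = redacted.get("docker_image", {}).get("basic_auth")
    if auth and "password" in auth:
        auth["password"] = "***"
    return redacted


if __name__ == "__main__":
    os.environ.setdefault("GHCR_USERNAME", "example-user")
    os.environ.setdefault("GHCR_TOKEN", "example-token")
    spec = {
        "docker_image": {
            "url": "ghcr.io/example-org/spire:stable",  # hypothetical image
            "basic_auth": credentials_from_env(),
        }
    }
    print(redact_spec(spec))
```

This does not change how the credentials appear in the cluster configuration itself; it only limits where else they end up.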
Each usage of the image requires a pull of the container. That happens when you’re deploying, when you’re pushing your commits and this is integrated into the CI/CD, or in our case, when our Airflow tasks end up launching these clusters or the Databricks jobs. You can see for instance, here with Spire 3.3.1, just in a span of two months, there were over 200,000 pulls. This can grow quickly, and this is a lot of data. Now, there are ways to do cluster pools and maybe keep your clusters warm. There may even be ways to cache this. This is something where I think we’re going to need to talk to Databricks and see what our options are. At least at the moment, this is the case for us. It is worth keeping in mind that there is a lot of data involved in the pushing and pulling of these images.
With all of that being said, I want to stress again, the pros far outweigh the cons. This has greatly streamlined our dependency management and our package management, and it’s integrated really well into our CI/CD. It’s given us predictable runtime behavior across both our production deployment and our various Databricks jobs. It’s seamless with our testing. All of this has been really wonderful, and it’s impressive that we’re able to do this containerization with Databricks, but it is worth considering that there are still potentially going to be dependency synchronization issues. You could have basic auth security concerns if you’re using ghcr or something besides the recommended approaches. There’s overhead with Docker and the image sizes. All of these are things that you should keep in mind, but I think this is a worthwhile approach to running your production pipelines. That’s our presentation. Thank you very much.
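To put the pull count in perspective, here is a back-of-the-envelope estimate using only the figures quoted in the talk (over 200,000 pulls in two months, over two gigabytes per image). It assumes every pull transfers the full image and that two months is about 60 days; layer caching or compressed layers in the registry would lower the real number, so treat it as an order-of-magnitude sketch.

```python
def transfer_terabytes(pulls, image_gb):
    """Total data moved, in decimal terabytes, if every pull fetches the whole image."""
    return pulls * image_gb / 1000


def pulls_per_day(pulls, days):
    """Average number of pulls per day over the period."""
    return pulls / days


if __name__ == "__main__":
    pulls = 200_000   # "over 200,000 pulls" for Spire 3.3.1
    image_gb = 2.0    # "over two gigabytes per image"
    days = 60         # assumption: two months is roughly 60 days
    print(f"about {transfer_terabytes(pulls, image_gb):.0f} TB transferred")
    print(f"about {pulls_per_day(pulls, days):.0f} pulls per day")
```

Under those assumptions the total is on the order of hundreds of terabytes, which is why cluster pools, warm clusters or caching are worth raising with Databricks.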
Harin is a Machine Learning Engineer at Condé Nast where he researches and productionizes machine learning models in the domain of content and advertising recommendations. He has adopted cutting-edge...
Max Cantor is a Software Engineer of Machine Learning at Condé Nast. He designs and maintains machine learning platforms that scale to thousands of models and terabytes of data in a production enviro...