This session is a continuation of “Automated Production Ready ML at Scale” from the last Spark AI Summit Europe. In this session you will learn how H&M has evolved its reference architecture to cover the entire MLOps stack, addressing common challenges in AI and machine learning products such as development efficiency, end-to-end traceability, and speed to production.
This architecture has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment, and enables engineers to manage large-scale model training and model serving pipelines with full traceability.
The presenting team is currently responsible for ensuring that best practices and the reference architecture are implemented across all product teams, to accelerate H&M Group’s data-driven business decision-making journey.
Speaker: Keven Wang
– Hello, good morning everyone, nice to meet you here online. My name is Keven. I hope you and your family are doing fine in this special year. Today, I’m honored to present MLOps at H&M and how we apply it at scale. Let’s take a look.
This is the high-level agenda for today. First, I will talk about our AI journey at H&M. Next, I will talk about generation one of our reference architecture, which focused largely on the machine learning process and on how we orchestrate the machine learning training pipeline. In the end, I will talk about MLOps and how we leverage it to operationalize AI work.
You may have been to an H&M shop, or at least seen one somewhere. H&M is a global company with multiple brands. H&M’s vision is to bring affordable fashion to everyone in the world. In the last 74 years we have opened more than 5,000 stores globally in 74 different markets. In 51 of them, we also operate online. I see we just opened a new one in Australia. H&M is very good at scaling the physical retail model: pretty much go for the high street, and follow where the shopping bag goes.
However, around 2015-2016, we saw that our business model was under attack. On one side, given our size, it became more and more difficult to manage the business manually. We also saw a dramatic change in customer behavior, as well as competition from online. So we took on a mission to explore how we could leverage AI and data for business decision support. We partnered with some consulting firms, picked some fruitful use cases, and started proofs of concept. In 2017, those proofs of concept were successful, so we started to industrialize them and deploy them in different countries. We also realized the importance of data, so that is when we first started to establish our data platform. In 2017, we established the AI department as a function in H&M Group. It was a really big thing for us, because it was the first new function in H&M Group in the last 10 years.
We also started to define our ways of working and governance. 2019 was a very fruitful year. We recruited lots of people and organized ourselves as a product organization. We took over pretty much all the use cases from the consulting partners and ran them by ourselves. We also started to establish our reference architecture. In 2020, we took on a new mission: not only looking after our existing products, but also exploring generic capabilities, to see how we can further scale AI across the entire H&M.
Here is a high-level overview of our use cases. You can see they cover pretty much the entire H&M value chain, from design and buying to logistics, sales, and marketing. Besides these use cases, we also have four product teams focused on generic capabilities: knowledge capturing and best practice, AI exploration and research, rapid development, and last but not least, the AI platform.
Some quick facts about us. In our group we have more than 100 people working. One thing we are very proud of is that we have people of more than seven different nationalities working together. Each use case team is self-driven as a product team, and each is a multidisciplinary team: it has a product owner, data scientists, machine learning engineers, software engineers, data engineers, and business analysts. So the team can not only develop an AI product, but also maintain it and own it end to end.
About one year ago, my colleague and I presented our first-generation reference architecture to the world for the first time at the Spark AI Summit in Amsterdam. Let’s do a little recap before we move on to MLOps. When we started, each product team focused on value delivery, and delivered very fast. The other side of the coin is that lots of ad hoc technical decisions were made, and there was no central competence in terms of technical choices. This led us to a very fragmented architectural landscape across the entire group, which doesn’t help us scale the use cases further.
So, we took one step back. Even though each of the product teams is solving a different business problem, potentially using different techniques, the processes are very similar. At the top, you have a machine learning training pipeline with multiple steps. At the bottom, you have a model deployment pipeline, also with multiple steps. In the middle, you have a few key concerns. Automation: how to get an end-to-end feedback loop, and a fast one. End-to-end monitoring: monitoring not only your infrastructure, but also your model performance. Traceability: how to do proper model versioning, and even data versioning. If you start a new project without being aware of these common processes, you will tend to create entangled code and lots of technical debt. After modeling the process, we also picked the right tool for each of these processes, and that became our tool stack.
Reference architecture. We never designed the reference architecture out of the blue. Instead, we worked with product teams, trying to solve their daily pain points, and the side effect of that journey is our reference architecture.
The first team we worked with was a newly established AI product team. Since they were just starting out, they loved notebooks and interactive development for the speed. However, the trade-off is technical debt. So we helped them extract the complex logic in the notebooks into individual Python modules, so they could develop in a local IDE and write proper unit tests. Afterwards, we also introduced continuous integration and automation. In this way we get a good balance between speed and the amount of technical debt.
The second use case we worked with was a mature product. They need to train many different models, for more than 500 scenarios. There is one model per scenario, and each scenario represents a specific geographical location, like a country; a type of H&M product, like ladies’ T-shirts; and a specific time, like the autumn/winter season.
Each scenario also contains multiple steps. Together this is pretty much a giant computation graph, also known as a directed acyclic graph (DAG). Besides this, we want to leverage the best tool for each type of task. For example, for sourcing and preparing data, we want to leverage Spark for massively parallel processing. For feature engineering, model training, and optimization, the data is smaller, and a parallel algorithm may not exist, so we want to leverage Docker containers. So what we are really looking for is an orchestrator that is able to run this DAG and also leverage heterogeneous computation platforms.
So we chose Airflow, and more importantly, we deployed Airflow on top of Azure Kubernetes Service. Pretty much every task in your DAG becomes a single container pod. It will either call out to an external Spark cluster to finish a data preparation task, or run the computation locally inside the container. This gives us a great deal of elasticity. When a large DAG is running and the cluster is short of resources, new virtual machines will join the cluster, and once the DAG is finished, they will be recycled. We also leverage cloud-native services for the additional Airflow components, like Azure PostgreSQL Database and [inaudible]. In this way, the cluster becomes much more reliable and we can minimize the maintenance effort.
In addition, we also apply some tricks to solve Airflow’s dependency management issue. Often, when you introduce a new library into your task, add this task to your DAG, and import this DAG into Airflow, Airflow will trigger a cascading import, and in the end you also have to make this additional library available to the Airflow scheduler to make it work. This creates a dependency between your application, your DAG, and the Airflow infrastructure. How do we solve it? Well, people from the Java world have heard about reflection. There is a similar technique in Python.
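A minimal sketch of this reflection-style technique in Python, assuming the idea is to resolve a task's callable from a dotted string only when the task actually runs. The helper name `call_by_path` and the use of the standard-library `json.dumps` as a stand-in for a real task function are hypothetical, chosen for illustration; this is not the code shown in the talk.

```python
import importlib
from typing import Any


def call_by_path(dotted_path: str, *args: Any, **kwargs: Any) -> Any:
    """Import `module.function` lazily from a string and call it.

    The import happens inside this function, so a file that only
    references the dotted path never imports the task's libraries
    when it is parsed.
    """
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_path!r}")
    module = importlib.import_module(module_name)  # resolved at call time
    func = getattr(module, attr_name)
    return func(*args, **kwargs)


# Stand-in for a real task: json.dumps plays the role of the task function.
result = call_by_path("json.dumps", {"scenario": 1})
print(result)
```

In an orchestrator, a wrapper like this could be handed to a task as its callable, with the dotted path passed as an argument, so that only the environment that executes the task needs the task's libraries installed.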
So by leveraging reflection in Python, as you can see from the code here above, you can call a specific task in a DAG without actually importing it. I wrote a Medium post about it; please feel free to check it out, I put a link down below.
While we gradually solve problems for each individual product team, what’s the next step? We have an even bigger mission: how we can really scale and industrialize AI across H&M. One initiative we started this year is building our MLOps practice, which is a centralized way to run AI products. We have also started building our own AI platform based on this practice.
There are many key concerns in MLOps. For example, how can we keep model versions compatible? What does the model approval process look like? What is the format of your model: is it a pickle file? Or a pickle file plus the requirements.txt? How about all the data preparation steps? Maybe a Docker image would be a more suitable model format. What does your model deployment strategy look like? What type of model metadata do you want to keep track of? The Git commit ID? How about all the configuration parameters? And how about the model training data? I will try to address some of these in the rest of this talk.
At a high level there are three building blocks in MLOps: model training, model management, and model deployment. This reminds me of the concept of a message queue in computer science. Model training is like a message producer: it produces models. Model deployment is a message consumer: it consumes models for model serving. And model management is like the message broker. So we can leverage the same decoupling principle as in message queues for this problem. If we have well-defined interfaces between these three components, we can pretty much evolve each of them individually. This is very important, because machine learning and MLOps are really an emerging area. We want to stay current and leverage the best tool for each.
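The producer/broker/consumer analogy can be sketched as a small interface that both the training side and the serving side depend on. Everything here is an illustrative assumption rather than H&M's actual design: the `ModelRegistry` protocol, the in-memory implementation, the model name, and the artifact path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class ModelRecord:
    name: str
    version: int
    artifact_uri: str


class ModelRegistry(Protocol):
    """The 'broker': the only interface training and serving share."""

    def publish(self, name: str, artifact_uri: str) -> ModelRecord: ...

    def latest(self, name: str) -> ModelRecord: ...


@dataclass
class InMemoryRegistry:
    """Toy registry; a real one could be swapped in behind the same interface."""

    _store: Dict[str, List[ModelRecord]] = field(default_factory=dict)

    def publish(self, name: str, artifact_uri: str) -> ModelRecord:
        versions = self._store.setdefault(name, [])
        record = ModelRecord(name, len(versions) + 1, artifact_uri)
        versions.append(record)
        return record

    def latest(self, name: str) -> ModelRecord:
        return self._store[name][-1]


def train_and_publish(registry: ModelRegistry) -> ModelRecord:
    # Producer side: knows nothing about how models are served.
    # The model name and artifact path are made up for this example.
    return registry.publish("demand-forecast", "models/demand-forecast/run-001")


def load_for_serving(registry: ModelRegistry) -> str:
    # Consumer side: knows nothing about how models are trained.
    record = registry.latest("demand-forecast")
    return f"serving {record.name} v{record.version} from {record.artifact_uri}"


registry = InMemoryRegistry()
train_and_publish(registry)
print(load_for_serving(registry))
```

Because both functions only see the registry interface, the training pipeline or the serving layer can be replaced without touching the other side.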
So we picked a tool for each of these three components for now. For example, Databricks and Airflow/Kubeflow for the model training pipeline. For model management, we see MLflow as the most mature solution on the market today. For model deployment, our online model serving, Kubernetes is the de facto choice, and we also love Seldon. Besides this, we also want to include a system observability stack. We love Azure-native services, and Azure Monitor is a natural choice, but we also know that open-source tools like Grafana and Prometheus are great for Kubernetes-based applications. Application lifecycle management and continuous integration/continuous delivery are also crucial here: they are the glue that puts the MLOps practice together. Last but not least, looking at the whole MLOps stack, it is so complicated that we need infrastructure as code to be able to automate all the infrastructure setup.
Model development. As I mentioned in the beginning, we see two types of choices here, a more interactive one and a more automated one, depending on your product life cycle. If you are a new product with lots of interactive development, we prefer a Databricks-based architecture plus continuous integration and automation. This keeps the speed and manages the technical debt. However, if you are a more mature product with scaling and automation challenges, we think an Airflow- or Kubeflow-based architecture would be better for orchestrating your pipeline.
A few words about Airflow versus Kubeflow. Kubeflow has received a lot of attention in the last year, and I like some of its cool features, like artifact management, lineage, and Katib. Airflow, which came from Airbnb, has been around for many years, and since the introduction of the Kubernetes executor, I see Airflow as very similar to Kubeflow. So personally, if I were starting something from scratch, I’d probably pick Kubeflow. However, we already use Airflow extensively.
I don’t see a clear motivation to migrate from Airflow to Kubeflow.
Model serving. In software engineering there are established processes for deploying your product, or your model. For example, you can deploy a product or model straight into production, or you can do a shadow deployment: basically you deploy a new version of the model in parallel with the existing version, but it will not return its response to the end user. Because it also receives the traffic, you can measure how it performs. You can also do a canary deployment: basically you deploy the new version of the model and the existing model in parallel and split the traffic, like 50/50. That is another way to compare their performance.
In data science and machine learning we also talk about experiment strategies, for example the A/B test. Basically you divide your whole customer population into different experiment groups. A router then routes each user’s request to the corresponding model based on the experiment group, and you can evaluate each of the models. Multi-armed bandit model routing has also become quite popular recently. In this concept, the router will always try to route the majority of client requests to the best-performing model, for example model A here. It will also route a small portion of requests randomly to the other models. How does it keep track of the best-performing model? Well, a feedback loop. There is an external rewarding system that provides feedback to the router, so it can keep track of which model performs best.
For us, model serving, or online model prediction, is mostly just a REST interface plus a pickled model. You can see it as an inference graph like the one here. You may have some input transformation, and also output transformation. You may also have more than one model; here, two models are chained together. And for each of the models, you can use multi-armed bandit routing or A/B test routing.
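The bandit routing and feedback loop described above can be sketched with epsilon-greedy, one simple bandit strategy. The talk does not say which algorithm is used, so the choice of epsilon-greedy, the class name `EpsilonGreedyRouter`, and the model names are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional


class EpsilonGreedyRouter:
    """Route most requests to the best model so far, a few at random."""

    def __init__(self, models: List[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        if not models:
            raise ValueError("at least one model is required")
        self.models = list(models)
        self.epsilon = epsilon
        self.counts: Dict[str, int] = {m: 0 for m in self.models}
        self.mean_reward: Dict[str, float] = {m: 0.0 for m in self.models}
        self._rng = random.Random(seed)

    def best_model(self) -> str:
        # Ties are resolved in favor of the model listed first.
        return max(self.models, key=lambda m: self.mean_reward[m])

    def route(self) -> str:
        # With probability epsilon, explore: pick any model at random
        # (this may also pick the current best one).
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.models)
        # Otherwise exploit the best-performing model so far.
        return self.best_model()

    def feedback(self, model: str, reward: float) -> None:
        # The external rewarding system reports how a prediction turned out;
        # the running mean is updated incrementally.
        self.counts[model] += 1
        n = self.counts[model]
        self.mean_reward[model] += (reward - self.mean_reward[model]) / n


router = EpsilonGreedyRouter(["model-a", "model-b"], epsilon=0.1, seed=0)
router.feedback("model-a", 1.0)
router.feedback("model-b", 0.0)
print(router.best_model())
```

An A/B test router would differ only in `route`: it would map each request to a model by the user's fixed experiment group instead of by observed reward.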
Seldon supports this kind of advanced inference graph. That’s why we see it as a good fit for model serving.
Finally, let’s talk about model management and life cycle, and take a look at what the entire MLOps process looks like. At a high level, you have five stages: model development, back test, model approval, staging, and production. We start with development: pick up a new user story, create a feature branch, finish the coding, and make a PR. This will trigger a PR pipeline, which does mostly static code checks and security scanning. In the end it will also trigger a new CI pipeline, namely the training CI pipeline. The training CI pipeline will take the new code, deploy it into your model training pipeline, and train the model. In the end, this results in a new model in the MLflow model registry with the version stage dev. It will also trigger the back test pipeline. The back test pipeline leverages infrastructure as code to bootstrap a new infrastructure, deploys the new model there, and runs all the basic back tests.
Afterwards, a data scientist in your product team will log into Azure DevOps and examine the important KPIs of the model training pipeline and the back test results. Hopefully, he or she will approve the model. This will bump the model version up from dev to staging, and also trigger the automatic deployment of the model into the staging environment. Then, in the same way, another person in your product team, like your PO, will log into Azure DevOps, examine the model in the staging environment, and hopefully approve it. This will trigger another bump of the model version, from staging to production, and likewise trigger the automatic deployment of the new model into the production environment. As you can see, Azure DevOps Pipelines, this continuous integration and continuous delivery framework, acts as the glue that puts all your MLOps practice together.
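The promotion path from dev to staging to production, with a named approver at each gate, can be sketched as a tiny state machine. This is a plain-Python illustration under assumptions: the `ModelVersion` class, the `promote` function, and the approver labels are hypothetical, and no real registry or pipeline API is called.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Stage names follow the talk: dev -> staging -> production.
STAGES = ("dev", "staging", "production")


@dataclass
class ModelVersion:
    name: str
    stage: str = "dev"
    # Audit trail of (from_stage, to_stage, approver) for traceability.
    history: List[Tuple[str, str, str]] = field(default_factory=list)


def promote(model: ModelVersion, approved_by: str) -> str:
    """Move a model one stage forward, but only with a named approver."""
    if not approved_by:
        raise PermissionError("promotion requires a named human approver")
    index = STAGES.index(model.stage)
    if index == len(STAGES) - 1:
        raise ValueError(f"{model.name} is already in production")
    target = STAGES[index + 1]
    model.history.append((model.stage, target, approved_by))
    model.stage = target
    # In a real setup, this is where an automatic deployment into the
    # target environment would be triggered.
    return target


model = ModelVersion("demand-forecast")
promote(model, approved_by="data scientist")
promote(model, approved_by="product owner")
print(model.stage, model.history)
```

Keeping the approver in the history is one way to combine automation with a human in the loop: the deployment itself can be automatic, while each stage change stays attributable to a person.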
We love automation, but we also believe that for some critical steps it’s important to have a human in the loop.
Some takeaways I would like to share. MLOps is very complex. Instead of starting with the architecture, it’s important to take one step back and think about what kind of problem you are trying to address, what kind of process you have, and what kinds of roles you have in your team and where they are involved. Then start looking at the architecture.
Also, the tech stack of MLOps is very complicated. Pretty much the entire product team needs to manage not only the data science workload but also the infrastructure, like Kubernetes, Airflow, Kubeflow, Databricks, the monitoring solution, you name it. It is very difficult to set up such a team. So instead, we think it makes sense to centralize some of the key infrastructure, run it centrally, and then offer ML services to the product teams.
Last but not least, if possible, leverage cloud-native services from the beginning. This will give you the advantage of speed. Only when you feel that the existing cloud services cannot fulfill your needs should you start developing something new of your own.
This concludes my talk. Thank you very much. And don’t forget to provide feedback.
With 15+ years of experience, Keven has become a specialist in AI and data. Besides hands-on experience, Keven has also taken on various technical leadership roles, helping different organizations build AI and data capabilities and establish their tech foundation. Currently Keven is competence lead and AI architect at H&M Group, where he manages a group of machine learning engineers and is also responsible for engineering and architecture.