The Data Lake paradigm is often considered the scalable successor to the more curated Data Warehouse approach when it comes to the democratization of data. However, many who set out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando – Europe’s biggest online platform for fashion – we realized that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge – the data owners – while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh. The Data Mesh paradigm promotes the concept of Data Products, which go beyond the sharing of files and towards guarantees of quality and acknowledgement of data ownership. This talk will take you on a journey of how we went from a centralized Data Lake to embracing a distributed Data Mesh architecture backed by Spark and built on Delta Lake, and will outline the ongoing efforts to make the creation of data products as simple as applying a template.
Speakers: Max Schultze and Arif Wider
– Hello and welcome everyone to “Data Mesh in Practice” here at Data+AI Summit 2020. I am Max Schultze and I’m joined by Arif Wider today to talk to you about how Europe’s leading online platform for fashion goes beyond the data lake. But before we dive straight into the topic, let me give a short introduction about ourselves. As for myself, I’m a lead data engineer at Zalando, Europe’s biggest online fashion platform, and I’ve been with the company for about five years. During that time, I was mostly working on data infrastructure topics, specifically focusing on distributed storage, distributed compute and building a lot of data tooling and data services on top of that. Beyond that, I’m a huge gamer. I’m a big fan of League of Legends, I’ve been playing a lot of Animal Crossing recently, but most importantly, I probably spent my last decade traveling to Magic: The Gathering tournaments at several different levels of competition all over the world. So if there are any topics around gaming, feel free to chat me up after the talk. Arif, do you want to introduce yourself as well?
– Yes. Hi everyone. So yeah, my name is Arif. I’m a software engineer as well, but since very recently I’m also a software engineering professor at HTW, which is a Berlin-based university. Before taking this position as a professor, I was leading the data and AI business for ThoughtWorks Germany, and as a full-time consultant that brought me, among other clients, to Zalando and therefore to meet Max, which is ultimately the reason why we’re giving this talk here together today. I actually continue to work with ThoughtWorks in what we call a fellow role, but only on a part-time basis. Now, next to work, or actually also quite a lot during work, I am quite serious about coffee. So yeah, if you want to have a lengthy discussion with me about a topic that is not about software, then coffee is certainly something that I’m always happy to talk about. All right, and then I give it back to Max and let’s get going.
– Thanks for the great intro, Arif. So what can you expect from us today? Well, at first I will give you an introduction to Zalando’s data platform, the starting point of the journey that we actually went on here. Then Arif will give you a bit of an overview of what this data mesh is actually about, what is behind this concept and what its main pillars are. And then I will take you back on our journey to show you a little bit of how we were able to apply some of these principles in practice.

So starting off with Zalando’s data platform. When I am talking about Zalando’s data platform, I’m usually talking about three big areas. The first one is ingestion: how does the data actually get in? The second one is storage: how is the data actually persisted? And the third one is the serving layer, through which the data is eventually accessed: what are the possibilities to actually use this data and to derive information from it? When it comes to the first part, we’re talking about three main data sources and three main central data pipelines that we were maintaining at that time. We have a pipeline that gets data from our central Eventbus, something that was originally purposed for microservice-to-microservice communication but soon proved to contain a lot of very valuable information for analytics as well. We have a data pipeline that is collecting data from a legacy data warehouse that we still have. Yes, of course there’s always some legacy that you have somewhere, and even though there are big migration projects to actually get away from that, there is still a lot of very, very valuable information in there that is totally worth getting into the central data platform as well. And last but not least, we are also collecting behavioral data about the customers, how they are actually moving on the webpage of Zalando, and making this data available centrally to combine it with data like, for instance, sales information and make very valuable decisions on top of that.

When it comes to the storage side, the decision at the beginning was very simple for us: Zalando is mostly AWS based, which means that when looking for a very simple storage solution we went with exactly that, S3, as our main backend to store all our data. But when talking about storing data it’s not just about storing the data itself, but also storing data about the data – meta information, like for instance usage analytics of the data – to later on pass this along to the producers of the data as well. Lastly, on the serving side, we have a bit of a split model between analytics and transformational processing. On the analytical side, we are offering Presto as a distributed SQL engine, which mostly analysts but also a lot of non-technical users can use as an option to generate insights from the data that we store. On the processing side, we offer something that we call a Central Processing Platform, which is essentially Spark infrastructure as a service for the rest of the company. So we are providing Spark clusters – we are a big customer of Databricks here to work this out together – and providing infrastructure so that not every team has to do these things by themselves. And lastly, we are using Collibra as a data catalog, which is the main entry point for discovery of data and for learning additional things about the data before actually starting to use it.
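To make the serving side a bit more concrete, here is a minimal sketch of what consuming a shared data set on such a Spark-based processing platform could look like. The bucket path, table layout and column names are hypothetical illustrations rather than Zalando’s actual setup; the PySpark and Delta Lake APIs used are standard.

```python
# Minimal consumer sketch: read a Delta table that a producing team shared
# into the governed data lake and derive a simple daily metric from it.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-revenue-example")
    # Delta Lake support, assuming the platform ships the delta-core package
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical sales data set shared through the central data lake.
sales = spark.read.format("delta").load("s3://central-data-lake/sales_orders/")

daily_revenue = (
    sales.groupBy(F.to_date("order_timestamp").alias("order_date"))
         .agg(F.sum("order_value").alias("revenue"))
)
daily_revenue.show()
```

The point of the shared serving layer is that consumers work against one governed storage layer like this instead of maintaining their own copies of the data.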
Looking at this model that we have set up here, there’s one big challenge that it brings with it, and that is centralization. All of these components are maintained centrally, and this brings a couple of challenges with it. The first challenge that I want to introduce is that data sets provided by central data infrastructure teams lead to a lack of ownership. There’s a lot of data flowing through these central data pipelines that is archived into our data lake by default, but whenever somebody wants to use it, there’s actually a missing connection to the people that provide that data, that actually own that data. Secondly, we have the issue that data pipelines operated by central data infrastructure teams lead to a lack of quality. From an infrastructure perspective, I usually care about things like latency of my data, like throughput, like correctness on, let’s say, a data set level, but I will never look into the content of the data. How could I? We are talking about thousands of data sets here, we are talking about petabytes of data that we are actually processing, and it’s impossible for a central infrastructure team of a couple of engineers to really stay on top of all of that and understand the content of all of it. Which lastly brings me to the third big challenge: the moment your organization starts scaling, the central team immediately becomes the bottleneck. They will always be the first contact point for any user that actually wants to do something with the data, but also for any producer that wants to, let’s say, introduce a breaking change, because the producers don’t even know who their actual users are, and everything is always funneled through the central data infrastructure team, which immediately becomes the bottleneck. These are some of the challenges that we observed in the central setup that we were operating. Arif, can you maybe take it from here and tell us if these were challenges that just we were facing, or if this is something that you have also observed in other contexts?
– Yes, of course. So having been a consultant for the last couple of years, I was also part of several such data platform and data engineering teams, and I can indeed confirm that what Max just described is really a recurring pattern. I’ve seen those issues with data ownership and data quality again and again. If you zoom out a little bit, you can often describe the situation in the following way: there are three general parties, the data producers, the data consumers and the people who are building the central data platform. If we start on the left with the data producers, those are the people and the teams building the production services, and they’re often fairly okay with the situation. They do not feel the pain of lacking data quality that much unless they are data consumers themselves, because they are basically happily generating their data and they have their own incentives around their production services. Then on the right side, we have the data consumers, and they definitely already feel the pain of lacking data ownership and lacking data quality, because those are the data scientists and the decision makers that really have to try to understand what a certain field means and deal with those quality issues. But then when we look at the party in the middle, the people who build the central data platform, they are really the ones who are often in a very tight spot, because they are the ones who have all the responsibility to provide high quality data in a reliable fashion to all those data consumers. But at the same time, they do not have much control over the quality of that data, because they are not the ones who are ultimately generating it. So what I’ve seen very often is that the people in those teams are firefighting all day, trying to solve issues that have been introduced somewhere upstream. Overall, this is not a very good situation, not for the central data engineering team, but also not in general. Now interestingly, this situation appears no matter whether you are building a classical data warehouse or a more modern data lake, because the issue here is not technology but the centralization itself, or to be more specific, the central ownership. So if we look at this picture here, for instance, and let’s say we have on the left side an ever-growing number of data producers and on the right side an ever-growing number of data consumers, then the central data platform will always become a bottleneck. But why is this actually the case? One important reason is that such a centralized data platform with centralized data ownership is kind of a data monolith, I would say, that cuts through domains. So what do I mean by this? Let’s imagine we have a checkout service that is maintained by one of the data producer teams here on the left. They are building this checkout service, and this checkout service is generating checkout events which in the end end up in the central data platform. Now, when you look at this, you see that the knowledge about checkouts is scattered across different teams. And that also means that the responsibility is scattered, and that leads to friction, to misunderstandings, and ultimately prevents this whole model from scaling. Now, coming from those observations to the data mesh: what really is the data mesh?
The data mesh is mostly a paradigm about taking concepts and approaches that have been applied very successfully in the general software engineering domain for years, and now applying them to the specific challenges of the data engineering domain. Specifically, I want to go into three of those concepts, and those are, first, product thinking, second, domain-driven distributed architecture, and third, infrastructure as a platform. Now, before I go into those one by one, let me quickly refer to the original data mesh article by Zhamak Dehghani that you can still find on martinfowler.com, because Zhamak really was the one who coined the term ‘data mesh’ initially. And this article, which is really an excellent article, explains all the details of the data mesh paradigm in much more depth than what I can do in these few minutes here. So now let’s go through those concepts one by one, and let’s start with product thinking. I want to start with product thinking because I feel this is actually the concept that has the biggest potential to help with those ownership issues that Max talked about earlier. This is mainly about thinking of data as a product. So what does this mean? It means that you really think of a data set like a product on a market. And that means that you have to answer questions such as: what is my market? Who are my customers? And how do I even make sure that my customers know about my product? In order to tackle those questions thoroughly, you probably want to have a dedicated role for this, which can often be something like a data product manager, and such a data product manager should be part of a cross-functional team that is taking full ownership of such a data product. That means they are maintaining all the data pipelines, have all the knowledge et cetera to make all the value of this data set available to customers within the organization. Now, the second concept here is domain-driven distributed architecture, and applying this here means that each of those data products should really always capture one clearly defined domain. That means that the team building or owning that data product can really become domain experts in that domain, and that is the key idea here. And when you have such a team of domain experts, then such a data product can serve as the fundamental building block for building a mesh of such data products. There can be different kinds of data products: for instance, you can have more source-oriented domain data sets that are closer to where the data is generated, but you can also have so-called “aggregated domains” that consume other data products to create some higher value by building things on top of the existing data products. Now, looking at this kind of data mesh setup, it has certain similarities to a microservice architecture. And similarly to microservices, we as an industry developed a certain notion of what a microservice really is – for instance, that it needs to provide monitoring, needs to be independently deployable, those kinds of things – otherwise we wouldn’t call just any service a microservice. And similarly here, not every data set is a data product. In order to be a data product it needs to be discoverable, which means you need to be able to find it. It needs to be addressable, self-describing, secure and trustworthy.
And, more important than all of these other things, a data product needs to be interoperable, and this needs to be provided by an open standard, because only then can you build an ecosystem of those data products from it. Now finally coming to the third concept here, which is data infrastructure as a platform. First of all, the idea of providing data infrastructure as a platform is simply to avoid unnecessary duplicate effort, and this idea is not very new – it is what the big cloud providers have been doing for years. A key thing to make sure here is that this data infrastructure platform really stays a data infrastructure platform and doesn’t become a data platform. That means it needs to be domain agnostic. What does that mean? It means that if you are, for instance, a developer or an engineer on the team that is building the data infrastructure platform, and suddenly you need to understand some of the details of the data products that you are supporting with that platform, then you already need to have domain knowledge. And that means you’re already on the way to becoming a centralized data platform again, instead of being a domain agnostic data infrastructure platform. So, wrapping this up before I hand it back to Max, who can give you exactly that perspective of the data infrastructure platform, let me sum up that the data mesh is really much more of a mindset shift than a technological shift. It’s about going from centralized ownership to decentralized ownership. It’s about looking at data domains as the first-class concern and not at technical things such as pipelines. It’s about treating data as a product and not just as a byproduct, and it’s about building cross-functional domain data teams and an ecosystem of data products from this. And with this, I hand it back to Max, who will tell you more about the data infrastructure.
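To illustrate what “discoverable”, “addressable” and “self-describing” can mean in practice, here is a small, purely hypothetical sketch of the kind of metadata a data product team could publish to a catalog. The field names and example values are invented for illustration and are not taken from Zalando’s implementation or from the data mesh article.

```python
# Hypothetical descriptor that a data product could publish to a catalog,
# illustrating the properties named above: discoverable (name + tags),
# addressable (address), self-describing (schema), trustworthy (quality SLO)
# and interoperable (an open output format).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProductDescriptor:
    name: str                      # how consumers find it in the catalog
    domain: str                    # the clearly defined domain it captures
    owner_team: str                # the cross-functional team owning it
    address: str                   # e.g. an S3 path or table URI
    schema: Dict[str, str]         # column name -> type
    quality_slo: str               # what consumers can rely on
    output_format: str = "delta"   # open, standard format for interoperability
    tags: List[str] = field(default_factory=list)


checkout_events = DataProductDescriptor(
    name="checkout-events",
    domain="checkout",
    owner_team="team-checkout",
    address="s3://checkout-data-product/events/",   # hypothetical path
    schema={"order_id": "string", "customer_id": "string", "amount": "decimal(10,2)"},
    quality_slo="max 1h ingestion delay, completeness checked daily",
    tags=["source-aligned", "contains-pii"],
)
```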
– Thanks a lot, Arif, for this great introduction to the data mesh concept. What I now want to do is talk to you a little bit about how we started applying some of these concepts in practice: concretely, what are the things that we were able to achieve on the technical side, but also what were the changes that we made on the organizational side? And of course I also want to give you some numbers afterwards, backing this up with the adoption that we already see and the value we are generating here in our company. So first let me shortly recap where we were actually coming from. We were in this situation where the central team was always the bottleneck, but as Arif rightfully explained, we much rather want to get to a setup where we have data infrastructure as a platform, where people can use self-service tooling to actually do these things without the involvement of central data engineers. At the same time we want to get away from the central data monolith, from the central data platform, and towards interoperable services. Well, how did we actually start setting this up? The important part is of course that there were a lot of things we already had in place. We had central services that were already up and running. We already had central data sets – thousands of data sets, petabytes of data – and we already had them stored and well described, as well as a governance layer on top of that, which was managing things like data access and automated metadata collection. These were all things that we could leverage, expand in scope, and bring into a better global, interoperable setup. So the first thing that we started adding on top of that was a concept that we called “Bring Your Own Bucket”, or “BYOB” as we like to shorten it, which essentially describes a setup where teams now have the possibility to store data by themselves. They work in their own AWS accounts, they have their own buckets, they are now completely free to store data in their own buckets but share it through the same central data infrastructure that already exists – so to plug it into the governance layer that was already available, make the data available, and with that lower the barrier to actually bring data in. At the same time this immediately led to people picking up ownership of the data sets they were providing, because now that they store the data themselves, they much more immediately feel a responsibility for what they actually provide to others. The next thing that we went into was, well, we doubled down on the processing platform approach that we had already started before. The processing platform, for that matter, is central provision of infrastructure without actually knowing what the people use it for. Just to give you an example from the Spark cluster setup that we had: there were a lot of teams operating Spark clusters by themselves, and every new team that had a new use case was reinventing the wheel every single day just to figure out how to run and operate a Spark cluster, not even speaking about cost efficiency or anything like this. And this was a big point of centralizing this part and bringing it together into a team that then takes care just of the infrastructure part.
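As a rough illustration of the Bring Your Own Bucket idea, here is a minimal sketch of how a team-owned bucket might be opened up to a central governance account so that the data can be shared through the existing central infrastructure. The account ID, bucket name and policy shape are invented for this example; only the boto3 call and the standard S3/IAM actions are real.

```python
# Hypothetical BYOB setup: the team keeps the bucket in its own AWS account
# and grants a central governance account read/list access via a bucket policy.
import json

import boto3

TEAM_BUCKET = "team-checkout-data"            # hypothetical team-owned bucket
CENTRAL_GOVERNANCE_ACCOUNT = "111122223333"   # hypothetical central account ID

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCentralGovernanceRead",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CENTRAL_GOVERNANCE_ACCOUNT}:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{TEAM_BUCKET}",
                f"arn:aws:s3:::{TEAM_BUCKET}/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=TEAM_BUCKET, Policy=json.dumps(policy))
```

In practice the governance layer would additionally register the bucket in the data catalog and manage fine-grained consumer access, which a plain bucket policy like this does not cover.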
By now, if I have a Spark use case and I want to run it in production, I go to the central team via a template, describe to them what I need, and then I get my cluster and I can do with it whatever I want. We then went one step further and did not just stop at S3 buckets for contributing data, but started adding more and more things like Redshift and RDS Postgres, and not just for getting data out of them, but also for writing back, really expanding this infrastructure setup that we had. There are common use cases like people picking up data archived from the Eventbus, loading it into their Redshift cluster and using it for their own use cases, setting up all of this without the involvement of a single central engineer. These were some of the bigger changes that we made on the technical side. But now I also want to move over to the organizational side a little bit. So where were we actually coming from here? Decentralized ownership does not necessarily imply decentralized infrastructure, and that was one of our biggest learnings, because now we had decentralized storage, but we still had central infrastructure that was provided by central teams. We now have decentralized ownership, because through the decentralization of the data storage, people have much higher responsibility for what they are actually providing. Yet we still have the central governance layer, which takes care of the processes around data access and around automated metadata classification, for instance, because true interoperability is only created through the convenient solutions of a self-service platform. Now, going to this organizational side, just to recap again: we were in the situation where a central team was providing data, and that led to the problems of lack of ownership and lack of quality. We started addressing this by working directly with the teams who are actually providing the data to ensure the quality from that perspective. Instead of storing all the data sets by default, we went to a model where we asked people to actively opt in and make conscious decisions about what they should store, and to move to this behavioral change of treating data as a product, with dedicated people that take care of this data. At the same time we were looking into the usage of the data, because, well, how can you give the best incentives for caring about quality? You have to care about your users. And looking at the data sets that we had, around 70% of them were not used at all. There were some data sets that were used every once in a while; they could probably stay as they are. But there were also data sets – the golden data sets – that everybody was using and that the most value was actually generated from. Making people aware of this allowed them to dedicate resources to, on the one hand, understanding the usage of the data sets that they were providing, and on the other hand, to use these additional resources to ensure quality, expand the features and potentially even add new data sets afterwards. Backing this up with some numbers to show you the adoption and usage that we already have: just as rough context, we are currently working with around 200 teams in tech at Zalando, and out of these, 40 teams have already started adopting this Bring Your Own Bucket approach to share data.
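To give a flavor of what writing back into team-owned infrastructure can look like, here is a minimal sketch using Spark’s generic JDBC sink to push job results into a Postgres-compatible database. The connection URL, table name and credentials are placeholders, and a dedicated Redshift connector could be used instead of plain JDBC; this is not Zalando’s actual tooling, just an illustration of the write-back pattern.

```python
# Minimal write-back sketch: push a job's result into a team-owned
# Redshift/RDS Postgres table via Spark's generic JDBC sink.
# Requires the appropriate JDBC driver on the cluster's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-back-example").getOrCreate()

# A tiny example DataFrame standing in for a job's output.
results = spark.createDataFrame(
    [("2020-11-18", 12345.67)], ["order_date", "revenue"]
)

(results.write.format("jdbc")
    .option("url", "jdbc:postgresql://team-analytics-db:5432/analytics")  # placeholder
    .option("dbtable", "public.daily_revenue")                            # placeholder
    .option("user", "analytics_writer")                                   # placeholder
    .option("password", "example-secret")                                 # placeholder
    .mode("append")
    .save())
```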
And we got a lot of very positive feedback that simplifying data sharing like this allowed them to increase the productivity on their side. Over 100 teams are already using the processing platform, relying on central data infrastructure to take care of their part. And we even have the first curated data teams that build data products on top of other data products, to really get to this aggregated level of data products that Arif was already mentioning. Most importantly, we reached a point where we actually have zero operational effort for the central team. Well, not exactly, not yet. It’s a journey. We are still on it, we are striving to get there, but there’s of course still a long way to go. The amount of operational effort was dramatically reduced compared to what the central team had before, but there are still a couple of things that we need to do here. And because it’s a journey, I also want to give a very brief outlook on the things that we are currently tinkering with, which is something that I like to describe as “off the shelf data tooling”. This is really about providing blueprints of standardized solutions that people can just take and apply to their use cases. This contains things like decentralized archiving – giving people the option to make sure their data actually ends up in the storage that they maintain. This contains decentralized GDPR deletion tooling: once you store data, you are responsible for it, you need to make sure it’s compliant, and we offer you the tooling to actually take care of this. And lastly, this contains template-driven data preparation, where we really want to give people blueprints to take care of some standardized use cases. As I said, this is a journey that we are currently on, but interestingly, it is a journey that you can also join. All of the teams that are currently involved in the topics that we’ve been describing today are hiring. So if you have an interest in taking part in this journey and making it your own, please talk to us after the talk. With that being said, I want to close it out. This was “Data Mesh in Practice”, I’m Max Schultze, joined today by Arif, and together we want to thank you for your attendance.
Max Schultze is a lead data engineer working on building a data lake at Zalando, Europe’s biggest online platform for fashion. His focus lies on building data pipelines at petabyte scale and productionizing Spark and Presto on Delta Lake inside the company. He graduated from the Humboldt University of Berlin, where he actively took part in the university’s initial development of Apache Flink.
Dr. Arif Wider is a professor of software engineering at HTW Berlin, Germany, and a lead technology consultant with ThoughtWorks, where he worked with Zhamak Dehghani, who coined the term Data Mesh in 2019. Next to teaching, he enjoys building scalable software that makes an impact, as well as building teams that create such software. More specifically, he is fascinated by applications of Artificial Intelligence and by how effectively building such applications requires data scientists and developers (like himself) to work closely together.