The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando – Europe’s biggest online fashion retailer – we realized that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge – the data owners – while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh. The Data Mesh paradigm promotes the concept of Data Products, which go beyond the sharing of files towards guarantees of quality and acknowledgement of data ownership. This talk will take you on a journey of how we went from a centralized Data Lake to embracing a distributed Data Mesh architecture backed by Spark and built on Delta Lake, and will outline the ongoing efforts to make the creation of data products as simple as applying a template.
– Hello and welcome everyone to Data Mesh in Practice. I’m Max Schultze and I’m joined today by Arif Wider to talk to you about how Europe’s leading online platform for fashion goes beyond the data lake. Normally, this would be the situation where I ask a question into the round and ask you to put your hands up if you actually know Zalando. But due to the remote circumstances that we’re in, I have to just leave you with some facts about Zalando as a company in general. So Zalando was founded as a startup about 12 years ago to sell shoes online. A pretty simple business model, but they started pushing pretty hard and advanced a lot to grow to become Europe’s leading
online platform for fashion right now. By now we have over 14,000 employees in total at Zalando, of which over 2,000 are working in tech alone, and this is exactly the scope of our internal data lake and also the data mesh that we will be guiding you through now. So a little bit about us first, who are we? I’m Max, I’ve been with Zalando for about 4 1/2 years now. I’ve been mostly working on the data lake internally at Zalando, building it up in the first place but also evolving it over time, and I’m still one of the people leading the effort at the moment. Arif over here is the Head of AI for ThoughtWorks in Germany, but he’s also a lead technology consultant who has been with Zalando for the last nine months. We’ve been working together on this concept and also on bringing it into practice. Just a couple of fun facts about us here as well: if you are very interested in coffee, you should absolutely have a session with Arif afterwards because he’s a huge coffee geek. And if you’re more into gaming, well, then I think you should have a follow-up session with me, because I call myself a retired, semi-professional Magic: The Gathering player.
What can you actually get from this session here today? What are we about to show you? First of all, I will walk you through the Zalando analytics cloud journey: where did we start off with our data analytics, where did we pass through, and where are we now? Then Arif will take over to give you a little bit of background on what this data mesh is that everybody’s talking about recently. And then I will follow up to show you some of the things that we have already put into practice, what we are currently working on, and what we are looking to work on next. When I’m speaking about the Zalando analytics cloud journey, I have to let you know that Zalando has been a data-driven company pretty much from the start. Data was always key to the success of Zalando. A lot of data was collected over time, a lot of decisions were based on data, and this is how Zalando became successful in the first place. When I’m talking about the Zalando analytics infrastructure, this is how everything started, from the infrastructure perspective of the company. Originally, the whole company was running in a datacenter. There was a huge shop monolith, which had a bunch of parts and services serving as its backend, and there was a big data warehouse to integrate all of the data for analytical purposes. At that time, because everything was running in the same datacenter, it was pretty easy to get connectivity to the backend databases, get a direct JDBC connection, fetch the information that you actually needed, and then pull that all together in the central data warehouse for analytical purposes but also for reporting.
This, at some point, hit the limitations of scale. Zalando wanted flexibility in terms of its infrastructure, but also wanted to scale as soon as possible. And this is when the company started moving to the cloud. This actually means we started building up a microservices architecture. All the microservices were now sitting in their separate environments. They were not directly reachable anymore, but they still needed a mechanism to communicate with each other. And that’s when we established a company-wide messaging bus. That’s a pretty common industry standard by now. We started having this as the primary communication channel between the different microservices. But at that time it was also no longer possible to just have this one big data warehouse and pull all this information from the source databases to do your analytics on. At that time, though, we also realized that all the data we were actually looking for was already in this central messaging bus. And that’s when we decided to add a secondary purpose to this messaging bus and make it the first big ingestion source for the Zalando data lake.
When I’m talking about the Zalando data lake, there are essentially three big areas that I’m talking about now. The first one, which I already mentioned, is the ingestion part. Here we are speaking about the big data pipelines that we have to actually feed the data lake. Then there is the storage part, which is, well, the backend where all the data is stored
but also the data about the data, the metadata. And then there is the serving layer on top of it, which actually allows people to run analytics. From the ingestion perspective, we have three major pipelines, of which one I already spoke about: the pipeline from the company-wide event bus, which archives all of the datasets flowing through that event bus in the data lake. The second big pipeline is connected to the legacy data warehouse. Over the last 10 to 12 years, a lot of information and knowledge was put into this data warehouse, and there’s a lot of data still flowing through there that is still very valuable in making decisions as of today. So for that matter, we still had to build a pipeline to copy many of these still very valuable data sources into the data lake as well. The last pipeline I want to mention is about web tracking data.
So here we are collecting behavioral data
about the users on the Zalando website which becomes especially valuable when, for instance, combining that with sales information.
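As a toy illustration of why that combination is valuable, the following sketch joins behavioral tracking events with sales records per customer. All field names and values are hypothetical, purely for illustration; the real pipelines obviously run at a very different scale.

```python
# Hypothetical example: enriching sales records with web-tracking events
# for the same customer - the kind of join that makes behavioral data valuable.
from collections import defaultdict

tracking_events = [
    {"customer_id": "c1", "page": "/shoes/sneaker-42", "action": "view"},
    {"customer_id": "c1", "page": "/shoes/sneaker-42", "action": "add_to_cart"},
    {"customer_id": "c2", "page": "/dresses/summer-7", "action": "view"},
]
sales = [{"customer_id": "c1", "article": "sneaker-42", "amount": 89.95}]

# Index the behavioral data by customer for a simple hash join.
events_by_customer = defaultdict(list)
for event in tracking_events:
    events_by_customer[event["customer_id"]].append(event)

# Combine each sale with the browsing behavior of the same customer.
enriched_sales = [
    {**sale, "touchpoints": len(events_by_customer[sale["customer_id"]])}
    for sale in sales
]
print(enriched_sales[0]["touchpoints"])  # customer c1 had two tracked interactions
```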
On the storage side, for us it was actually pretty simple
because Zalando’s almost exclusively using AWS. So we were actually looking for a simple storage solution and we went with exactly that, AWS S3, as the main data storage of our company.
But as I already mentioned, storage of the data is not just about the data itself but also about the metadata that you store with it. And this comes in all sorts of shapes: how big are my datasets actually? When were they last updated? But also more high-level information about how these datasets are structured or how they combine with each other. Lastly, we have the serving layer, which we essentially broke down into two major pieces. The first one is a fast query layer, which is mostly used for ad-hoc analytics and for people to directly get access to the data. We are using Presto here as our main execution engine; it runs all these ad-hoc requests and acts as the fast query layer. The second big part is data transformations. This is where we have a centrally operated processing platform running on Spark, provided by Databricks. Most of the teams who have an interest in changing data, in transforming data, but also in generating new datasets, will be using this infrastructure to do what they need through its central capabilities. Lastly, I want to mention that we also have a data catalog, which is the first entry point for exploration, for first identifying what datasets exist and how to use them. This whole process and this whole setup that I just described also brings a lot of challenges, especially around the centralization of these infrastructure parts. When we are looking at the usage of data, we usually have three big actors. The first one is the one who’s producing the data. Then there’s the infrastructure team in the middle, which is mostly us in this picture. And then there’s the actual user at the end.
So if the people who are producing the data are just sending events into a messaging bus, and then there is a central data pipeline that takes this information and puts it into a central archive, then from that perspective the producers of this data don’t actually feel strong ownership for these datasets.
Some of them for a long time were not even aware they were being archived in the first place. But then take it from the infrastructure team’s perspective: they also don’t have actual ownership
of the content of specific datasets. We are dealing here with thousands of datasets. We are really talking about 15 petabytes of data that we have stored. And that leads to the users being in the dark, because nobody’s really the owner of these datasets and somehow they need to figure out this information. At the same time, when the pipelines are operated by this, what I like to call a data-agnostic infrastructure team, there can also be quality issues in your data. Because we as the infrastructure team, we care about the pipeline running. We care that everything is working, and working at scale, with thousands of datasets being processed.
But we are not looking directly into each specific dataset to ensure the quality. And because, again, the owners, the actual producers of the data, don’t have this strong ownership, there can very much be a lack of quality in the data. And last but not least, this leads to a point of organizational scaling where the central infrastructure team, which is us most of the time, becomes the bottleneck. Because the producers don’t feel a strong connection to the archived datasets, the users always reach out directly to us when they have an issue, and we are sitting in the middle, trying to juggle everything and being in firefighting mode all of the time. These were some of the challenges that we were facing over the last months and years, and this is where I want to hand it over to Arif to generalize a bit from these.
– Yeah, thank you Max. So me being a consultant, I have been part of several teams building central data platforms. And the challenges that Max just described are a very recurring pattern; they are very common and I’ve seen them again and again. So if you take a step back, the pattern that you generally see is that you have those three parties that Max mentioned earlier, which is
on the left side, the data producers.
Those are often the product teams building the production services, and they are generating data and they are generally very happy. They have their own incentives and they are doing fine with this. On the right side, on the other hand, we have the data consumers. Those can actually be product teams as well, but often they are decision makers or data scientists building data applications. And they are already struggling quite a bit in such a situation with a central data platform, because they are already feeling those effects of the lack of ownership and the lack of quality when they want to get their value out of the data. But then in the middle, we have the data engineers building this central data platform, and they are really in a tight spot, because they have the responsibility to provide high quality, reliable data to the data consumers, but they are mostly firefighting all the time about issues that have been introduced upstream by changes from the data-generating teams. That means they need to solve issues that they are really not the domain experts for. And at the same time, they of course get all the complaints from the data consumers, and that is really not a great place to be in. Now interestingly, this situation appears no matter whether you’re building a classical data warehouse or a data lake setup. Because the reason here is
that this is not a technical reason. Instead, it really has to do with the central ownership of data. Quite simply, you can imagine: if on the left side here
the number of data producers is ever-growing, and on the right side the number of data consumers is ever-growing, then ultimately this central data platform in the middle becomes a bottleneck. Now what is the actual reason for this? The actual reason is that this central data platform, when it owns the data centrally, cuts through domains. So let’s imagine for instance that we have a checkout service here, and that checkout service is producing checkout events. Now the domain knowledge of that checkout domain is scattered between the team building the checkout service and the team maintaining the central data platform. And that means that the boundary of the data platform really cuts through the domain, and that creates friction and ultimately the inability to scale.
So coming from those observations, we’ve come up with this data mesh paradigm. And the data mesh paradigm really is mostly about applying techniques to the data domain and to data challenges that we have already applied successfully in the general software engineering domain. And the key concepts and key ideas here are to apply product thinking,
domain-driven distributed architecture,
and infrastructure as a platform to data challenges. And now, before I go through them one by one, let me quickly mention: if you have not already read the excellent article by Zhamak Dehghani, maybe have a look later today at Martin Fowler’s blog. Zhamak really was the first one to coin this term and to write this excellent article, and she describes all of this in much more detail than I can explain in these few minutes. So therefore let’s go right into those three main ideas. The first one is product thinking. That means to treat data as a product. And I’ll start with this one because I feel it’s the one
that really helps the most with those ownership issues. So what does it mean to treat data as a product? It means that you really think about things like what is my market,
who are my customers, right? But also it means how do my customers even get to know about my data, right? How do I do marketing about my data? And maybe even things like what is my USP? So how do I distinguish my data offering from other data offerings? And usually treating those things thoroughly means that you want to have a dedicated person going through them. So what you want to have is a data product manager or a data product owner. And that data product owner should be part of a cross-functional team with data engineers, software engineers, and a product manager who are building this data product as an autonomous offering that really delivers this value to their customers.
So now the second key idea here is to apply domain-driven distributed architecture. And that means that such a data product that I just talked about really focuses on one clearly-defined domain. So that’s the key idea here, to really focus on a domain. And that means that the people who are in the team building a data product can really focus on that domain-specific knowledge of that domain and become domain experts. And then you want to make those data products
the fundamental building blocks of a mesh. So you want to have many of such data products that are each focusing on one domain and then working together. You could imagine, on the left side here, several individual domains that are more focused on source data, so-called source-aligned domain data products. For instance, you could imagine web tracking data or something like this. And on the right side, you could also imagine more aggregated domains that focus on higher-level data, say, for instance, recommendations. And those domains, or those data products, would actually consume other data products.
Now this mesh of data products has quite some similarity to a microservice architecture. And so similarly to microservices, not every service that you build automatically is a microservice, at least not in the way that we understand it today. Instead there are certain requirements like SLOs and things you want to have in place. And this is the same with the data product. The data product only really qualifies as a data product if it is discoverable. You need to be able to find it. It needs to be addressable, it needs to be self-describing, secure and trustworthy. And most importantly, it needs to be interoperable. Only if it’s interoperable you can really create an ecosystem of data products from this.
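The qualities listed here can be made concrete as a descriptor object that every data product publishes about itself. The following is a minimal sketch; all field names are assumptions for illustration, not any actual Zalando schema.

```python
# Hypothetical sketch: a self-describing descriptor for a data product.
# A dataset only qualifies as a data product if the listed qualities hold.
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    name: str                                   # discoverable: listed by name in a catalog
    address: str                                # addressable: e.g. an S3 URI or table name
    schema: dict = field(default_factory=dict)  # self-describing: consumers can read it blind
    owner_team: str = ""                        # trustworthy: a team stands behind it
    access_policy: str = ""                     # secure: governed, auditable access
    output_format: str = ""                     # interoperable: a standard, shared format

    def qualifies(self) -> bool:
        """True only if it is discoverable, addressable, self-describing,
        secure, trustworthy, and interoperable - the qualities from the talk."""
        return all([self.name, self.address, self.schema,
                    self.owner_team, self.access_policy, self.output_format])

checkout = DataProductDescriptor(
    name="checkout-events",
    address="s3://checkout-team-bucket/events/",
    schema={"order_id": "string", "amount": "decimal"},
    owner_team="checkout",
    access_policy="pii-restricted",
    output_format="delta",
)
print(checkout.qualifies())  # True: all qualities are in place
```

Only descriptors where every quality is filled in would be admitted to the mesh; a bare file dump with just a name and a path would not qualify.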
And this is where the third key idea comes in, which is data infrastructure as a platform. First of all, infrastructure as a platform is not really a new idea.
The cloud providers are building those kind of self-service infrastructure platforms
for quite a while now.
And the main idea here is to avoid effort being duplicated. So ideally this data infrastructure as a platform enables the teams building those data products to build new data products quickly, and that’s also a great way to measure the effectiveness of this data infrastructure platform: how quickly other teams can build new data products. However, you need to be quite careful here that this data infrastructure platform doesn’t become a data platform. So you need to be careful that
this data infrastructure platform stays domain agnostic. So for instance, when you’re working on such a data infrastructure platform and you realize that you need to have some domain knowledge to solve a problem, that means you’re already again on the path to a monolithic data platform and this is something that needs to be avoided.
So now, before I hand it back to Max who can tell you more about what they do in this data infrastructure platform, let me quickly wrap up with the idea that the data mesh paradigm is really much more of a mindset shift than of a technological shift. And that means you want to go from central ownership to decentralized ownership.
You want to go from looking at technical things like pipelines to looking at domains as the first-class concern. And you want to go from having data as a byproduct to having data as the product. This means going away from siloed data engineering teams to cross-functional domain data teams, and ultimately going away from a centralized data lake to an ecosystem of data products. And this is where I hand back to Max. – Thanks Arif for this great introduction to the general concept and ideas behind the data mesh. Now I will, as he just said, go a little bit into what we actually did in practice. What are the parts of this concept, and what are the things that we were already able to implement on our side at Zalando? One thing I have to mention here, if you haven’t noticed it yet, is that
I’m mostly speaking from the perspective of the infrastructure team, right? That is the team in the middle, and that’s also why I will focus much more on the things that we could tackle from our perspective. First of all, a quick recap. We were coming from the situation where the infrastructure team was actually the bottleneck of the whole process. But we would much rather be in the situation where the infrastructure becomes the platform and it’s much easier to just use it without always having to involve the team. At the same time, we want to get away from this data monolith, but we still want to keep services interoperable. And this is what I now want to dive into a little bit: how we started evolving from what we originally had, this central data lake, towards the more decentral approach that we are now running on. First of all, we still have the central services that we started with. We still have the data lake storage, the centralized storage which we had from the very beginning. We also already had this metadata layer on top of that, or rather around that, which is governing this data not only by providing additional information about it, but also by allowing standardized processes around, for instance, how to get access to data and how to manage data in the first place. What we came up with was a concept that we called bring your own bucket. This essentially allows other teams to plug their S3 buckets, and the datasets they store in them themselves, into this central infrastructure. And this is where AWS actually makes it really nice to integrate: behind the scenes in S3 there is no actual physical distribution, it’s a logical distribution layer, and this helped us a lot to easily take S3 buckets from other teams and other accounts and add them to this central infrastructure layer.
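The bring-your-own-bucket idea can be sketched as a small registry: storage stays in team-owned buckets, while the central layer only records metadata and resolves logical dataset names to locations. This is a hypothetical illustration, not Zalando's actual API; all names are made up.

```python
# Hypothetical sketch of "bring your own bucket": teams register buckets
# they own with the central metadata layer; no data is copied, because
# S3's flat, logical namespace makes the bucket's physical location irrelevant.
class DatasetRegistry:
    """Central metadata/governance layer; the storage itself stays decentral."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset: str, bucket: str, owner_team: str) -> None:
        # Purely logical registration: the owning team keeps the bucket,
        # the central layer keeps the metadata and governance processes.
        self._datasets[dataset] = {"bucket": bucket, "owner": owner_team}

    def locate(self, dataset: str) -> str:
        # Consumers resolve a logical dataset name to its physical location.
        meta = self._datasets[dataset]
        return f"s3://{meta['bucket']}/{dataset}/"

registry = DatasetRegistry()
# A team plugs its own bucket into the central infrastructure.
registry.register("returns-events", bucket="returns-team-data", owner_team="returns")
print(registry.locate("returns-events"))  # s3://returns-team-data/returns-events/
```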
At the same time, we are now taking this data and using the same metadata layer, the same governance layer that we have around it, to apply similar processes around access management and around providing additional information about these datasets. The second big thing that we are applying here is that we still have the central processing platform that I mentioned earlier. This was the first part where we have really been successful in providing central infrastructure in a data-agnostic way. So here we really have infrastructure that people can use. We’re providing clusters, Spark clusters most of the time, and people can use these clusters without needing to care about actually operating the infrastructure, without needing to care about running and configuring them. But what they actually do with these clusters,
that’s totally up to them. And that’s the great part about this, because the infrastructure team doesn’t even understand, doesn’t even need to understand, what the users are doing with these clusters. They are fully focusing on the infrastructure itself and they are offering it as a platform that people can just use from the outside. And ultimately, our goal from this perspective was to simplify data sharing. This is where we wanted to allow people to use this governance layer, which we already had around the central and also decentral data storage, as well as the central processing platform, and to let you plug in whatever tool you actually want. This can be an RDS instance that you want to load data into, or Redshift, or whatever kind of system you want to use; EMR for processing, if you’re not happy with the central offerings or if you need additional capabilities that are not there. From that perspective, we are still providing central services, but they are also globally interoperable.
And the interesting thing here is that even though we are decentralizing the ownership, by making people responsible for the data they store themselves in their part of the system, decentralized ownership does not necessarily imply decentralized infrastructure. We are now in the situation where we have decentral storage but still central infrastructure. And even though we have decentral ownership, we still have a central governance layer that allows us to tie all these things together.
‘Cause after all, interoperability is actually created through the convenient solutions of a self-service platform. Now diving a little bit more into the behavioral side of things. Just a quick recap again: we were coming from a situation where there was this lack of ownership and this lack of quality, because the datasets provided through the central pipelines of a data infrastructure team that did not know anything about them lacked the direct connection between the consumers and the producers. And this is exactly what we wanted to tackle as well, by allowing people to make more conscious decisions. We came from a world where everything was archived per default; the data producers did not even have to know about these things. There were data producers that for a long time
were not even aware that their dataset was actually stored and archived, let alone being used by others. Exactly for that reason, we went with an opt-in strategy instead of storage by default, because when we force people to make conscious decisions, if they actually decide to store a dataset in the central archive, then they will also have some responsibility for it. Then there’s awareness, and that allows them to go much deeper into supporting this dataset. At the same time, we were also looking at the classification of our data usage. And this is where we realized there were tons of datasets which were not used at all. They were simply a liability. They were lying around, not only costing us money, but some of them even containing confidential information which then needed to get into scope for considerations like GDPR. Then there are some datasets which are used every now and then. They are probably fine as they are, and you will just continue using them as you do now. But then you will also figure out that there are these golden datasets, these high-value datasets which are used by tons of people, which are integrated into tons of processes, and on which a lot of really important decision-making is based. These are the ones which you now have the understanding to invest more resources into, to provide the best quality possible. And this was exactly where we also tried to foster behavioral changes for data producers, because, as Arif already said earlier on, data should not be a byproduct but data should be the product. And this is really interesting because we already had similar views before in the company. We had embedded BI teams who were sitting with certain domains, who were the domain experts, and they were the ones who actually had all the knowledge necessary to build a data product.
The only thing that we now need to do, and that we’ve been doing already for quite some time, is to take this setup and export it to the distributed infrastructure that we now have.
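The usage classification described a moment ago, splitting datasets into unused liabilities, occasionally used datasets, and golden datasets, can be sketched as follows. The thresholds and dataset names are made up for illustration; the real classification would be driven by actual access logs.

```python
# Toy version of the usage classification: bucket datasets by access counts
# into unused liabilities, occasionally used datasets, and golden datasets.
def classify(access_counts: dict, occasional_min: int = 1, golden_min: int = 100) -> dict:
    tiers = {"unused": [], "occasional": [], "golden": []}
    for dataset, accesses in access_counts.items():
        if accesses >= golden_min:
            tiers["golden"].append(dataset)      # invest here: best quality possible
        elif accesses >= occasional_min:
            tiers["occasional"].append(dataset)  # probably fine as-is
        else:
            tiers["unused"].append(dataset)      # a liability: cost plus GDPR scope
    return tiers

# Hypothetical monthly access counts per dataset.
usage = {"sales": 5400, "web-tracking": 230, "legacy-export": 0, "returns": 12}
tiers = classify(usage)
print(tiers["golden"])  # ['sales', 'web-tracking']
print(tiers["unused"])  # ['legacy-export']
```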
And ultimately, this also leads to the point where, if you have conscious decisions about the data products being made available, you also now have the knowledge and the data to dedicate resources to, first of all, understanding the usage of your data product, but also to ensuring its quality. ‘Cause ultimately, the contract between the consumers and the producers, that is the data quality.
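A minimal sketch of what such a quality contract could look like as code a producer runs before publishing. The check names, fields, and thresholds are illustrative assumptions, not Zalando's actual quality framework.

```python
# Hypothetical producer-side quality contract: publishing only succeeds
# if the dataset satisfies the guarantees its consumers rely on.
def check_contract(rows: list, required_fields: tuple, min_rows: int = 1) -> list:
    """Return a list of violated guarantees; an empty list means the contract holds."""
    violations = []
    # Completeness guarantee: the dataset must not be (near-)empty.
    if len(rows) < min_rows:
        violations.append(f"expected at least {min_rows} rows, got {len(rows)}")
    # Schema guarantee: every row carries the fields consumers depend on.
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            violations.append(f"row {i} is missing required fields: {missing}")
    return violations

rows = [
    {"order_id": "o-1", "amount": 19.99},
    {"order_id": "o-2", "amount": None},  # breaks the contract
]
print(check_contract(rows, required_fields=("order_id", "amount")))
```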
That was data mesh in practice, how Europe’s leading online platform for fashion goes beyond the data lake. I’m Max Schultze, and I was joined by Arif Wider today.
Max Schultze is a lead data engineer working on building a data lake at Zalando, Europe’s biggest online platform for fashion. His focus lies on building data pipelines at petabyte scale and productionizing Spark and Presto on Delta Lake inside the company. He graduated from the Humboldt University of Berlin, actively taking part in the university’s initial development of Apache Flink.
Dr. Arif Wider is a professor for software engineering at HTW Berlin, Germany, and a lead technology consultant with ThoughtWorks where he worked with Zhamak Dehghani who coined the term Data Mesh in 2019. Next to teaching, he enjoys building scalable software that makes an impact, as well as building teams that create such software. More specifically he is fascinated by applications of Artificial Intelligence and how effectively building such applications requires data scientists and developers (like himself) to work closely together.