Fluvius is the network operator for electricity and gas in Flanders, Belgium. Their goal is to modernize the way people look at energy consumption using a digital meter that captures consumption and injection data from any electrical installation in Flanders, ranging from households to large companies. After full roll-out there will be roughly 7 million digital meters active in Flanders, collecting up to terabytes of data per day. Combine this with the regulation that Fluvius has to maintain a record of these readings for at least 3 years, and we are talking petabyte scale. delaware BeLux was assigned by Fluvius to set up a modern data platform and did so on Azure, using Databricks as the core component to collect, store, process and serve these volumes of data to every single consumer in Flanders and beyond. This enables the Belgian energy market to innovate and move forward. Maarten took up the role of project manager and solution architect.
During this session, Maarten will talk you through the case and highlight the key takeaways that made Databricks the perfect fit for a true modern data platform.
Speaker: Maarten Herthoge
– Hello and welcome to the Data and AI Summit. My name is Maarten Herthoge. I am team lead for the data science and engineering team at the BeLux entity of delaware, an IT consulting company founded about 15 years ago here in Belgium that is now delivering global IT services. The idea behind this session is to give you an insight into our lessons learned when dealing with modern data use cases. The idea came from a type of workshop that I give quite often at customers, and I have tried to condense that workshop into the next 20 to 25 minutes. Now, instead of just talking about these lessons learned in theory, I will be applying them to an actual customer case that I personally designed, architected and guided to implementation as a project manager. So let's get started. When we think about modern data use cases, I always like to take a look at the market. I like to understand what vendors such as Microsoft, Databricks and SAP are telling me, but this year we did a survey of our own. We asked our customers: what are you doing with data? And we were actually quite surprised. There is a lot of ambition in the market, and over 90% of our customer base indicates that they are currently doing data-driven innovation. At the same time, we also noticed something else: a lot of customers are struggling to get to a certain level of maturity. We found out that just a little over three out of five companies could tell us that they get some form of measurable result from their data efforts. Now, a result can be a positive success, but it could also be a lesson learned from a failed project. When asked about success, not even one out of three companies was able to tell us, black and white, that their data efforts were successful. And a successful project could also be a proof of concept or an experiment that never makes its way into production. So the final question was: did your data efforts make their way into production?
Did you get true business value from it? And then not even one out of five companies could tell us that their data efforts made it all the way into production. Now, why is this number so low, and how can you become part of this 17%? To answer that, we first need to understand why so many data projects fail. And actually, this is quite easy to see. When we think about modern data use cases, we are confronted with a lot of technologies, concepts, platforms and vendors, and companies lose the global overview. They take in technology and they create what I like to call a data zoo. On average, we see that companies have seven different tools in house to do something with data. From my personal experience, I even had one customer who had 11 different tools licensed just for data visualization. Being confronted with all these technologies brings me to the first lesson learned: when you're doing a modern data use case, please start with the problem, never with technology. Technology is just a means to an end. If you don't have terabytes of data lying around, then Hadoop or Spark might not be the tool for you. If you don't have high-volume, multi-million-device streaming use cases, then Kafka or Spark streaming might also not be the right choice for you at this point in time. So please start with defining your business goals, defining your metrics for success, and then move on from there. Also, dare to call stop. It can happen that at a certain point in time you hit a roadblock, and you first need to fix this roadblock before you can come back to your problem. So, like I said, I will be applying these lessons learned to an actual customer case, and the customer that I have for you today is Fluvius. Fluvius is the network operator for electricity and gas in Flanders, and they actually unified the Flemish energy market. So Fluvius is the single responsible party for managing the electricity and gas grid here in Flanders.
Now, this gives Fluvius a unique position, but at the same time it also gives them a lot of responsibility. People are looking at them to innovate, to reach climate goals. And one of the things Fluvius understood they needed in order to get there is to become a neutral broker of energy data. To get to this data, they are installing the devices you see being installed here: the new digital meters for electricity and gas that are currently being rolled out in Flanders. Now, I know there's a lot of controversy around these meters. I'm not going to talk about that part, but they actually form quite an interesting case to build a data platform with. So let's take a look at what Fluvius wants to achieve. Fluvius wants to build a core data platform that can support a wide variety of use cases efficiently. This means that Fluvius wants to interface this data and be able to scalably share it, both internally and with external players. And like I said, this is a vision; it's just way too broad to start from. The first problem that Fluvius came to us with is that they wanted to create a customer portal where you, as an end consumer, are able to access up to three years of your consumption history. This is quite a well-defined case; everybody can imagine it. You go to a website, you log in, and you effectively get an overview of your usage history. This already delivers quite a lot of value, in the sense that suddenly you have millions of households that are able to access this data, verifying whether it's correct, verifying whether there may be an issue with their digital meter, and essentially building, together with you, on the data quality of your data estate. So this is the use case we will have to tackle. Let's take a look at the input side. And the input side is of course the digital meters, and these meters represent quite a lot of data.
First, simply because of the sheer number of meters that are going to be installed. After full roll-out, we're talking over 7 million IoT devices in the field in Flanders that we need to take in. And when I say a lot of data, I truly mean it. We're talking up to 12 terabytes of data, not per year, not per month, not per week, but every single day we can be confronted with 12 terabytes of data coming into our data platform. Now, combine this with the requirement that Fluvius is building a customer portal where you can access up to three years of your usage history, and you're looking at a data platform of up to 12 petabytes of data. It gets even more interesting in the sense that consumption data is of course personal data; it's person-bound. So we have this thing called GDPR, and also other strict regulations from the Belgian and Flemish governments that this platform has to adhere to. To summarize: Fluvius wants to create a core data platform that can ingest data from over 7 million IoT devices, up to 12 terabytes of data a day. Oh, and by the way, it has to be fully GDPR compliant. Now, when looking at a case like this, people with some experience might tend to reach for a golden hammer. You might like to go for the things you already know and try to tackle all these problems with the architecture and components you already have in place. This idea brings me to the second lesson learned: the 'golden hammer' approach does not exist, or only very rarely. When thinking about modern data use cases, think about creating a portfolio. Try to create a true turn-the-dial approach, where you have the flexibility to scale up or down and the flexibility to ingest multiple sources into your data estate. Now, what do I mean by this in practice? Let's take a look at the most common data flow I think there is, and that is the classic data warehouse flow, very well known.
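To put those figures in perspective, here is a quick back-of-envelope calculation using the numbers from the talk (12 TB per day as the worst case, three years of retention); it is a sketch, not an official Fluvius sizing:

```python
# Back-of-envelope storage estimate using the figures from the talk:
# up to 12 TB of meter data per day, retained for three years.
TB_PER_DAY = 12
RETENTION_DAYS = 3 * 365

raw_tb = TB_PER_DAY * RETENTION_DAYS  # total raw volume in TB
raw_pb = raw_tb / 1024                # convert to PB (binary prefix)

print(f"{raw_tb} TB ~= {raw_pb:.1f} PB of raw data")
```

This lands at roughly 13,000 TB, i.e. the "up to 12 petabytes" order of magnitude mentioned in the session, before accounting for compression or replication.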
You have your data sources, you do some ETL on them, you load the data into a data warehouse, and then you finally build some reports and analytics on top of it. Now, this classic data warehouse comes with some challenges nowadays. The first challenge is that we are confronted with an ever-increasing data volume; especially with the digital meters, we will capture quite a lot of data. And from my experience, a typical data warehouse is built very monolithically. This means you can scale up, but scaling out isn't that easy to achieve. The second challenge we see is that we find ourselves needing to collect and combine any kind of data. In the Fluvius case this is very important in the sense that today there is one hardware vendor for the digital meters, but in two years' time there can be multiple. The classic data warehouse flow does not always play very well with collecting all these different types of data and consolidating all these inputs. The third challenge we see is that people are moving away from a purely consumption view. They're moving away from being a data consumer to becoming a data explorer. You have data scientists within your company who want to take a look at the raw data, before it is cleaned and put into your data warehouse, to check whether you've missed something, whether there's still more value in your data that you aren't using today. The fourth challenge is across the board, and that is that real-time is becoming the standard. It used to be okay to have a nightly job and data in your warehouse that was maybe a couple of hours old; nowadays, if you click a button, if you ingest an event, you want to see the impact of that event immediately, across the board. Tackling these challenges one by one might be feasible with your current data warehousing setup. But if you look at all these challenges at the same time, then for one you get quite a messy slide, but this mess is also what's happening with your data warehouse.
So what I like to do in cases like this is take a step back. I like to think about what a data flow means conceptually: what different steps do you have in any data flow? Again, you have your sources on the left and your use cases on the right, and in between the different steps of your data pipeline: you ingest data, you store it one way or another, you process it, and you finally serve it back to your consumers. Now, if you map the Fluvius process onto this, then for the ingest part we see that we will need to be able to ingest from over 7 million IoT devices. For the store part: these devices will send us quite a lot of data, but not from day one. You don't install 7 million devices in 24 hours, so this volume will increase continuously. For the processing part: if your storage can scale, then for sure the processing should also scale seamlessly. You don't want to be confronted with having to review your infrastructure week by week just to cope with increasing data volumes. And then finally, for the serve part, Fluvius wants to build an API layer that can handle at least 1,000 API requests per second. So let's map out some technology, let's try to solve the Fluvius data pipeline, and let's start with ingest and store. When thinking about a use case like this, I know some of you are currently shouting at your screen, telling me: okay, just use a data lake. Yes, a data lake is a very solid technology, but what I would like to say about a data lake is: please avoid the data lake pitfalls. It's a very solid technology, a very solid concept, but at the same time, be aware of these pitfalls and try to get your data lake implementation to succeed. Now, what do I mean by this? Data lakes are often the go-to solution in cases like this because they have a lot of advantages. Number one being that they're designed to handle a massive scale of data.
Together with that design for massive scale, they also offer optimized performance on this data estate. Another major advantage of a data lake is that it offers you quite a lot of integration flexibility. You can pretty much store anything in a data lake; whether it's an image, an IoT message or raw data, you can just put it into your lake. And fourth, a data lake is quite a cost-effective way of storing a large volume. Now, the reality is, and I got this number from Gartner, that easily 60% of all data lake implementations fail. I even came across a video interview with some experts saying that this 60% is what people were telling them in black and white, but that in reality the number is easily 80 to 85% of data lake implementations that fail. Now, why is this? This is because every single advantage has an equally big pitfall. There is a reason we always refer to a data lake as something you can drown in, and this is also what happens quite a lot: the data lake becomes a data swamp, answering old questions, questions that are not relevant anymore today. Yes, we have optimized performance, but at the same time we also need the capabilities, the right best practices, to leverage this performance. It's quite easy to put something in a lake and get it back, but it's not always that easy to apply the right best practices. Yes, we have integration flexibility, but at the same time this also means we have schema on read. If different teams, different people are working on your data lake, then they need to understand the structure. They need to understand what's in your lake to make sense of it, or they will jump to the wrong conclusions. And yes, the data lake is very cost-effective, but this is also the major reason why the data lake is very often mispositioned as a strategy. A data lake as a technology is just a means to an end. Implementing a data lake will not solve your data problems.
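To make the schema-on-read pitfall concrete, here is a minimal sketch of the kind of guard each reader of the lake needs. The field names are hypothetical illustrations, not the actual Fluvius message format: raw files carry no enforced schema, so every consumer must validate structure before drawing conclusions.

```python
# Minimal schema-on-read check (hypothetical field names, not the real
# Fluvius message format). Raw files in a lake have no enforced schema,
# so each reader must validate a record's structure before using it.
EXPECTED_SCHEMA = {
    "meter_id": str,
    "timestamp": str,
    "consumption_kwh": float,
}

def validate(record: dict) -> list:
    """Return a list of schema violations for one raw record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

good = {"meter_id": "M-001", "timestamp": "2021-06-01T00:15:00", "consumption_kwh": 0.42}
bad = {"meter_id": "M-002", "consumption_kwh": "0.42"}  # missing timestamp, string value

print(validate(good))  # []
print(validate(bad))   # ['missing field: timestamp', 'bad type for consumption_kwh: str']
```

Without such an agreed contract, two teams reading the same raw files can silently interpret them differently, which is exactly how a lake turns into a swamp.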
If you put everything in your lake and don't do anything with it, you have just created one big sunk cost. Now, as long as we are aware of all these different pitfalls, we can actually make our data lake implementation succeed. So at Fluvius, we realized the advantages and we were aware of the pitfalls. Let's take a look at the next step in our pipeline, which is of course processing. When asking customers what they expect from a processing component, they always tell me it has to be fast, it has to be scalable. They never tell me that they want to focus on collaboration. Yet this is actually common sense. If you're building a huge data estate, if you're consolidating all your data company-wide into your data lake, and you have a processing component that's not able to enforce collaboration within your company, then again you only have one big sunk cost. So what does this collaboration aspect mean in practice? We want to aim at one unified analytics processing engine. In practice, again, you have some data sources on the left and some use cases on the right, and we need to fill in the blank to understand what requirements we want for this processing engine. The first bullet, like I already mentioned, is that we want a collaborative workspace. In your company you will have different personas. You might have a business analyst who knows your process very well. He needs to be able to communicate his wishes to a data scientist, who can help him with a proof of concept and experiments, and once successful, those two personas want to pass it on to the data engineer to put the experiments into practice, so that you get business value from your data efforts. The second bullet, and something that a lot of companies forget when doing data innovation: when you're introducing new tools, you don't want to forget what's already there. You have governance in place, and in any data pipeline you will have some ETL or ELT.
You will have jobs, you will have monitoring capabilities, maybe you have an ISO-certified change and release management process. So you also want to include some DevOps on top of it. Everything that you have built in your company with respect to governance and best practices, you want to bring along when you're introducing a new engine, new components into your tool stack. The third bullet is to have a solid runtime foundation. You want some data lake management from this engine, you want it to be serverless, to scale without limits, and especially you want to pay and scale as you go. If today you have one gigabyte and within a couple of weeks you have one terabyte, you don't want to pay now for a terabyte of capacity. You want to scale and pay together with your increasing needs, with your increasing use cases. Now, all these functional, non-functional and technical requirements at Fluvius we were able to tackle by introducing Databricks on Azure. Azure was the choice of Fluvius, in the sense that Fluvius pursued an Azure-first strategy, but Databricks as a technology stack on Azure, as a tool on Azure, was able to cover all the functional needs, all the technical needs that Fluvius had for their core data platform. So let's take a look back: how does the data pipeline at Fluvius look? We have the digital meters on one end and consumers on the other end. How will we fill in the blanks? Let's take it left to right. First, for ingest and store, we chose Azure Data Lake Storage Gen2 and combined it with Delta Lake functionality. Now, why is this approach successful? It means that we have one unified, consistent way into our data platform. It's also a very cost-efficient way of storing the full detail. These digital meters will send us checksums, message overhead, quite a lot of information, and we don't need that type of information in every single use case.
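As an illustration of how raw meter data might be laid out in such a lake, here is a sketch of a date-partitioned landing path. The folder structure is a hypothetical example (the actual Fluvius layout is not covered in the talk); the point is that date-based partitioning keeps ingestion append-only and lets downstream jobs read only the days they need:

```python
from datetime import date

def raw_path(meter_type: str, day: date) -> str:
    """Hypothetical date-partitioned landing path for raw meter readings."""
    return (f"raw/{meter_type}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

print(raw_path("electricity", date(2021, 6, 1)))
# raw/electricity/year=2021/month=06/day=01/
```

With this convention, a job that rebuilds one day of portal data prunes everything outside that single folder instead of scanning petabytes.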
Now, when we want to fill in the blanks for the serve component, a lot of people will try to reuse the components that they're using in the storage layer, and this doesn't always make sense. The ingest and store part is a usage pattern that is very heavily focused on writes, on getting your data into the data platform, while a serve use case is much more focused on reads. So for the serve use case at Fluvius, we went with an Azure SQL database as a caching layer and Service Fabric to develop our APIs. Now, the components themselves don't really matter; what I want to emphasize here is why this approach is successful. Why is decoupling storage from serve successful? Because by doing this, you can create workload-specific platforms tailored to a specific use case in your company. That can be APIs, data warehousing, machine learning. Decoupling these two means you can truly differentiate based on your various use cases. This does require, of course, that you fill in the blank of the processing component with something that can act as the glue between the two. And I basically already gave it away: for the processing step at Fluvius, we chose Azure Databricks. Now, why is this successful, why is this a good component for that data platform? Because it gives Fluvius a unified runtime for development, for experimentation, but also for collaboration on the data estate. In a lab environment, an experimentation environment, it provides us with a one-click setup. And because you are running Databricks on Azure, we also get quite an advantage from Microsoft: Microsoft takes in Databricks as a first-party service on Azure. This means that you get quite a lot of native integration with other Azure components. If you want to cover security with Active Directory, it's there.
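The decoupling of store from serve can be sketched as follows. This is a pure-Python stand-in, not the real implementation (which uses Spark jobs writing to the SQL caching layer): the processing layer precomputes read-optimised aggregates and pushes them into the serving store, so the customer-portal API never touches the raw lake.

```python
from collections import defaultdict

# Write-optimised side: raw readings as they land in the lake
# (illustrative records; not the actual Fluvius message format).
raw_readings = [
    {"meter_id": "M-001", "day": "2021-06-01", "kwh": 0.42},
    {"meter_id": "M-001", "day": "2021-06-01", "kwh": 0.38},
    {"meter_id": "M-001", "day": "2021-06-02", "kwh": 0.51},
]

# Read-optimised side: per-meter daily totals, keyed for O(1) lookups,
# standing in for the SQL caching layer behind the APIs.
serving_cache = defaultdict(float)
for r in raw_readings:
    serving_cache[(r["meter_id"], r["day"])] += r["kwh"]

def get_daily_consumption(meter_id: str, day: str) -> float:
    """What the customer-portal API would call against the caching layer."""
    return round(serving_cache.get((meter_id, day), 0.0), 3)

print(get_daily_consumption("M-001", "2021-06-01"))  # 0.8
```

Because the cache holds only what the portal needs, the serve layer can be sized for 1,000 requests per second independently of how fast the lake is growing.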
If you want to do secrets management or service principal management, you have Key Vault; it's there. If you want to connect with various data sources, SQL Database, Cosmos DB, Data Lake, all the connectors are there, provided for you by Microsoft. Now, again, these are all technological components. What does Fluvius get from this data portfolio? Well, by implementing this data portfolio, Fluvius got an effective way to drive innovation in the Flemish energy market. The first and most important part is that this data portfolio meant they were able to bring consumption insight, and more importantly consumption awareness, back to millions of households in Flanders. By implementing this data platform, Fluvius was also able to achieve a lower barrier for data- and market-driven change, not only internally at Fluvius, but also allowing external players, within legal framework agreements of course, to participate in innovating the Belgian energy market. There's one last part, and it's not really a functional one, but it's quite interesting to understand. When building a data portfolio, it will contain a lot of components, and to some people this might seem very complex. But we did an architecture study before we started the Fluvius project, and what we saw when trying to create a golden hammer approach, single-stack solutions covering the same functional needs, is that every single solution was easily 20 times more expensive in terms of TCO when compared to the data portfolio. So yes, it might seem complex at first, but in the long run it will pay itself back. Now, what I want you to remember from this session: please always start with the problem, never with technology. Technology is not a strategy; it's just a means to an end. Second, the golden hammer approach does not exist. Please focus on a portfolio approach, on a true turn-the-dial approach, so that you can differentiate.
Third, the data lake: a beautiful technology, a beautiful concept, but please be aware of the pitfalls, as they can make your project fail quite easily. And lastly, but certainly not least important, please put an emphasis, a focus, on collaboration. You are building a huge data estate and you want to be able to get value from it. The only way you can achieve this is by focusing on collaboration, allowing your entire company, all your people, to work together on your data estate. Now, if you're interested to learn more, please feel free to check out delaware.ai. This is our public-facing AI-branded website, where we publish on all topics around data-driven innovation, but where we also share some customer stories to give you inspiration for doing data-driven innovation at your own company. This leaves me with nothing more to say than thank you. Thank you for attending the session, and thank you Databricks for organizing the summit and for allowing me to speak. If there are certain things that you missed from the presentation, or if you liked it, please let me know via the feedback form. Your feedback is very valuable to us. We really like to understand where we can improve, and I'd also like to understand where I can improve to make this content even more relevant for you in upcoming sessions. Thank you, and enjoy the rest of the summit.
Maarten is working as team lead for the Data Science & Engineering team at delaware BeLux, next to taking up the role of (big) data and Azure cloud architect. He specializes in the domain of data & analytics with a strong focus on data science and data engineering. Combining functional and technical expertise in the Microsoft (Azure) ecosystem with a background in business economics, financial controlling and computer science engineering, he strives to turn data into an intelligent asset. As go-to-market lead for data engineering, Maarten has proven delivery experience in advising and managing complex data projects, ranging from operationalizing AI solutions to tackling multi-million-IoT-device or multi-TB use cases.