Society depends on reliable utility services to ensure the health and safety of our communities. Electrical grid failures have consequences that range from daily inconveniences to catastrophic events. Ensuring grid reliability means fully leveraging data to understand and forecast demand, predict and mitigate unplanned interruptions to power supply, and efficiently restore power when needed. Neudesic, a systems integrator, and DTE Energy, a large electric and natural gas utility serving 2.2 million customers in southeast Michigan, partnered to use large IoT datasets to identify the sources and causes of reliability issues across DTE's power distribution network. In this session, we will demonstrate how we ingest hundreds of millions of quality measures each day from DTE's network of smart electric meters. This data is then further processed in Databricks to detect anomalies, apply graph analytics, and spatially cluster these anomalies into "hot spots". Engineers and work management experts use a dashboard to explore, plan, and prioritize diverse actions to remediate the hot spots. This allows DTE to prioritize work orders and dispatch crews based on impact to grid reliability. Because of this and other efforts, DTE has improved reliability by 25% year over year. We will demonstrate our notebooks and machine learning models along with our dashboard. We will also discuss Spark streaming, Pandas UDFs, anomaly detection, and DBSCAN clustering. By the end of our presentation, you should understand our approach to inferring hidden insights from our IoT data, and how to potentially apply similar techniques to your own data.
Jim Brown: Hi everyone. Thanks for joining our session. We're going to be talking about Improving Power Grid Reliability Using IoT Analytics. My name is Jim Brown. I'm Director of Data Science at Neudesic.
Minjie Yu: Hi everyone. I'm Minjie Yu. I'm a Senior Data Scientist at DTE Energy in Distribution Operations.
Jim Brown: Thank you, Minjie. If you're like me, you kind of take grid reliability for granted. I mean, just about every time I go to use power, it's always there. I turn the lights on, they come on. Every time I use my computer or charge something, I always have power. But a few months ago, during the COVID lockdown, I lost power for pretty much the entire day. I quickly realized how much of our modern lives come to a halt as soon as you don't have that power. I couldn't make coffee in the morning. I couldn't use my microwave. I couldn't use my electric stove. Everywhere I turned, nothing was ready to go. All I had left was my cellphone, which had about 20% power, plus my iPad and whatever other batteries I had. For me, it was really just about losing conveniences.
But if you live in Texas, you have a much, much different story with grid reliability. This last winter was so bad that many people went without power for a long period of time, and it actually became deadly in many circumstances. That points out that not only do we, as people and consumers, depend on grid reliability, we also very much depend on strong utility companies to deliver us that power. That's why I'm so proud to be working with DTE on such an important problem. Here's what we're going to be talking about today. We'll spend just a few more minutes on company introductions, and then we'll go over the business challenges DTE was facing. We'll describe the solution architecture we've developed, and then we'll do a demonstration.
Just a little bit more about Neudesic. We are a services consulting firm. Our mission is to help our clients land on the winning side of digital transformation. We've been around for about 20 years, and for about 10 of those years we've been doing utilities consulting; we have a utilities vertical. DTE is one of our primary business partners and customers, and we work with quite a few other utilities as well. We have a number of technical offerings ranging from data and AI all the way through hyperautomation and many different types of solutions. We use our data and AI platform accelerator, which helps companies onboard to Databricks and Delta Lake quickly so they can get started building models and solving their business problems. Next, Minjie is going to talk about DTE.
Minjie Yu: Okay. Thanks, Jim. First, let me briefly introduce DTE Energy to you. DTE Energy is a Detroit-based energy company. DTE Electric handles electric generation and distribution, serving about 2.2 million customers in southeastern Michigan with a system capacity of 11,084 megawatts. It is the largest electric utility in Michigan, and one of the largest in the nation. DTE Gas is engaged in the storage, transmission, and distribution of natural gas. We have about 19,000 miles of distribution and serve about 1.2 million customers in Michigan. DTE also has non-utility energy businesses focused on natural gas pipelines, gathering and storage, as well as power and industrial projects, and energy marketing and trading. You can see a map of Michigan here; our service territory is about 7,600 square miles.
DTE's aspiration is to be the best-operated energy company in North America, and every department, every division, all of us are focused on that. For us in Distribution Operations, we want to provide our customers with a more reliable and safer power future. Now, this is basically about the project we implemented. The main business challenge we faced is that the current strategy depends mostly on historical information at a high level of the system and mostly considers long-duration outages. It depends on the expertise of our engineers to know where the problem is, what to do, and how to manage it. The key aspect of our project is to combine a lot of holistic information and apply analytics. Then we can identify those small areas that are still at risk for reliability, and we define them as hot spots.
Then we can empower our engineers with information that is more timely and specific about the things they can do right now. Here is an example of two major reliability actions we need to decide on. Since fallen trees are responsible for nearly 70% of outages, we have a lot of tree-trimming crews out maintaining those trees. This is a huge investment for DTE. We have to decide whether we should send those crews and what they should focus on. Another main cause of service quality issues is aging equipment. Our engineers also have to decide which equipment is coming to the end of its life and which parts need to be replaced or just reinforced.
When an outage happens, for now, we have to rely on our engineers' expertise to decide what action to take to resolve the issue. This is really a big challenge and something really difficult for them to do. We think a lot of this is really a data problem. One of our model's purposes is to identify the cause of those reliability issues, whether due to trees or due to aging equipment; then we can give our engineers a good suggestion on where the problem is. Then they will know what resources are needed to solve it.
The first data source available to our project is the AMI meters. We have about 2.6 million AMI electric meters across our system. We get a ton of data from these smart meters every day, including usage and service quality data. Those AMI meters report usage data every 15 minutes for three-phase meters and hourly for single-phase meters. In the meantime, they also report service quality data: the outage information when an outage starts or ends. The system keeps track of all the outages and voltage alarms in real time. The meters also record voltage events whenever the voltage moves above or below standard thresholds, and we get the actual voltage readings every five minutes from those AMI meters. There are millions of data points from the smart meters streaming into our system every day.
We also have another set of information about how our equipment is connected together in a distribution network. We use NetworkX to develop this connectivity graph inside our solution. Since we want to search for hot spots, this connectivity graph is the distance we focus on, rather than geographic distance. The left graph here shows the electric system. Electricity is generated at the power plant, transmitted through the subtransmission system, and then goes to the distribution system. Our project's focus is the distribution side. We have the equipment chain from the subtransmission to the substation, and then on to the circuit. Between the circuit and the transformer, there are varying numbers of devices, such as fuses, switches, and reclosers. Then transformers are connected to meters, and the meter is the lowest level of the system.
The right graph shows an example of the connectivity graph. You can see that this one subtransmission is connected to three substations, and they then branch into several circuits. Between the circuits and the transformers, there are a lot of other devices. All the meters are attached to their transformers. The distance between two meters in our project is defined as the number of hops between them on this connectivity graph.
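To make the hop-distance idea concrete, here is a minimal sketch using NetworkX on a made-up equipment chain (all node names are hypothetical, not DTE identifiers):

```python
# Sketch: model the equipment chain as a NetworkX graph and measure
# meter-to-meter distance in hops (node names are invented).
import networkx as nx

G = nx.Graph()
# circuit -> transformer -> meter edges, illustrative only
G.add_edges_from([
    ("SUB_1", "CKT_A"), ("CKT_A", "XFMR_1"), ("CKT_A", "XFMR_2"),
    ("XFMR_1", "MTR_1"), ("XFMR_1", "MTR_2"), ("XFMR_2", "MTR_3"),
])

# Distance between two meters = number of hops on the connectivity graph:
# MTR_1 -> XFMR_1 -> CKT_A -> XFMR_2 -> MTR_3 = 4 hops
hops = nx.shortest_path_length(G, "MTR_1", "MTR_3")
print(hops)
```

Two meters that sit next to each other geographically can still be many hops apart if their paths run through different substations, which is exactly the distinction the connectivity graph captures.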
For example, if you look at the bottom of the graph, the last two meters are next to each other from a geographic perspective, but they actually come from different substations. They probably have very different power quality behaviors, and they are actually far away from each other in terms of the equipment of the system. This graph helps us better understand where the problem is and what the suspect equipment for the issue is. As you can see, we get a huge amount of information every day. All of that together makes finding insights difficult. We could look in many different places, and we could analyze different pieces of data in many different ways. How do we make this analysis useful so it can help our engineers understand what's going on? Jim will talk more about our solution architecture for this question.
Jim Brown: Okay, here's our solution architecture. This is simplified to let the main points come out. On the left, we start with the AMI meters: those are the smart meters that Minjie talked about. They're reporting three different types of events: we have information on outages, usage, and also low/high voltage events as they occur. Those are transmitted wirelessly over a mesh network; that's all part of the AMI infrastructure, which brings it all together and makes it ready for ingestion. That goes into Azure Event Hubs and then gets streamed using Spark Streaming into a landing area, which is a Delta table that holds all of that information. We have several Azure Databricks notebooks that go over that data, then summarize and enrich it so that we can use it to detect the hot spots and also the suspect devices.
An important piece of information is the OMS, the Outage Management System. When there's an outage and DTE sends a crew out, the crew records what they replaced and what the cause was. If a tree had fallen on a line, they would report that as the issue. If, for example, it was an equipment failure, they would report that and what device they actually worked on. All of that information gets put into what we call the AMI Summary information. That involves quite a number of notebooks and summarization techniques to get it ready for further analysis. Across the top there, you see the section on hot spot creation; this is where we take that AMI Summary information and do outlier detection on it.
We do a simple process using Mahalanobis distance; I'll show you that inside our demonstration when we start to look at this more. What we do is take those 2.6 million meters and take away the ones experiencing relatively normal reliability. We focus only on the extreme outliers. That ends up going into another Delta table of just poor-reliability meters, which is typically around 60,000 as we're finding right now. We combine that with the equipment chain, that device chain that Minjie was talking about. That's where we use the distance matrix to go into DBSCAN, which clusters the meters based on their network hops into groupings of hot spots. These hot spots are one of the primary outputs being generated.
Another path we take is with that device chain: we try to figure out, for these AMI events, what the source is and where it's actually coming from. This is something we're experimenting with, and we think it's going to be a really good contribution to the industry: being able to take, over a 24-hour period, all of these different momentary outages and voltage events and look at which devices are playing a role in a high percentage of them. In a typical day, as we've been looking at this, we see that just a handful of devices are involved in just about every one of them. This gives DTE the opportunity to have someone check this out and go work on it, potentially before failures occur. It may be an early warning that something is going to happen in the network.
We're really excited about that. Here, I'll flip over to the demonstration and show you how we create the hot spots. Now we're looking at a notebook that goes through the hot spot clustering piece. We pick up after the meters have already been put through outlier detection. The way that works is we first do some aggregations and basically score each meter on voltage events at different time intervals, as well as how many sustained outages and momentary outages have happened. All of that gets computed with Mahalanobis distance, and then we use a chi-square cutoff to determine which meters are experiencing unusually poor reliability. We experimented with quite a few different techniques on what would be the best way to make this happen.
A big part of what I always do is try to use the simplest method whenever possible. Even though Mahalanobis distance is a simple multivariate technique, it turned out to be a very good way to find these. We tried Isolation Forest, and it had just about the same results, but this was a bit more explainable, and that's why we selected it. Sometimes you'll see that we choose things for simplicity, to make this process a lot easier as we move through it. Here in Cmd 4, we're just loading the device chain. We pickle the graph so that it's easy to load back in and use in various notebooks. The next piece here, and I'm not going to show the code because it's just a lot of basic SQL, combines the different meters across the tables and comes up with the source.
The sources, you'll see in a little bit, are things like ties and trunks: the major distribution parts coming into the network. Once we have that, we start to compute our pairwise meter distance computations. This is a super resource-intensive task; it's really computationally intensive as well. We use Pandas UDFs to process this. You'll see this in quite a few of our examples: in just about every technique, we use combinations of SQL queries, applyInPandas, and Pandas UDFs. We use the persist method on the DataFrames, and we also use lots of temporary views. The reason we do that is that it may not be the most efficient way, but it is the most understandable way, because we're data scientists with strong SQL skills and strong Python skills.
What's great about Databricks is it gives us the ability to use what we're comfortable with and make that all a seamless, high-performance environment. When we compute these distances, we divide them up into manageable chunks so that we can apply them and have the tasks run across the different workers in the cluster. Here we do that with SQL. It uses a CROSS JOIN to get the Cartesian product of the meters: all combinations, grouped by the source node and slice. Because these are symmetric distances, we halve the computation with this statement, and up here we use monotonically_increasing_id in a MOD statement to give each row a slice assignment. You'll see down here what the source nodes look like, and this is the slice number. This gives us about 500 of these computations per batch. These batches are then spread out throughout the cluster and assigned to the worker nodes.
Here's how we use this. It's applied in the applyInPandas piece, right after the groupBy down here. I'll show you that first and then come back up. Here we use the SQL that I showed you a second ago, then we apply a groupBy on top of that, by the source node and the slice, and then we applyInPandas. Each task gets assigned one of those slices to process. That gets sent to this compute_distance method here. This function takes in that assignment as a Pandas DataFrame and then applies the function. This is from NetworkX; it computes the shortest path between those two meters. The resulting DataFrame that comes back out of that has the source node, the source, the target, and the distance that exists between those two.
For the rest of this demonstration, I'm just going to do it with 10, so that's 10 sources. That's what this looks like here. You can see that processing actually took about 45 seconds. One of the things we consistently do is apply persist at the end like this, and that lets Databricks manage the cache on its own. It will invalidate it if it needs to, but it makes this super easy: the computation is only done one time, and we get that speed boost anytime we access this temporary view.
Here are some more cluster performance considerations. What we found is that computing the distance matrix was the most resource-intensive task we had. The NetworkX shortest path function is called (n² − n) / 2 times, where n is the number of poor-reliability meters you're working on. The first time we tried this, computed using Python loops, it took over 10 hours. We gave up on that and decided to redo it using Pandas UDFs, and we went with that technique almost immediately. It turns out DBSCAN itself is super fast. We use DBSCAN because it is primarily a distance-based clustering technique: how far away things are from each other. It was a natural fit, and you also don't have to select how many clusters for it to produce; it selects that on its own. It seemed like the best algorithm for this situation.
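As a quick sanity check on that pair count, plugging in the 646-meter source shown later:

```python
# (n^2 - n) / 2 shortest-path calls for n poor-reliability meters
n = 646
calls = (n * n - n) // 2
print(calls)  # 208335 — the "just over 208,000" computations for that source
```

This is why the Python-loop version was hopeless at full scale and why spreading the slices across workers mattered so much.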
What we do is use SQL to pull back the result for that distance matrix in a way that can quickly be turned into a NumPy array; we use masking to make that happen. This top part here is just taking the view we created before and sorting it so that it naturally aligns with NumPy. Here you see we're creating that triangular mask, filling a matrix with zeros, and then using the mask to apply the distances that came back. When we take this matrix, take its transpose, and add them together, we get the full distance matrix. As you can see, this is the biggest one that we have in this batch of 10.
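A small worked example of that masking trick, with invented hop distances for a 4-meter group:

```python
# Sketch: rebuild the full symmetric distance matrix from the half set
# of pairwise results using a strict-upper-triangle mask.
import numpy as np

n = 4
# Pairwise hop distances in row-major order over the upper triangle:
# (0,1),(0,2),(0,3),(1,2),(1,3),(2,3) — values invented for the example
upper_distances = np.array([2, 4, 6, 2, 4, 2], dtype=float)

D = np.zeros((n, n))
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strict upper triangle
D[mask] = upper_distances

# Adding the transpose mirrors the upper triangle into the lower half
D = D + D.T
print(D)
```

Because hop distance is symmetric and the diagonal is zero, filling only the upper triangle and adding the transpose reconstructs the full matrix from half the computations.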
It's 646×646. That's a fairly large distance matrix to compute, and it takes just over three and a half seconds to complete. This is the main way we achieve performance; from there we decided to just sequentially move through those different source nodes and process them individually. That's what this looks like right here: the DBSCAN clustering method. We're basically iterating over that list again. We compute those distance matrices and then apply DBSCAN to each. We don't want to look at hot spots with fewer than 10 meters in them, so that we can be efficient in what we apply our remediation to. Here you see it just keeps track of that and builds up our hot spots.
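Sketching the clustering call itself on a toy precomputed matrix (eps and min_samples here are illustrative, not DTE's tuned values; the talk's 10-meter minimum would translate into a larger min_samples or a post-filter):

```python
# Sketch: DBSCAN over a precomputed hop-distance matrix.
import numpy as np
from sklearn.cluster import DBSCAN

# Toy symmetric distance matrix: two tight groups far apart
D = np.array([
    [0, 1, 1, 9, 9],
    [1, 0, 1, 9, 9],
    [1, 1, 0, 9, 9],
    [9, 9, 9, 0, 1],
    [9, 9, 9, 1, 0],
], dtype=float)

# metric="precomputed" tells scikit-learn D already holds distances
labels = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)
```

DBSCAN never needs the number of clusters up front; meters within eps hops of enough neighbors coalesce into a hot spot, and isolated meters come back labeled -1 as noise.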
It processed all 10 of these. Here you can see that one, TIE 810, that we were looking at before. It had just over 208,000 computations to do to put its distance matrix together. Considering all of these together, it finished the entire task in about 23 seconds. What we've found is that when we run this on the full set of data, it finishes in about 45 minutes. In less than an hour, we're able to do all of this computation and identify which meters have fallen into these hot spots. This last command shows the results of the hot spots and graphically lays them out for us. The length of each bar is the number of meters inside a hot spot.
You can see this one is 132 meters, and that's hot spot 5. Then there's hot spot 7 with 91, and you'll see quite a few others over here. That large source had quite a few groupings of hot spots that need to be remediated. Now I'll flip back over and show you an example of what that looks like. Here's what the results actually look like. We zoomed out because we want to protect DTE customers' privacy; we're not zooming in too close, so you can get a feel for what this is doing without violating their privacy.
The different colors you see represent the hot spots the meters were assigned to. With that blue one on the bottom, it's not hard to imagine that those dots are actually following the power lines. All of those dots represent meters we identified as having unusually poor reliability. They've been assigned to a hot spot grouping because of their distances from each other. If we can identify the sources and causes for this and the right remediation to take, it should help improve power for all of these customers at once. At the bottom, you see an example of how we aggregate that information. Once these hot spots are identified, we look at all of the Outage Management System information and combine it to show us the top causes.
Was it trees? What type of equipment was failing? We start to give as much information as possible about these hot spots. The actual dashboard shows much more information than this; you can hover over it and dig into all the detail. What this does is give the engineers what they need in order to make decisions. You'll notice that even though we have 2.6 million meters, all of those normal meters have fallen away. The only things left are the really impactful areas that we call hot spots, where the engineers should be focusing their attention, because these are the ones experiencing the poorest reliability. That's pretty much it. Please provide feedback to us; it's really important. I just want to thank you so much for joining our session and listening to our solution. I hope this helps you in your solutions as well.
Jim Brown is the Director of Data Science at Neudesic, and leads the delivery of complex data science projects at our customers. He is skilled in several areas of machine learning, such as, computer v...
Minjie Yu, works at DTE Energy in Detroit, MI as a Data Scientist, responsible for collecting, reviewing, and structuring business data for interpretation. She performs quantitative and predictive in-...