Scaling for MHHS: how Octopus Energy achieved a 50x cost reduction in margin data engineering

How a team of three engineers re-architected Octopus Energy's data pipelines to handle a 48x data volume increase - and cut costs by 50x in the process.

by Saad Ali, David Poulet, Daniel Taylor and Ismail Makhlouf

  • What it is: How Octopus Energy re-architected its margin data pipelines on Databricks to meet UK MHHS regulation.
  • The challenge: MHHS multiplies data volume 48x (two meter reads per household per month → 48 per day), projected to add ~$1M/year to pipeline costs under the existing single-grain architecture.
  • The outcome: Three engineers rebuilt the pipelines in three months, cutting cost per settlement date from $23.63 to $0.48 — 50x cheaper than the MHHS projection and 2x cheaper than the legacy system despite 48x more data. Delta Lake Change Data Feed drove a 98.8% reduction in rows processed (25B → 300M) and lifted freshness from weekly to daily; Databricks Serverless enabled the rapid iteration window.

The energy transition has a data problem

The UK's energy grid is in the middle of its most significant structural transformation in decades. As renewables like wind and solar take a larger share of electricity generation, intermittency becomes a first-class problem: energy is cheap when the sun shines and expensive when it doesn't.

The existing settlement model - built on monthly meter reads and averaged consumption profiles - cannot price that signal accurately. And if you can't price it accurately, you can't pass the signal to consumers, and demand never shifts to match supply.

Market-wide Half-Hourly Settlement (MHHS) is the regulatory response. Every household in Great Britain moves from two meter reads per month to 48 reads per day. That is not an incremental change. For a supplier like Octopus Energy serving over 8 million customers, it is a 48x increase in the data points driving every margin calculation, every settlement obligation, and every commercial decision.

The data engineering implication is direct: without re-architecture, the infrastructure cost of running Octopus Energy's margin pipelines was projected to grow by roughly $1 million per year.

Why throwing compute at this doesn't work

The instinct when data volumes increase 48x is to provision more infrastructure. For Octopus Energy's margin data team, that approach quickly proved untenable. The projected cost per settlement date under the legacy architecture was $23.63 - a 33x increase over the historical norm of $0.71. Multiply that across settlement windows, and the bill compounds fast.

However, the deeper problem was not compute cost - it was architecture mismatch. The legacy pipeline had been built around a single grain: monthly. Billing ran monthly. Settlement ran monthly. The entire pipeline was monolithic by design.

MHHS introduced a fundamental split. Industry cost data now arrives at half-hourly granularity - 48 data points per customer per day. Smart tariff customers with EVs and heat pumps need half-hourly revenue calculations. Standard tariff customers still settle monthly. Running all three through a single monolithic pipeline meant processing the entire dataset on every run, regardless of what had actually changed.

As Saad Ali, Lead of the Margin Data Team at Octopus Energy, framed it: "You can't just throw more compute at a problem like this. You have to rebuild and rethink your logic from the ground up."

The architecture: three streams, one source of truth

The team re-architected around three specialised streams, each optimised independently for its natural grain:

Settlement - Half-hourly granularity for regulatory settlement and cost allocation. Industry charges arrive at 48 data points per customer per day; this stream matches that grain exactly.

Half-Hourly - Half-hourly processing for smart tariff customers: EV drivers, heat pump users, and time-of-use products where the half-hourly price signal is the entire commercial proposition.

Monthly - Monthly processing for standard tariff customers, unchanged in grain but now reconcilable against the half-hourly data.

A "Job of Jobs" orchestration pattern manages dependencies and parallel execution across all three streams. Each stream is independently tunable - what works as a Spark optimisation for Settlement is not necessarily right for the Monthly (non-half-hourly) stream.
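
As a rough illustration of the orchestration idea (not the team's actual job definitions), the sketch below runs child jobs in dependency order, and in parallel where dependencies allow, using only the Python standard library. The job names, the shared `consumption` step and the `run_job_of_jobs` helper are assumptions made for this example; on Databricks the equivalent would typically be a parent job whose tasks each trigger a child job.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_job_of_jobs(
    jobs: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    max_parallel: int = 3,
) -> List[str]:
    """Run each job once all of its dependencies have finished.

    Returns the job names in the order they completed.
    """
    sorter = TopologicalSorter({name: depends_on.get(name, set()) for name in jobs})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    running = {}  # future -> job name
    finished: List[str] = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while sorter.is_active():
            # Submit every job whose dependencies are all complete.
            for name in sorter.get_ready():
                running[pool.submit(jobs[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any failure from the child job
                finished.append(name)
                sorter.done(name)  # unblocks jobs that depend on this one
    return finished


# Hypothetical stand-ins for the real child jobs: each one just records its name.
run_log: List[str] = []
stream_names = ("settlement", "half_hourly", "monthly")
jobs = {
    name: (lambda n=name: run_log.append(n))
    for name in ("consumption",) + stream_names
}
# Assumption: all three streams wait for a shared upstream consumption step.
depends_on = {name: {"consumption"} for name in stream_names}

completion_order = run_job_of_jobs(jobs, depends_on)
print(completion_order)
```

The three streams only start after `consumption` finishes, and they then run concurrently, so their relative completion order is not fixed.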

Underpinning all three is the upstream consumption layer: a unified, multi-grain source of truth consolidating meter reads, smart meter data, and industry flows at multi-terabyte scale. This layer is the reconciliation bridge between monthly billing and half-hourly settlement - and it became the site of the single highest-leverage optimisation in the project.
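
To make the "reconciliation bridge" idea concrete, here is a minimal sketch in plain Python that rolls half-hourly reads up to monthly totals and flags meters whose total disagrees with a billed monthly figure. The record layout, the `monthly_totals` and `reconcile` helpers, the tolerance and the sample values are all assumptions for illustration; the source does not describe how the real reconciliation is implemented.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

# Assumed record shape: (meter_id, settlement_date, period 1..48, kWh)
HalfHourlyRead = Tuple[str, date, int, float]
MonthKey = Tuple[str, int, int]  # (meter_id, year, month)


def monthly_totals(reads: Iterable[HalfHourlyRead]) -> Dict[MonthKey, float]:
    """Aggregate half-hourly kWh up to one total per meter per calendar month."""
    totals: Dict[MonthKey, float] = defaultdict(float)
    for meter_id, day, _period, kwh in reads:
        totals[(meter_id, day.year, day.month)] += kwh
    return dict(totals)


def reconcile(
    half_hourly: Dict[MonthKey, float],
    billed: Dict[MonthKey, float],
    tolerance_kwh: float = 0.01,
) -> Dict[MonthKey, Tuple[float, float]]:
    """Return (half-hourly total, billed total) for every key that disagrees."""
    mismatches = {}
    for key in half_hourly.keys() | billed.keys():
        hh_total = half_hourly.get(key, 0.0)
        billed_total = billed.get(key, 0.0)
        if abs(hh_total - billed_total) > tolerance_kwh:
            mismatches[key] = (hh_total, billed_total)
    return mismatches


# Made-up sample: two meters, one day, 48 periods of 0.25 kWh each (12 kWh per meter).
day = date(2025, 1, 15)
reads = [(meter, day, period, 0.25) for meter in ("M1", "M2") for period in range(1, 49)]
hh_totals = monthly_totals(reads)
billed_totals = {("M1", 2025, 1): 12.0, ("M2", 2025, 1): 10.0}
mismatches = reconcile(hh_totals, billed_totals)
print(mismatches)
```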

Incremental processing: 98.8% fewer rows

The naive approach to the upstream consumption tables - reprocessing the entire multi-terabyte dataset on every run - would have meant unsustainable compute costs at the new volume.

Delta Lake's Change Data Feed (CDF) made true incremental processing viable at this grain. Instead of complete overwrites, the pipeline now reads only records that have actually changed since the last run. The result: rows processed per run dropped from 25 billion to 300 million - a 98.8% reduction.
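
A minimal sketch of what a CDF-based incremental read can look like in PySpark is shown below. It is not the team's code: the helper names, the table name passed in, and the idea of storing the last processed table version between runs are assumptions. The `readChangeFeed` and `startingVersion` reader options and the `_change_type` column are part of Delta Lake's Change Data Feed.

```python
from typing import Any, Dict


def cdf_read_options(last_processed_version: int) -> Dict[str, str]:
    """Reader options that request only commits after the last processed one."""
    return {
        "readChangeFeed": "true",
        "startingVersion": str(last_processed_version + 1),
    }


def read_changes_since(spark: Any, table_name: str, last_processed_version: int) -> Any:
    """Return a DataFrame of rows changed since `last_processed_version`.

    `spark` is an active SparkSession; the table must have Change Data Feed
    enabled. Where the last processed version is stored is left to the caller
    (an assumption of this sketch).
    """
    reader = spark.read.format("delta")
    for key, value in cdf_read_options(last_processed_version).items():
        reader = reader.option(key, value)
    changes = reader.table(table_name)
    # Each update appears twice (pre- and post-image); keep only the new state,
    # plus inserts and deletes.
    return changes.filter("_change_type != 'update_preimage'")


print(cdf_read_options(41))
```

Two practical caveats: Change Data Feed has to be enabled on the source table (the `delta.enableChangeDataFeed` table property), and requesting a starting version beyond the table's latest commit can raise an error, so a real pipeline would check for new commits before reading.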

Data freshness improved from weekly to daily. For the commercial team, that shift means margin visibility at the grain where pricing decisions are actually made - every morning, not once a week.

Note: the ~$1M annualised cost-avoidance figure cited below excludes the additional savings from this move to incremental processing on upstream tables. The full efficiency gain is larger.

Spark & Delta optimisation - and what to remove

With 48x more data flowing through the system, the team applied targeted optimisations, each validated by measurement, across three categories:

Lineage and I/O reduction

  • Simplified lineage by consolidating data early in the pipeline, reducing downstream joins and shuffle operations
  • Data pruning: selected only the columns strictly necessary for settlement and pruned rows at the earliest possible stage, reducing I/O overhead before expensive transformations

Join and partition tuning

  • Broadcast joins for reference tables under 500MB, eliminating expensive shuffle operations on complex multi-key joins with date ranges
  • Liquid clustering, enabled across multiple tables on columns frequently used in filters and joins. It dynamically co-locates related records on the specified clustering keys without requiring fixed partition boundaries, avoiding the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning.
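
The pruning, broadcast-join and liquid-clustering points above can be sketched in PySpark roughly as follows. This is an illustration under assumptions, not the team's pipeline: the DataFrame arguments, column names (`tariff_code`, `settlement_date`, `valid_from`, and so on) and helper names are hypothetical. `broadcast`, `Column.between` and the `ALTER TABLE ... CLUSTER BY` statement are standard PySpark and Databricks SQL features.

```python
from typing import Any, Sequence


def cluster_by_statement(table_name: str, columns: Sequence[str]) -> str:
    """Build the SQL statement that sets liquid clustering keys on a Delta table."""
    return f"ALTER TABLE {table_name} CLUSTER BY ({', '.join(columns)})"


def prune_then_broadcast_join(reads: Any, rates: Any, first_date: str) -> Any:
    """Prune early, then join a small reference table with a broadcast hint.

    `reads` is the large fact DataFrame and `rates` a small reference DataFrame
    with a validity date range per tariff (both hypothetical).
    """
    # Imported lazily so the SQL helper above works without PySpark installed.
    from pyspark.sql.functions import broadcast, col

    # Keep only the needed columns and rows before any expensive operation.
    pruned = (
        reads
        .select("meter_id", "tariff_code", "settlement_date", "kwh")
        .filter(col("settlement_date") >= first_date)
    )
    # The explicit hint asks Spark to ship the small table to every executor
    # instead of shuffling the large one.
    small_rates = broadcast(
        rates.select("tariff_code", "valid_from", "valid_to", "unit_rate")
    )
    joined = pruned.join(
        small_rates,
        on=[
            pruned["tariff_code"] == small_rates["tariff_code"],
            pruned["settlement_date"].between(
                small_rates["valid_from"], small_rates["valid_to"]
            ),
        ],
        how="left",
    )
    # Drop the duplicate join key that came from the reference table.
    return joined.drop(small_rates["tariff_code"])


print(cluster_by_statement("margin.reads", ["meter_id", "settlement_date"]))
```

An explicit `broadcast` hint matters here because a 500MB reference table is well above Spark's default automatic broadcast threshold, so the optimiser would not normally choose a broadcast join on its own.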

Trusted the optimiser

  • In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job.

That last point bears emphasis: removing unjustified compute operations was as impactful as adding new optimisations. If you are running Z-ordering or ANALYZE without measuring their effect, they may be costing you more than they are saving.
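
One simple way to put "measure before you keep it" into practice is a small timing harness that runs two variants of the same step and compares median runtimes. The sketch below is plain Python with toy workloads standing in for real pipeline steps; the `median_runtime` and `compare_variants` helpers and the example functions are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def median_runtime(step: Callable[[], object], repeats: int = 5) -> float:
    """Run `step` several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def compare_variants(
    variants: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Time every named variant of a step under the same conditions."""
    return {name: median_runtime(step, repeats) for name, step in variants.items()}


# Toy stand-ins: the same result computed with and without an unnecessary sort.
def with_extra_sort() -> int:
    return sum(sorted(range(200_000), reverse=True))


def without_extra_sort() -> int:
    return sum(range(200_000))


timings = compare_variants(
    {"with_extra_sort": with_extra_sort, "without_extra_sort": without_extra_sort}
)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f}s")
```

When the step is a Spark transformation, the callable needs to trigger an action (for example a write or a count), because Spark evaluates lazily and an untriggered plan would appear to run instantly.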

Serverless as a development accelerator

Databricks Serverless made the three-month delivery window viable. Zero cluster startup time meant the team could iterate rapidly - write, run, measure, adjust - without waiting for infrastructure to provision.

The Serverless UI enabled side-by-side run comparisons, making it practical to isolate the effect of individual optimisations.

In the team's own words: "The testing and development process could not have been done without serverless. Using the serverless UI helped us to identify bottlenecks and make easy comparisons between different runs."

Results

| Metric | Before | After | Change |
|---|---|---|---|
| Rows processed per run | 25 billion | 300 million | 98.8% reduction |
| Cost per settlement date (projected MHHS) | $23.63 | $0.48 | ~50x reduction |
| Cost per settlement date (vs legacy) | $0.71 | $0.48 | 2x more efficient |
| Savings per month-end run | - | ~$83,000 | vs unoptimised projection |
| Annualised cost avoidance | - | ~$1,000,000 | excludes upstream savings |
| Data freshness | Weekly | Daily | 7x improvement |
| Build time | - | 3 months | Team of three |

The $0.48 per settlement date is not just a 50x reduction from the MHHS projected cost - it is 2x cheaper than the legacy system had ever been, despite processing 48x more data points. Re-architecture delivered regulatory compliance and made the system materially more efficient than the one it replaced.

What this means beyond energy

MHHS is a UK energy regulation. However, the pattern it represents - a regulatory or business event that multiplies data volume at a finer grain - is not unique to energy. Any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply.

Four transferable takeaways from the Octopus Energy experience:

  1. Grain misalignment is the hidden cost driver. When a pipeline processes everything at the finest grain regardless of business need, you pay for it in compute, freshness, and maintenance complexity. Identify the natural grains in your data and align processing to them.
  2. Incremental processing transforms pipeline economics. The 98.8% row reduction came from CDF-based incremental logic, not Spark tuning. Start there - and remember the full savings are larger than the headline figure.
  3. Remove before you add. Audit existing optimisation choices before assuming you need more compute. Z-ordering, ANALYZE, and custom shuffle logic applied without measurement may be costing you more than they save.
  4. Trust the optimiser. AQE outperformed hand-coded logic in multiple cases. Before writing custom optimisation, test whether Spark already handles your case.

The bigger picture

In the words of Saad: "By making our systems faster and more efficient, we can offer smarter tariffs that help our customers use energy when it's cheapest and cleanest."

The reduced cost base does something specific: it removes the economic barrier to high-frequency data processing. That makes grid balancing viable as a product. That makes smart tariffs commercially sustainable. That is how data engineering at scale connects to the energy transition - not as infrastructure overhead, but as the commercial foundation for it.

MHHS compliance was the mandate. Making sustainable energy the affordable option is the mission. The data engineering is what connects the two.

———

Saad Ali is Lead of the Margin Data Team at Octopus Energy. Ismail Makhlouf, David Poulet, and Daniel Taylor are Solutions Architects at Databricks.
