Databricks and Shell collaborate to simplify industrial time series data analytics on the Lakehouse

Developing a Time Series Lakehouse at Shell

Published: October 9, 2023

Energy / 4 min read

Written in partnership with Shell.

The energy industry is built on physical assets: terminals, ships and pipelines, refineries and wind farms. All of these facilities are monitored 24/7 by sensors that measure temperature, pressure, volume and so on. The readings are recorded as tags (each of which broadly represents a sensor) positioned within an asset hierarchy, and each tag records data at intervals ranging from every second to every minute. Shell has millions of these tags positioned all over our asset portfolio, each one creating a stream of time series data that has historically been siloed and often restricted to the small group of engineers operating the facility.

There is no shortage of data. It is stored in a variety of systems: mainly in historians, but also in alarms & events repositories, distributed control systems (DCS), process automation systems and other local databases. Getting access to this data, however, has proven difficult in the past. Not only are the systems usually located in heavily restricted environments within the process control domain, but they were rarely designed for large-scale data extraction and analysis.

By liberating this data and replicating it to the cloud, or making it accessible from there, we can use it to train machine learning algorithms on data at scale or run data aggregations at speed. For example, we trained machine learning models to compare insights on operating conditions, drawn from multiple data sources, for the same equipment across multiple assets globally.

Real-time data consumption has increased 15-fold over the last 5 years at Shell. To enable the use of this data at scale, Databricks and Shell worked together to develop an open-source, cloud-native framework that extends the Lakehouse to accommodate the global footprint of industrial time series data sources across Shell's environment. A major advantage of the Databricks Lakehouse is that the time series data can be enriched with additional contextual data sources, driving consistency in curating high-quality data products. In addition to the asset hierarchy information and work order history from the ERP system, Shell is integrating permitting system data, shift handover reports, alarms, events, IoT sensor readings and much more.

The data ingestion element of Shell's Sensor Intelligence Platform is called the Real Time Data Ingestion Platform (RTDIP). It ingests and queries high-volume historical and real-time data for analytics professionals, engineers and data scientists, making the data available in the public cloud. While originally developed for use within Shell, the project was open sourced through LF Energy in 2022 and has already proven popular with other customers: it is currently downloaded more than 25,000 times per month.

The Real Time Data Ingestion Platform (RTDIP) – overview of how it works

RTDIP focuses on time series data standardization and interoperability across the energy sector, enabling Data Science, Statistical, ML & AI capabilities such as Optimization, Surveillance, Forecasting, Predictive Analytics, Digital Twins and energy efficiency monitoring, all of which are essential to accelerating the Energy Transition. It has become a foundational element of Shell's asset monitoring activities and is deployed at facilities globally, including Renewable, Energy and Chemical Manufacturing, Integrated Gas Processing sites, Research and Upstream facilities. More than 3,000,000 sensors stream data continuously into the platform across Shell's global asset portfolio.

The platform has been the foundation for several differentiated solutions, such as digital twin technology, predictive maintenance, a global AI safety monitoring system and a real-time production optimization capability. To date, over 5 trillion measurements have been recorded in the system, and the solutions built on top of the platform are delivering cost savings and production increases. It has also attracted interest from sectors outside industrial energy applications, such as supporting EV charging or enabling manufacturing in agriculture.

RTDIP can be deployed either as part of an open-source Delta Lake architecture or on top of the Databricks platform. It provides the ability to ingest and query data through a simple open-source Python SDK and REST API, which facilitates rapid integration with the existing rich ecosystem of time series and industry-focused applications. RTDIP Pipelines, used for data ingestion, make connectivity to time series sources and data transformation simple thanks to their modular design. Being able to select sources, transformers and destinations allows plug-and-play development of production-grade time series ingestion pipelines. RTDIP also provides a rich array of powerful data querying capabilities, including resampling, interpolation, time weighted averages and circular averages (which are applicable to weather and wind turbines).
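The source, transformer and destination pattern described above can be sketched in a few lines of plain Python. This is only an illustration of the plug-and-play idea, not the RTDIP SDK: the `Pipeline` class, the stage functions, the record layout and the tag name `PUMP-01.TEMP` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, List

# Hypothetical types for illustration only; these are not RTDIP classes.
Record = Dict[str, object]
Stage = Callable[[Iterable[Record]], Iterable[Record]]


@dataclass
class Pipeline:
    """Chains one source, any number of transformers, and one destination."""
    source: Callable[[], Iterable[Record]]
    transformers: List[Stage]
    destination: Callable[[Iterable[Record]], None]

    def run(self) -> None:
        records = self.source()
        for transform in self.transformers:
            records = transform(records)
        self.destination(records)


def sample_source() -> Iterable[Record]:
    # Stand-in for a historian or streaming source (made-up readings).
    yield {"tag": "PUMP-01.TEMP", "ts": "2023-10-09T00:00:00+00:00", "value": "71.5"}
    yield {"tag": "PUMP-01.TEMP", "ts": "2023-10-09T00:00:01+00:00", "value": "bad"}
    yield {"tag": "PUMP-01.TEMP", "ts": "2023-10-09T00:00:02+00:00", "value": "71.9"}


def parse_types(records: Iterable[Record]) -> Iterable[Record]:
    # Transformer: convert strings to typed fields, dropping unparseable rows.
    for record in records:
        try:
            parsed = {
                "tag": record["tag"],
                "ts": datetime.fromisoformat(str(record["ts"])),
                "value": float(str(record["value"])),
            }
        except ValueError:
            continue
        yield parsed


def list_destination(sink: List[Record]) -> Callable[[Iterable[Record]], None]:
    # Destination factory: appends records to an in-memory list.
    def write(records: Iterable[Record]) -> None:
        sink.extend(records)
    return write


sink: List[Record] = []
Pipeline(sample_source, [parse_types], list_destination(sink)).run()
print(len(sink), "records written")
```

Because each stage only agrees on the shape of a record, a different source or destination can be swapped in without touching the transformers, which is the property the modular design is meant to provide.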

The Real Time Data Ingestion Platform has been optimized to run on Databricks. Because Databricks runs on multiple public clouds, RTDIP is multi-cloud by design. RTDIP Pipelines are tried and tested at global scale on the latest Databricks Runtimes and can be orchestrated using Databricks Workflows. Queries can be run directly in Databricks Notebooks, or from anywhere via Databricks SQL or Databricks Connect. RTDIP also capitalizes on the latest Delta Lake developments, including Liquid Clustering, which redefines how time series tables are designed for optimal query performance. The RTDIP Query Builder makes it possible to run any RTDIP time series query on any Delta time series table that has a measurement column, a timestamp column and a value column. The time series technology stack for industrial use cases is being redefined, and RTDIP is delivering real-world business outcomes today.
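To see why three columns are enough, here is a minimal sketch in plain Python of two of the query types mentioned earlier, a time weighted average and a circular average, computed over rows of `(tag, timestamp, value)`. It does not use the RTDIP SDK or its Query Builder. The row layout, the function names, the tag names and the step-wise weighting (each value is held until the next reading) are assumptions made for this illustration, and RTDIP's own definitions may differ.

```python
import math
from datetime import datetime, timedelta


def time_weighted_average(rows, tag):
    """Step-wise time weighted average: each value is held until the next reading."""
    points = sorted((ts, value) for name, ts, value in rows if name == tag)
    if len(points) < 2:
        raise ValueError("need at least two readings")
    total = (points[-1][0] - points[0][0]).total_seconds()
    if total == 0:
        raise ValueError("readings must span a non-zero interval")
    weighted = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        weighted += v0 * (t1 - t0).total_seconds()
    return weighted / total


def circular_average(rows, tag):
    """Mean of angles in degrees, computed on the unit circle."""
    angles = [math.radians(value) for name, _, value in rows if name == tag]
    if not angles:
        raise ValueError("no readings for tag")
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


# Made-up readings: a level held for uneven intervals, and a wind direction
# that straddles north.
start = datetime(2023, 10, 9)
rows = [
    ("TANK-01.LEVEL", start, 10.0),
    ("TANK-01.LEVEL", start + timedelta(seconds=10), 20.0),
    ("TANK-01.LEVEL", start + timedelta(seconds=40), 0.0),
    ("WIND-01.DIR", start, 350.0),
    ("WIND-01.DIR", start + timedelta(seconds=10), 10.0),
]

print(time_weighted_average(rows, "TANK-01.LEVEL"))
print(circular_average(rows, "WIND-01.DIR"))
```

In this example the level reading of 20 is held for 30 of the 40 seconds, so it dominates the time weighted result in a way a plain arithmetic mean would miss. The circular average of 350 and 10 degrees comes out at north (within floating-point error of 0, or equivalently 360), whereas an arithmetic mean would report 180 degrees, the opposite direction. This is why circular averages matter for wind and weather data.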

To learn more, please visit the LF Energy website.
