For each drilling site, there are thousands of different equipment operating simultaneously 24/7. For the oil & gas industry, the downtime can cost millions of dollars daily. As current standard practice, the majority of the equipment are on scheduled maintenance with standby units to reduce the downtime. Scheduled maintenance treats each equipment similarly with simple metrics, such as calendar time or operating time. Using machine learning models to predict equipment failure time accurately can help the business schedule the predictive maintenance accordingly to reduce the downtime and maintenance cost. We have huge sets of time series data and maintenance records in the system, but they are inconsistent with low quality. One particular challenge we have is that the data is not continuous and we need to go through the whole data set to find where the data are continuous over some specified window. Transforming the data for different time windows also presents a challenge: how can we quickly pick the optimized window size among the various choices available and perform transformation in parallel? Data transformations such as the Fourier transforms or wavelet transforms are time consuming and we have to parallelize the operation. We adopted Spark dataframes on Databricks for our computation.
Here are the two major steps we took to carry out the efficient distributed computing for our data transformations:
– Hi, everyone, thanks for joining this session, my name is Varun Tyagi, and I’m here with my teammate Daili Zhang. We’re a data scientists at Halliburton, which is an oil and gas services company, located in Houston, Texas.
And today we’re gonna talk about how we use Databricks efficiently to help build machine learning models for predictive maintenance. And before we get into the topic, we’d like to introduce our team first.
So we’re part of a department called Halliburton Digital Solutions. And the main function of ATS is to support and consolidate digital transformation across all product service lines, which we call PSLs. And so at Halliburton, we have about 14 PSLs, varying from drilling to cementing, to health and safety and to completion. And our digital solutions group provides common platforms and architectures for data warehousing and governance to analytics development, to business intelligence reporting. And, we streamline and consolidate various processes across, all the product service lines. And we also provide and build a strong talent pipeline or software and digital development within each PSL.
Dalli and I, were in the data science team within HPS. So our roles are to develop analytics and machine learning models to improve operational efficiency, increase productive uptime, reduce operational costs and provide insights at the right time to the right people to help them make business level decisions.
So at Halliburton, we follow the basic analytics development life cycle with some variations of specific to our use cases. So we have thousands of rigs all over the world and each rig contains lots of tools, to quit with various sensors that are collecting information in real time. And we also have large amounts of data stored in SAP and other SQL databases as well. And over the years, we’ve collected petabytes of data in different file formats like PDFs, CSPs, and then parquet formats as well. For each project, we first identified the different data sources and then collect and integrate the data into the same platform. And once we have ingested the data, we spend majority of our time cleaning, aggregating, and transforming the data. We then perform feature engineering with the help from domain experts. And then we train and test our models and we select models based on different metrics. But most of the time the metrics are chosen, such that they’re directly related to the economic impact. And we then deploy the models and show the results to a related personnel. And after the models are deployed, we monitor their performance to see check for any drifting. And then we diagnose and retrain the models if needed. No doubt the model training and testing is an essential part of the cycle, but in reality, it takes less than 5% of our time. And so we spend the majority of our time with the data ingestion and cleaning process. So what kinds of data do we have? We have three main types of data, first is the operational data, which includes a historical data and real time data from age devices. And most of the historical data is stored in what are known as ADI files. And ADI is just a proprietary format to Halliburton.
And for one of our product service lines, we have more than 500,000 API files, and each file is about three gigabytes in parquet format. So in total we’re dealing with about 1500 terabytes of data and we’re getting more and more, real time data from edge devices (murmurs) so this volume is growing fast.
The second type of data is the, is hardware configuration data, maintenance data, and some other event related data that comes from the field. And those types of data are normally stored in a SQL databases. And so for example, we have collected over, 5 million maintenance records in less than two and a half years.
We also use external data when it’s available, for example, we use weather data, geological and geophysical data when needed.
So in summary, we don’t lack data and we’re getting more data on a daily basis, but we do sometimes lack data with good quality and particularly historic data. But with the edge devices and a more rigorous data collection process, we’re starting to get a better, good quality data.
So here’s an example of a use case for predictive maintenance. The objective of this particular project was to reduce, annual maintenance costs by 10%, through a field operation optimization, based on avoiding failure modes or transmission assemblies. So the picture shown here, is a one of our fracking trucks in the field and the transmission of the truck is located in the middle and it connects the engine to a high pressure pump. And the role of the transmission is to transfer energy from the engine and drive the high pressure pump, which pumps the liquids into the drilling well. And so it’s a very expensive piece of equipment.
So for this project, we have to combine or marry the operational data and the configuration and maintenance data in a consistent way. In this process, we faced a lot of challenges, the operational data that we collect it’s sometimes comes in different sample of frequencies, varying from one Hertz to a thousand Hertz. So we either have to down sample or over sample different datasets and conduct synchronizations between the various sensor data and also between non sensor data like maintenance records.
For our maintenance related data, we have a lot of free text inputs. So we have to use natural language processing techniques to extract the information that we need. There’s also a lot of manual effort in coming up with, programming logic to pin down, which records belong to which equipment IDs, since the records come from various sources and each source may not have a consistently, formatted equipment identifier. And by the nature of the equipment being used, they’re quite frequently moved from site to site and job to job, and each job may last for a couple of or more days, but within each job the equipment is constantly starting up and shutting down. So we need to account for the transients and steady state modes of equipment as well. So we faced a lot of missing data and a lot of erroneous data from faulty sensors or bad user inputs quite frequently.
Thankfully we have data breaks, which helps us process the huge amount of datasets quickly and in a scalable way. And currently we’re running on data breaks, premium workspace, and everything is transparent to the rest of our team members. We share and collaborate with different notebooks, different utility functions, and go back and forth with different versions of notebooks through version control as well. We also leverage the advantages of Delta Lake to speed up querying and going back to certain time shots and that type snapshots of the data as well.
We also leverage pandas user defined functions, quite a lot, and pandas UDS they’re executed by Apache arrow to exchange data directly between the JVM and the Python driver and executor’s, and it comes with near zero C realization or exceed realization costs. And it’s vectorized operations have helped us, gain about a hundred times speed improvement over, ordinary high spark user defined functions for certain calculations. We’ll show a simple example during the demo on one of our use cases for Panda’s UDF functions, where we fill in the note values in the data frame with either the previous valid value or the next valid value.
For this case, a PI Spark user defined function took about, four to five hours, but with the pandas-udf, it took about six to seven minutes, and this was a huge improvement for us. No one wants just to wait for four to five hours, for per piece of character run, just waste time and computational resources.
For this project, we did a lot of feature engineering, we combined different features using different formulas based on domain expertise.
So for our projects, we’re dealing mostly with time series data. So we use a lot of techniques from signal processing. The example shown above, we use Windowed Fourier Transforms to obtain frequency domain features for different time windows. And as I mentioned before, the equipment that we study, they’re moved from job to job and site to site. And so they constantly, turn on and off, and load up different variable at different rates and shut down. So the first thing we did is to select sets of high load windows based on a threshold. So for example, we could use like a certain threshold for engine RPM and above that threshold, we would consider that a high load. So we selected windows of data based on those thresholds that contain continuous data.
In our original type series of spark data frame, each row corresponds to one time point and each column is a feature. So we were able to use the collect list function to collect all the time values and the corresponding data within each time window into an ordinary, and after this operation, the data for each time window gets collected and transformed to a single row in this part data frame. Then we’re able to perform the Welch Fourier transform on that data for each window and transform it into the frequency domain. Then we can select the peak values from the frequency compositions for each window, and then use those as features into the machine learning models. Frequency domain features like this helped us improve our model accuracy. So I will now hand it over to Daili to describe the machine learning process. – Thanks Varun. So after we finished all of the harder work lik data (murmurs) and data aggregation and the feature engineering and we trained, and tested the model with various methods, we tried the next Spark ML, and we tried deep learning, we tried Azure AutoML and XGBoost and Sklearn etc.
And we evaluated the models with various metrics, such as (murmurs) recall rate and F1 score and accuracy. So in the end, normally we picked the model based on its economic impact. So the table on the right, shows some comparisons among different ML packages that we explored. So as you can see from the table, so the deep leaning which use a LSTM on a GPU virtual machine, it took a days to finish one training. And Azure AutoML is relatively fast. and the provide a slightly better results for this specific use case. So in the end, and for this project, we realized, so basically the different models does not make any significant differences, but the different features and get into the models have far more impact.
So for this project, and we used the PowerBI to visualize the results and to show that to the end users. So we modularized the whole process from extracting the data and performing the data (murmurs) and data aggregation and creating features and calling them all through, to perform predictions and then writing the results and back to blob storage into different notebooks. So the notebooks are orchestrated to run through data breaks, notebook and workflows, and based on the business requirement and the way normally scheduled on the job to run every 24 hours. So we’ll show a demo in the end about how we utilize the (murmurs) or workflows to run the notebooks.
After we deploy the model, and we do monitor the model performance over time, to check if there is any model performance drifting. So what we did is, we store the predictions into blob storage continuously, and we also stored the actual results from the field into the blob storage continuously as well. So we show the discrepancy between the predictions and the actual results, along the time in PowerBI. So some alerts are set to send emails or text messages to our team based on some preset thresholds. So, and once our team receives the notes and we will investigate the model on drifting and may retrain or redeploy the models. So model management is a big part of the whole process, especially when we’re getting out more and more models.
So I’m proud to use an email flow, we manually wrote the model specific information into a CSV files and stored the models and its corresponding environment and running information into a blob storage with certain name conventions. That actually takes a lot of time, and also cause some confusions and inconsistencies over time, especially if we’re getting more and more models. So, MLflow offer the bad data breaks, it’s greatly simplifies the process with consistency and quality control. So in the end, we will show a demo about how we utilize MLflow to manage the models.
So that’s it, Anna will show the demo. – [Ann] For the demo, I want to share three items which either improve the performance and run the code faster or speed up the development process. So before I show the detailed code, I wanna show the configuration of the cluster first. So the cluster is called signal process, so if you go to your configuration page, it has up to 12 worker nodes and each worker node weighs 64 gigabyte memory, and the 16 cores, and the driver load is 256 gigabytes memory and weighs 64 cores. So it’s a big cluster because we have a bigger data set. So now let’s go to the first item, where we utilize pandas UDF function to do, one of the (murmurs) process, is called the fill in data notebook. So in a notebook, the first thing we do, we read in the data from the blob storage into the a spark data frame. So after we read in the data, we do some pre-process and the here, I’ll pull the size of the data frame, as you can see, the data frame has a 168.2 million rows and 74 columns.
So it’s a big data dataset. So here we’re trying to do is, we want to ferry the missing data for each column by using either the previous available venue or the next available venue. So we use Pandas UDF function to achieve this goal. So here’s the definition of the Pandas UDF function. So the first thing you do here declare, the pandas_UDF decorator. So what the decorator does is to do some pre process and the post process of the data into the panda function. So here’s a main body of the panda function. So, as you can see, it’s very simple and it’s pure on pandas and syntax because its a panda function. So (murmurs) function is defined and we call it a group by and there we call apply. So basically what it does, it split the data into different groups based on the pump ID, and then it applies a pandas UDF function we just defined each group independently. And in the end it combines the results back into a data frame.
So after that, we’d read that data out to the blob storage with Delta format. So as you can see here, the whole process took about seven minutes to run. So if you want to achieve the same goal that you would in Spark or Windows function to fail in the law venue or support each column based on the pump ID, it will take about seven hours. So the speed is almost 16 times, why? Attempts faster. So I would highly recommend that you to check the pandas UDF function out, and trust me and you will not (murmurs).
So the second item I want to show is,
Notebook workflows, and let’s go to the Notebook, we’re using Notebook workflows to run all of notebooks.
So it’s in the signal process, its automate MLflow on notebook.
So Notebook workflow, that’s your raw notebook from another notebook in a defined way, so it’s a compliment to present wrong command to the advantage of notebook or workflow, it can pass a set of parameters to the target notebook defined in widgets, and also it can return a value which you can be, or you can use that to condition other notebook (murmurs).
As we mentioned in the presentation, we modularize each step of the whole ETL process and the modeling process into different notebooks. And then we synchronize all of the notebook to run in one notebook. So it’s a more organized way and it’s easier to debug and manage the whole process. So the syntax of the notebook or workflow, it’s a very simple, so it’s called the dbutils.notebook.run so the first parameter is the path to the notebook, you’re are trying to run. So the second parameter is on time out in seconds, so this parameter is to control the time out of the notebook row, so target notebook row. So basically you don’t want to waste the resource to run a notebook for a very long time due to some code issues, which is waste of money. So the last parameter is a dictionary, so it basically passed the parameters into the widgets defined in the target notebook.
Here I didn’t return a value because we don’t use it to return value for anything. So let’s go to the target notebook to see how the witches are defined and located. So that’s the first one, select the condition data.
So, so as you can see in this notebook, we defined three widgets. So actually there’s the data breaks provides all kinds of widgets you can utilize and it’s a well documented on the website as well. So if you want to know how to define widgets just check the documentation out.
So it’s an last item I want to show is Mlflow.
So MLflow is used to managing the whole modeling process and also the models.
Currently we only use MLflow to do the model tracking and also call back the models to do their predictions. So MLflow has more functionalities and it’s growing as well.
So either kind of (murmurs) the models and then calls the models through the API. So the rest API, and also it can just to deploy the model into containers as well. So here we will only use that MLflow to tracking the models. So of course in order to use MLflow, you need to install the package as you normally do to use some other packages.
So the first thing we do, we create a experiment. So here we specify the experiment and the path and the name. So is in the workspace. And then we basically create the experiment of the,
(murmurs) and experiments will show in the workspace, it is here, transmission_exp_01.
So after the experiment is created, and we can look at the runs into the experiment, here is the syntax to log,
the parameters and the runs into the specified experiment. So in this application, we log all of the ETL parameters,
and also we log the hyper parameters from the extra boost, the modeling part. So, and the four days application, I log on the model of course, and also unlock the confusion matrix. And there’s a features that I used to run this model and also the features importance bar chart.
So after that, and all of the rounds, there were a lot of widgets experiment. So let’s go do the experiment to see what we unlocked. So double click that, and it would get into the main page of the experiment. So as you can see it marks the date, and run name and who runs that, and the resource and all of the parameters and matrix. So next stop I click one of them and gets into the details.
So here, as you can see, it’s not all of the parameters and metrics. And I want to show all of the artifacts we unlocked. So the first thing is called XGB cluster, so it’s the model (murmurs), it has three items, the ML model, so it basically specify all of the model information and it has the environment file, and it specifies all of the package dependence rate, we used to run model. And of course the model itself is in Parquet format.
And we unlock the pre-process as a model as well, so it’s terminal and we’re also unlock the confusion matrix as a picture, so you can download the picture or copy paste the picture into your report, it’s very convenient. And we unlock the feature importance, bar chart as well, and also we unlock all of the features into a GSM file. So these will help us to reproduce the model because we know what features have been used. It’s very convenient. So that’s it, that’s what I want to share today, and thanks for everyone
I am a senior tech adviser in Halliburton with the focus on predictive maintenance and process improvement. Before joining Halliburton, I had worked on power plant predictive maintenance, gas turbine simulation and modeling in GE and Siemens for over 8 years. I graduated from Georgia Institute of Technology with a PhD degree in Aerospace Engineering and a master degree in Statistics.
Varun is a tech adviser in Halliburton. Before joining Halliburton, Varun worked at TGS-Nopec Geophysical Company for four years and CGG for six years in Houston, as a seismic data processing and imaging Geophysicist. His last role at TGS was as an advising Geophysicist and team leader for cleaning, processing, and analyzing large 3D seismic datasets in the Gulf of Mexico, Canada, West Africa, and Brazil. He holds a B.S. degree in Electrical Engineering and a M.S. degree in Engineering Science, both from Penn State University.