12 billion data points processed each week
80-90% reduction in processing time
Cost savings in operations through workflow automation
In support of Wood Mackenzie’s mission, Lens is a data analytics platform built to deliver insights at key decision points for customers in the energy sector. Lens draws on vast amounts of data collected from sources and sensors that monitor energy generation, oil and gas production, and more. Those sources update about 12 billion data points every week, all of which must be ingested, cleaned and processed before they feed the Lens platform.
Yanyan Wu, Vice President of Data at Wood Mackenzie, manages a team of big data professionals who build and maintain the ETL pipeline that supplies Lens with input data. The team uses the Databricks Lakehouse Platform with Apache Spark™ for parallel processing, which delivers better performance and scalability than the earlier single-node system that processed data sequentially. “We saw a reduction of 80-90% in data processing time, which results in us providing our clients with more up-to-date, more complete and more accurate data,” says Wu.
The data pipeline managed by the team includes several stages for standardizing and cleaning raw data, which can be structured or unstructured and may be in the form of PDFs or even handwritten notes.
Different members of the data team are responsible for different parts of the pipeline, and the processing stages each member owns depend on one another. Using Databricks Workflows, the team defined a common workflow that the entire team uses. Each stage of the pipeline is implemented in a Python notebook, which runs as a task in the main workflow.
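A minimal sketch of how such a workflow could be described, assuming the field layout of the Databricks Jobs API 2.1 (`tasks`, `task_key`, `depends_on`, `notebook_task`). The stage names, notebook paths and job name are hypothetical; the source does not list them. Compute settings are omitted, so this is not a complete job definition. The helper at the end only checks the order in which the stages would become runnable.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: (task_key, notebook_path, upstream task_keys).
STAGES = [
    ("ingest_raw", "/Pipelines/lens/ingest_raw", []),
    ("clean", "/Pipelines/lens/clean", ["ingest_raw"]),
    ("standardize", "/Pipelines/lens/standardize", ["clean"]),
]


def build_job_payload(name, stages):
    """Build a Jobs API 2.1-style settings dict with one notebook task per stage."""
    tasks = []
    for task_key, notebook_path, upstream in stages:
        tasks.append(
            {
                "task_key": task_key,
                "notebook_task": {"notebook_path": notebook_path},
                # Each dependency is expressed by the upstream task's key.
                "depends_on": [{"task_key": key} for key in upstream],
            }
        )
    return {"name": name, "tasks": tasks}


def run_order(payload):
    """Return the task keys in an order that respects every dependency."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task["depends_on"]}
        for task in payload["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


payload = build_job_payload("lens_etl", STAGES)
order = run_order(payload)
```

Declaring dependencies in one shared definition is what lets every team member see which stage feeds which, rather than inferring it from separately run notebooks.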
Each team member can now see exactly what code is running at each stage, making it easy to find the cause of an issue. Knowing who owns the part of the pipeline where a problem originated makes fixing issues much faster. “Without a common workflow, different members of the team would run their notebooks independently, not knowing that failure in their run affected stages downstream,” says Meng Zhang, Principal Data Analyst at Wood Mackenzie. “When trying to rerun notebooks, it was hard to tell which notebook version was initially run and the latest version to use.”
Workflows’ alerting capabilities notify the team when a workflow task fails, so everyone knows a failure occurred and the team can work together to resolve it quickly. Defining a common workflow created consistency and transparency that made collaboration easier. “Using Databricks Workflows allowed us to encourage collaboration and break up the walls between different stages of the process,” explains Wu. “It allowed us all to speak the same language.”
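As an illustration, failure alerts could be attached to a job definition along these lines, assuming the `email_notifications.on_failure` field of the Databricks Jobs API 2.1. The recipient address and the task names are hypothetical. The second helper is not a Databricks feature; it is a plain-Python sketch of working out which downstream stages a single failure holds up.

```python
def with_failure_alerts(job, recipients):
    """Return a copy of the job settings with job-level failure emails added."""
    updated = dict(job)
    updated["email_notifications"] = {"on_failure": list(recipients)}
    return updated


def blocked_by(job, failed_key):
    """Task keys that depend, directly or indirectly, on the failed task."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for task in job["tasks"]:
            upstream = {dep["task_key"] for dep in task.get("depends_on", [])}
            # A task is blocked if any upstream task failed or is itself blocked.
            if task["task_key"] not in blocked and upstream & (blocked | {failed_key}):
                blocked.add(task["task_key"])
                changed = True
    return blocked


# Hypothetical three-stage job; names are illustrative only.
job = {
    "name": "lens_etl",
    "tasks": [
        {"task_key": "ingest_raw", "depends_on": []},
        {"task_key": "clean", "depends_on": [{"task_key": "ingest_raw"}]},
        {"task_key": "standardize", "depends_on": [{"task_key": "clean"}]},
    ],
}

alerting_job = with_failure_alerts(job, ["data-team@example.com"])
affected = blocked_by(job, "ingest_raw")
```

Here a failure in the hypothetical `ingest_raw` stage would hold up both later stages, which is the kind of downstream effect that went unnoticed when notebooks were run independently.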
Creating transparency and consistency is not the only advantage the team saw. Using Workflows to automate notebook runs also led to cost savings compared to running interactive notebooks manually.
The team’s ETL pipeline development process involves iterating on PySpark notebooks. Interactive notebooks in the Databricks UI make it easy for data professionals on the team to develop and test a notebook manually. Because Databricks Workflows supports notebooks as a task type (along with Python files, JAR files and other types), once the code is ready for production it is easy and cost-effective to automate it by adding it to a workflow. The workflow can then be revised by adding or removing steps. This way of working keeps the benefit of developing notebooks in the interactive UI while adding automation, which reduces the issues that can arise when notebooks are run manually.
The team has gone even further in increasing productivity by developing a CI/CD process. “By connecting our source control code repository, we know the workflow always runs the latest code version we committed to the repo,” explains Zhang. “It’s also easy to switch to a development branch to develop a new feature, fix a bug and run a development workflow. When the code passes all tests, it is merged back to the main branch and the production workflow is automatically updated with the latest code.”
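One way this branch-based setup could look is sketched below, assuming the `git_source` block of the Databricks Jobs API 2.1 and notebook tasks with `"source": "GIT"`. The repository URL, Git provider, branch names, job names and notebook path are all hypothetical, and the source does not describe how the team's CI system applies the definition.

```python
# Hypothetical repository; the source does not name the team's repo or provider.
REPO_URL = "https://github.com/example-org/lens-etl"


def job_for_branch(branch, job_name):
    """Build Jobs API 2.1-style settings that run notebooks from a Git branch."""
    return {
        "name": job_name,
        "git_source": {
            "git_url": REPO_URL,
            "git_provider": "gitHub",
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "clean",
                # With source GIT, the path is relative to the repository root.
                "notebook_task": {"notebook_path": "notebooks/clean", "source": "GIT"},
            }
        ],
    }


# Production tracks main; a development workflow points at a feature branch.
prod_job = job_for_branch("main", "lens_etl_prod")
dev_job = job_for_branch("feature/new-cleaning-rule", "lens_etl_dev")
```

Because the two definitions differ only in name and branch, the same tasks can be exercised on a development branch before the change is merged to main.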
Going forward, Wood Mackenzie plans to optimize its use of Databricks Workflows to automate machine learning processes such as model training, model monitoring and handling model drift. The firm uses ML to improve its data quality and extract insights to provide more value to its clients. “Our mission is to transform how we power the planet,” Wu says. “Our clients in the energy sector need data, consulting services and research to achieve that transformation. Databricks Workflows gives us the speed and flexibility to deliver the insights our clients need.”