Champions
of Data + AI

Data leaders powering data-driven innovation

EPISODE 21

The Power of Data Fusion

Yanyan Wu joins the Champions of Data + AI podcast to dive into the importance of having large amounts of diverse, clean and high-quality data to drive new insights. She also discusses how modern cloud data architectures, like the Lakehouse, make it easier than ever to combine data sets, structured to unstructured, to execute algorithms in near real-time. We’ll also explore how the work she and her data team are doing is helping oil and gas companies run more efficiently and generate more energy to help fill the gap brought on by current global events. If that wasn’t enough, she’ll also share the role Canada geese play in aircraft safety.

Yanyan Wu
VP, Data and Data Analytics, Wood Mackenzie, a Business of Verisk

Chris D’Agostino:
We’re back with our next episode of The Champions of Data and AI. I’m your host, Chris D’Agostino, Global Field CTO at Databricks. Today I’m joined by Yanyan Wu, Vice President of Data and Data Analytics at Wood Mackenzie, a business of Verisk. Yanyan and I will not only discuss the role that Canadian geese play in keeping you safe as you fly at 35,000 feet.

Chris D’Agostino:
But more importantly, we’ll dive into the importance of having large amounts of diverse, clean and high-quality data to drive new insights. We’ll cover how modern cloud data architectures like the Lakehouse make it easier than ever to combine datasets, structured to unstructured, to execute algorithms in near real time. We’ll also explore how the work she and her data team are doing is helping oil and gas companies run more efficiently and generate more energy to help fill the gap brought on by current global events.

Chris D’Agostino:
Let’s get started. So, Yanyan, it’s great to have you here today. Thanks for being part of Champions of Data and AI.

Yanyan Wu:
Thank you for inviting me. Great to be here.

Chris D’Agostino:
So let’s dive right in. You know, you’ve had a long career with data in energy and you know, you’ve got your PhD in computer aided design. Can you tell us a little bit about what got you into data analytics, artificial intelligence and kind of what inspired your Ph.D. research?

Yanyan Wu:
Yeah, I think the answer is probably disappointing, because it was not planned. I go with my passion. So at the time, several years ago, around 2014 – 2015, data was hot, but not as hot as today. And it was just fun to learn about data: how to visualize data, how to analyze data, how to deal with big data.

Yanyan Wu:
That’s how I got into it. And then somehow my career shifted with my passion. So I used to be, as you said, in computer-aided design, a mechanical engineer, a product manager, and that eventually guided me through. So with the learning that built up, I eventually shifted to the data and AI world.

Chris D’Agostino:
That’s awesome. Yeah. So I studied electrical engineering and, you know, as part of that curriculum, we had to do some mechanical engineering style work and use some of the CAD type capabilities. And it was always fascinating to me, the types of systems you could model out and create 3D representations for. So, you know, when we first got to know each other, you talked about designing aircraft engines.

Chris D’Agostino:
We’ll come back to that. But how did the experience of working with aircraft engines and all of the data, like, did that influence your passion for, you know, even more data analysis, and push you in that direction for your career?

Yanyan Wu:
Yeah, I have to say, all my career is about data. It’s just in different formats, as you said yourself, Chris. I used to be in 3D, in geometry, in the manufacturing application and design world. Now, shifting to energy data, it’s a different format of data. It’s structured and unstructured, and a lot of it, many times, is time series data, not 3D.

Yanyan Wu:
But all my career, if you look at it, all I’m doing is just dealing with different formats of data. The formats are changing, but the philosophy is the same. Actually, Chris, I like your major, double E. I know that’s something that, if you let me pick again, I would probably go for, because in your major you talk about how to do noise removal, signal processing, the frequency domain.

Yanyan Wu:
That’s something I think in fusion we should look into more: how to combine double E knowledge and expertise with data and AI. I would like to do more on that side, so maybe I can borrow your expertise down the road.

Chris D’Agostino:
Well, my expertise is a bit dated, but yeah, you know, I spent my early career doing semiconductor design, sort of low-level physics type stuff. And then I, of course, studied computer science as well. So my career shifted, like yours, very much from the tangible real world, building something you can actually see and touch, to, you know, distributed computing software that runs on these machines and produces a result.

Chris D’Agostino:
But, you know, it’s not the same as building tangible objects, which is something that always fascinates me. So in looking at, say, oil and gas and what you’re doing today: when we first met and got to know each other a bit, we talked about what’s going on in the world in 2022, with a movement towards reducing carbon footprints. That challenge gets worsened, if you will, by the fact that there’s the Russia-Ukraine war and, you know, Russia turning off pipelines to different countries that aren’t in support of what Russia’s doing.

Chris D’Agostino:
So these energy companies really have a big challenge on their hands. On the one hand, they’ve got to maybe add or supplement energy because there’s less available energy from Russia. So they’ve got to provide more traditional sources of energy and do it in a more efficient way. And then at the same time, they’ve got to be thinking about how they transition to cleaner energies and maybe completely alternative fuels.

Chris D’Agostino:
So, you know, collecting the data for how they’re running their current business and finding opportunities for investment to transition is a big part of your job, right?

Yanyan Wu:
Chris, you’re right on. That’s how, you know, we get more data every day to work with. Our goal is to have a sustainable world in terms of energy demand and supply. It’s not going to happen overnight. Of course, we also want to go to, you know, green energy, renewable energy.

Yanyan Wu:
I think that’s the goal for everybody. However, it’s not going to happen right away. So we still have to work with the traditional energy, which is oil and gas, especially, as you highlighted at the beginning, with the hole brought on by this Ukraine-Russia conflict. With that, I think making energy sustainable for the world through the traditional oil and gas business is more important now than ever.

Yanyan Wu:
The other challenge that our clients and we in the energy business have is how to improve the efficiency or the productivity of the traditional oil and gas business. How do they do that? They have to have the data, right? So, with my previous experience, what I see is every company has data.

Yanyan Wu:
No company says, “I don’t have any data.” They all have data. The problem is, if you look at just your own data on its own, it’s not going to be enough. It may generate some competence and some insights, but you will get exponentially more value if you are able to link the data with enriched data from other data sources, which is what Wood Mackenzie and Verisk are providing.

Yanyan Wu:
So our role is mainly, with the more enriched data that we’re able to collect, to help our clients and empower them to get bigger and more insights from the data they have.

Chris D’Agostino:
Can you talk a little bit about the strategy within the company for identifying these alternative data sets and these third-party data sets? You know, talk a little bit about the team and how you do the ideation around, well, if we had this data set, we could maybe get this kind of insight that would help the oil and gas industry be more efficient or more productive.

Chris D’Agostino:
How do you and your team go through discovering what data sets make sense?

Yanyan Wu:
Yeah, the data is enormous right now. So actually you don’t have to discover it; it’s just thrown at you. It’s a lot of data. The problem is how you prioritize it. How do you know which data source is better than the other, or which one is more important for clients? Otherwise, you’re not going to have any time to sleep or eat, because the data is enormous.

Yanyan Wu:
So a lot of the time we spend is on identifying, with so many datasets, which one is more important to our clients, which one has a little bit better coverage and higher frequency, and then what strategy we have to use to combine the datasets. Just to come back to my, you know, previous days with GE in the aircraft engine business.

Yanyan Wu:
One of the projects we did is what we call multi-modality inspection. Basically, each inspection methodology has different strengths. How can you get the best dataset? You fuse the data together. We call that multi-modality data fusion. We have a patent on that one from my previous business.

Yanyan Wu:
So the same thing applies here for the oil and gas business. You have to be able to come up with a way to fuse the data so you can have the best datasets for clients. So that’s how the majority of the time is spent: understanding the quality of the data from different data sources, and coming up with a strategy to effectively and efficiently fuse the data together to provide the best datasets to our customers.
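
Editor’s sketch: one textbook way to make “fusing” concrete is inverse-variance weighting, where each source’s measurement of the same quantity is weighted by how reliable that source is assumed to be. This is not the patented multi-modality method mentioned above, and the `Reading` type, the source names and the variance figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    source: str      # hypothetical source name, e.g. "vendor_feed"
    value: float     # the measured quantity
    variance: float  # assumed noise variance for this source (must be > 0)


def fuse(readings):
    """Return (fused_value, fused_variance) using inverse-variance weights."""
    if not readings:
        raise ValueError("need at least one reading")
    if any(r.variance <= 0 for r in readings):
        raise ValueError("variances must be positive")
    # A source with a smaller variance gets a proportionally larger weight.
    weights = [1.0 / r.variance for r in readings]
    total = sum(weights)
    value = sum(w * r.value for w, r in zip(weights, readings)) / total
    # The fused estimate is never noisier than the best single source.
    return value, 1.0 / total
```

Under these assumptions a precise source dominates a noisy one, and the fused variance is smaller than either input’s, which is the sense in which combined data can be better than any single dataset.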

Chris D’Agostino:
So let’s talk a little bit about how technology in the data platform space has evolved over time, since you’ve been doing this for a while and doing that data fusion. You know, if your experience is anything like my experience, the data was stored in different systems. You could do some analysis on those core systems, but it was really difficult to bring data together at scale and run algorithms that would, you know, give you new insights in a timely manner.

Chris D’Agostino:
Is that something that you’ve seen as well?

Yanyan Wu:
You know, if you look through the years, you remember the Excel time. You know, everybody is on Excel, and Excel is the best, whatever you have. But I think a lot of people know that all those tools have a limitation, right? As I said, the data is growing.

Yanyan Wu:
It’s just exponentially growing every day and becoming bigger and bigger. Excel can deal with millions of rows very, very effectively, right? That’s, I think, a very effective tool to deal with that. However, the data has grown beyond that. We have billions and billions of rows and hundreds of thousands of columns of structured and unstructured data. How do you find a platform that enables you to process that data quickly?

Yanyan Wu:
Right. Think about any automation or anything you are building: if you want to debug anything, how can you find a tool where you’re not waiting for hours just to get a line of feedback to figure out if you are doing the correct thing or not? In terms of that kind of capability, we found that Databricks is really a good platform for us. It empowers us to effectively and efficiently deal with big data, like billions and billions of rows, and also reduces all the lags, the time we spent

Yanyan Wu:
in terms of how to link the data from different sources. And with the lakehouse structure, you know, you have metadata stored in the database. You can bring all the data, transparently, onto the same platform. Everybody sees what the data is, and it’s there for you. You can do time travel.

Yanyan Wu:
So you don’t need to worry: if I mess up this version of the data, how am I getting my older version back? You can just call the previous version back very, very easily. And with Auto Loader and all those capabilities, you can find out what’s the out of your data. So all of that can now be done very easily in-house on the Databricks platform.

Yanyan Wu:
And that’s something that we could not imagine before. Another one of the many benefits we gain from Databricks is efficiency through scheduling the work. It used to be, you know, you had to get up on a certain day and in the morning start a routine. Now it can be automatically scheduled, and you get informed once the job is done.

Yanyan Wu:
So with all the time that we saved, we can now focus on improving data quality, improving data completeness, understanding the data insights, and getting to the goal. We really start to use our expertise to take the gold out of the data we have, versus just managing the logistics of big data.
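
Editor’s sketch: “time travel” above means reading a table as it existed at an earlier version and, if needed, bringing that version back. The toy below is deliberately not the Delta Lake or Databricks API; it is a minimal in-memory illustration, and the class name `VersionedTable` and its methods are hypothetical.

```python
import copy


class VersionedTable:
    """Toy append-only version log: every commit keeps a full snapshot."""

    def __init__(self):
        self._versions = [[]]  # version 0 is an empty table

    def commit(self, rows):
        """Store a new snapshot and return its version number."""
        self._versions.append(copy.deepcopy(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one by version number."""
        if version is None:
            version = len(self._versions) - 1
        if not 0 <= version < len(self._versions):
            raise IndexError(f"unknown version {version}")
        return copy.deepcopy(self._versions[version])

    def restore(self, version):
        """Bring an old version back by committing it as the newest one."""
        return self.commit(self.read(version))
```

Restoring does not erase the bad version; it adds a new version whose contents match the old one, so the full history stays auditable. Real systems store change logs rather than full copies, which this sketch ignores for brevity.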

Chris D’Agostino:
Yeah, we talk to a lot of leaders, and I specifically, in my role with the company, talk to a lot of leaders, and one of the common themes is returning more time to the user to do analysis. And so organizations are moving away from trying to do a lot of the infrastructure and the plumbing associated with data movement, data cluster management and things like that, which they don’t see as providing competitive advantage to their organization.

Chris D’Agostino:
They say, look, you know, we really want to not have to worry about the flow of the data and the availability of the data. We want to automate as much of that as possible. We want to really get our engineers and data science community and data analytics community analyzing the data and getting those insights and doing that more quickly.

Chris D’Agostino:
So that’s, you know, very much consistent. And we have customers in the oil and gas industry. Shell Oil is a big one for us. Dan Jeavons was on this podcast and he’s done a lot of speaking engagements, you know, alongside Databricks in different forums. And so for those in the audience, we’ve got a great podcast if you want to listen to that, but we’d love to hear what you and your team at Verisk are doing because, you know, with Shell, for example, they were doing a lot around supply chain analysis.

Chris D’Agostino:
They were looking at fatigue, equipment fatigue, and when they might need replacement parts for an oil rig. Your company is really providing a really cool service to the oil and gas industry. So without, you know, maybe getting into specific customers, can you give us some sense of what type of support you’re providing to oil and gas?

Yanyan Wu:
So if you look at Wood Mackenzie’s offering, we are a data provider, and we also provide consulting services for the energy and energy investing industry. As I said, the challenge for every company is that they have their own data, right? Their own company data they know very, very well. However, they may not have the data across the industry that they would like.

Yanyan Wu:
Whether it is about their peers in the industry or even related vendor information, they may not have it. So that’s where we start. We provide the data that they may not have on their own. So it’s cross-industry, across different companies, across different domains. That’s one aspect. For example, you know, if one operator is drilling in one region and they want to know what happened in another region, right?

Yanyan Wu:
What’s the production activity data, what’s the return on investment, what’s the cost? That we can provide to them, so they can use it as a reference to benchmark their productivity figures and see if they need to move to other regions to invest. And then also another division, it’s not my division, but another division of Wood Mackenzie, provides real-time energy data.

Yanyan Wu:
So they have sensors, they have a network everywhere. For example, they can measure electricity generation across states, across counties, in real time, and oil and gas production rates through sensor measurements from the pipelines and all this. So the real-time measurement data, that is another benefit. If you wait for the reported data, it has a lag. It’s not real time; it may be two or three months, or however many months it is, across different states.

Yanyan Wu:
You get those power generation numbers or oil and gas numbers or whatever energy-related number it is with a lag. But that business has real-time measurement sensors everywhere, and they can provide you the real-time data related to the energy business. So we fill a lot of holes for our clients, the pockets where they don’t have the data on their own. They can leverage the data we have so they can create meaningful linked datasets and generate many more insights based on, you know, the data that we provide.

Chris D’Agostino:
So then these customers use these data insights and data products to, you know, in theory make decisions that really influence millions of dollars of investment from their end right? So let’s talk about data quality and the importance of ensuring that you’re generating good data sets that allow you to create these insights.

Yanyan Wu:
Yeah, there are three things that we mainly focus on. One is the timeliness of the data, the other is completeness, and then the accuracy. Those three aspects are the key components of data quality. So it’s not just checking correct or wrong; it has to be timely, it has to be accurate, it has to be complete.

Yanyan Wu:
We spend a lot of time on this. Complete means, especially when you have big datasets, that you are able to aggregate data from hundreds of thousands of data sources, put them together, check them, and look at the data profile. You remove the noise using an efficient noise removal algorithm. You cannot just eyeball it; you have to have an automated routine to do it all.

Yanyan Wu:
This requires a reliable platform. We use the cloud providers’ platforms, including Databricks, which specializes in big data processing, and that enables us to improve the data quality to the point that, without those tools, it would not be possible.
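
Editor’s sketch: the three quality dimensions named above (timeliness, completeness, accuracy) can each be turned into a simple automated score. The function below is a minimal illustration under assumptions that do not come from the interview: records are dictionaries with a `timestamp` key, and accuracy is approximated by a plausibility range check because no ground truth is available. All field names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta


def quality_report(records, required_fields, now, max_age, field, low, high):
    """Score a batch of records on three data quality dimensions (0.0 to 1.0)."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to profile")
    # Completeness: every required field is present and not None.
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    # Timeliness: the record is no older than max_age.
    timely = sum((now - r["timestamp"]) <= max_age for r in records)
    # Accuracy proxy: the value exists and falls inside a plausible range.
    plausible = sum(
        r.get(field) is not None and low <= r[field] <= high for r in records
    )
    return {
        "completeness": complete / n,
        "timeliness": timely / n,
        "accuracy_proxy": plausible / n,
    }
```

A scheduled job could run a check like this on every incoming batch and flag sources whose scores fall below a chosen threshold, replacing the eyeballing the interview warns against.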

Chris D’Agostino:
Okay. So we’re going to transition to some fun topics here that are just outside of the data space, because of our shared background in airplanes and passion for that. But, you know, in summary, it sounds like in your career you’ve always been around data. You’ve been following it as a passion more than something you just set out specifically to pursue.

Chris D’Agostino:
And so your passions have taken you into the realm of data and, you know, different data analysis and different types of data sets and different types of insights. But the common theme is these organizations collect a lot of data. They have a lot of internal data. What you find yourself being able to do now is combine that internal data with external data sources and find ways to add value.

Chris D’Agostino:
And, you know, you’re running a team to provide that for your customers. Is that a pretty fair summary?

Yanyan Wu:
Yeah, it is, Chris. Just to add to what you said: I think the theme is always the same, right, no matter what role I was in. For example, at the time I was doing aircraft engine research projects for GE, and now with Wood Mackenzie and Verisk, providing energy data to our clients, it’s all about how you gather data.

Yanyan Wu:
How do you ensure that, with the data that you have, you combine it and get the best data for making the best decision? Just to give an example, we used to have a patent application on this. It’s called the multi-modality inspection project for aircraft engines. So basically, if you understand NDE, which is nondestructive evaluation, for aircraft engine inspection, you find out that each modality has different strengths.

Yanyan Wu:
Ultrasound is good at identifying the layers, so delamination is something that it’s very good at. X-ray is good at porosity identification, so it is very sensitive to any density change. And a coordinate measuring machine, CMM, is good at measuring the X-Y-Z coordinates. So if you put them together, you get the best datasets, right? We called it data fusion, multi-modality inspection, at the time.

Yanyan Wu:
It’s the same thing we are working on now with the energy datasets. As I mentioned before, the data is enormous. How do you combine that data from different datasets, different sources, and provide the best data to your clients? It’s the same thing. It’s also data fusion, right? So it doesn’t matter whether it’s in the 3D world, the data in 3D, which is geometry, that’s my aircraft engine time, or now time

Yanyan Wu:
series data, structured and unstructured time series data. But the theme is the same. It’s always about how you understand what the strategy is. Can you build a team around it, build a strategy, identify the best platform, like Databricks or other providers, and the best tools to let you understand your data more efficiently, be able to process data efficiently, and combine the datasets to fuse them?

Yanyan Wu:
That enables the best datasets for your clients, so you can empower them to make the best decision.

Chris D’Agostino:
So I mean, the thing that I love about, you know, as you know, I like aircraft a lot, and aircraft engine design. The thing that’s fascinating to me about an aircraft engine, especially a jet engine, is that it is its own power plant, right? And all the things that we’re talking about globally with reducing carbon footprint are important there.

Chris D’Agostino:
Right? You want these engines to be as fuel efficient as possible and the emissions to be as low as possible. So there’s a lot of engineering that goes into the design of that. These engines obviously have to be reliable, because you don’t want the engine to cut off in the middle of the flight. And so part of that is, as you say, right, looking at the materials science of how the fan blades are created and making sure that they’re as lightweight as possible, but they’re as strong as possible.

Chris D’Agostino:
And, you know, the people that study aircraft design and things like that know that birds are used to test aircraft engines. You know, they’ll run an aircraft engine in a test facility and a bird will go into it. But, you know, I learned from you that it’s not just any old bird.

Chris D’Agostino:
It’s a very specific type of bird. So share with the audience what that is.

Yanyan Wu:
Yeah, the aircraft engine industry is probably the most strict industry that I have worked with, because it relates to life, right? You know, anything you do is related to the safety of people. So if you look at it, it’s all about data tracking, right? When I was working on the aircraft engine projects, you go to those plants, and it’s fascinating.

Yanyan Wu:
But if you look at it, it’s all about data. It’s not like, you know, you make a pot of water, you make a computer, and then probably every one looks the same. For airplanes, every plane has a life certificate and a serial number. You can trace where the material comes from, when it was made, and which steps were processed.

Yanyan Wu:
It has a certificate. You know, the aircraft engine plant is the cleanest structure you can ever imagine. It’s actually better than my kitchen. It’s very, very clean, and everything is checked, and everything you do has to be certified. But what do they certify? It’s all about data, right? So if you look at the energy business that we work with today, it’s a different scale, because we’re not tracking individual drop beds or each individual wellhead; we’re tracking a batch of the field that is in production.

Yanyan Wu:
But then it’s a different challenge, right? Then your scale of data goes up. So in the aircraft engine world, it is: how do we make sure that each individual one has a lower risk, including that you have to shoot the Canadian geese, and to that to the program based on the eyes, and all these things you have done.

Yanyan Wu:
But in today’s world, with the massive data we have, you have to reduce the data quality issues, you have to reduce the noise, making sure on the mass scale that the risk is the lowest for clients, so that when they use our data they can be confident in the decisions they make based on the data we have.

Yanyan Wu:
So it’s different, but it doesn’t matter whether it is aircraft engines or energy data, oil and gas or power plants. If you look at the everyday, what the majority of people we work with do is understand the data and how to reduce the risk for everybody, including lives, based on the data we have.

Chris D’Agostino:
Yeah. So that’s fascinating. So for people that fly, you should be thanking the Canadian geese for your safety, is what you’re telling us. We also talked with Stuart Hughes from Rolls-Royce Aircraft Engines. And that podcast is interesting because of the way in which Rolls-Royce monetizes the aircraft engine. So if you’re tuning in and you haven’t heard that podcast, it’d be great to, you know, go to our website and have a listen.

Chris D’Agostino:
So, okay, well, let’s close it out. So, I mean, you’ve followed your passion in your career. This is fantastic. What advice would you give to people that are, you know, hearing your story? You know, they’re learning about your PhD background and the research that you’ve done, the different organizations that you’ve supported, and now what you’re doing at Verisk. You know, what advice would you give to somebody aspiring to maybe be in your role one day?

Yanyan Wu:
Always follow your passion. I have other people approach me, especially folks just starting a career, asking for my advice on what to do as a next step. I always give them a suggestion: don’t just gauge where you need to go based on how hard it is today. It’s going to be hard today.

Yanyan Wu:
It’s not going to be hard tomorrow. Look at the other frontiers that you have a passion for. If you’re truly passionate about data, here is what I would suggest. Say you ask me: you’re a mechanical engineer, or you’re a double E major like you, Chris, what do you do? Take some classes, take some action if you want to. And then, don’t think about it as in this world, right?

Yanyan Wu:
It used to be that you just become a manager to lead your team. Now, in the data world, and not only data but any technology world, they need leadership: a true leader who has the vision, like Steve Jobs and Tim Cook. I think people like them not because they’re managers but because they’re very visionary and they are hands-on.

Yanyan Wu:
They can do the work. They know the details. You know, they know how to rescue the project, because they have the knowledge and the experience that enable them to do it. And people view them as a value-adding leader. So that’s something that, you know, I would suggest to anyone aspiring to be in my role and to be a leader in the data world. You have to think about it.

Yanyan Wu:
What can I do to build up my skillsets and my experience, to enable me to become a visionary leader?

Yanyan Wu
Thank you for joining this episode of Champions of Data and AI, brought to you by Databricks. Thousands of data leaders rely on Databricks to simplify data and AI so data teams can innovate faster and solve the world’s toughest problems. Visit databricks.com to learn how data leaders are unlocking the true potential of all their data.