Bridging the Completeness of Big Data on Databricks

May 27, 2021 11:00 AM (PT)


Data completeness is key for building any machine learning or deep learning model. The reality is that outliers and nulls are widespread in real data. Traditional methods of filling with fixed values or statistical metrics (min, max, and mean) do not consider the relationships and patterns within the data; most of the time they offer poor accuracy and can introduce additional outliers. Also, given our large data size, the computation is an extremely time-consuming process, and it is often constrained by the limited resources of a local computer.
To address those issues, we have developed a new approach that first leverages the similarity within our data points, based on the nature of the data source, and then uses a collaborative AI model to fill null values and correct outliers.
In this talk, we will walk through the way we use a distributed framework to partition data with a KDB-tree for neighbor discovery, and a collaborative filtering AI technique to fill the missing values and correct outliers. In addition, we will demonstrate how we rely on Delta Lake and MLflow for data and model management.

In this session watch:
Chao Yang, Data Scientist, Verisk
Yanyan Wu, VP, Data and Data Analytics, Verisk



Chao Yang: Hello everyone, thanks for joining this session. My name is Chao Yang, I'm a Director of Data at Wood Mackenzie. Today I'm joined by Yanyan Wu, the VP of Data at Wood Mackenzie, to show you some of our recent work on bridging the completeness of big data on Databricks. We have filed a patent on this work, and we also want to thank our co-inventors, who are not here, [inaudible] at Wood Mackenzie.
First, we will start by highlighting an overview of our use cases and the limitations of the existing methods we've encountered, along with an overview of our data pipeline. Then we will move to the detailed implementation, and we will show you the source code for the major steps of this process. To end, we will share some tips we learned from this project.
Let's start with why we began this work. We all know null values exist in almost every data set; it's unavoidable. But a lot of the time we can't afford to throw them away, as either our data size is limited or the data is part of a key attribute. Also, machine learning models don't handle null values well, and many simply don't work with null values at all. So that's why we started this work.
At Wood Mackenzie we are experts in oil and gas data. We have broad coverage from conventional and unconventional to subsurface, and we also have a strong focus on power and renewables. We have our own data platform called LENS, which can be integrated into client systems and accessed through API services. [inaudible] This is part of ongoing work to improve our data quality, so clients get better data on our LENS platform.
Before this project, we tried all the major popular null-filling methods, such as filling with fixed values or using statistical metrics like mean, max, or minimum. One thing we discovered is that those methods give poor accuracy. We think that's because they're isolated: they don't consider the other attributes and only focus on one attribute at a time, so they miss the relationships or associations between attributes. Also, those methods are quite time consuming with a big data size. At this point, you can probably see a survey on your screen; it would be great if you could share the methods or ideas you have been using to fill nulls, thanks.
This is a high-level overview of our data pipeline. All our data is stored on AWS S3 in Parquet format. First, we use Apache Sedona to do spatial neighbor discovery, and at this step we can fill the majority of the nulls by leveraging neighboring information. The nulls we can't fill at this point are passed to the next stage, where we use the Spark ML library to build our models, and this AI model then fills the nulls left over from step one. In the meantime, we use MLflow for model management and deployment. Finally, we push our data to the LENS platform, where clients can query it directly.
Let's start with neighbor discovery. On the right you can see a visualization from our deep view platform, which is a tool to display the wells; in this case, each line represents one oil well. The goal of this work is to discover all the neighbors and the distances between wells.
The major challenges we had were, first, the data set was too big. In this data set we have about 314,000 wells, and each well has more than a hundred data points. With that much data, the compute time is very long on a regular single machine, and a lot of the time we couldn't even finish the work because we ran out of memory. So we started using Apache Sedona, which was formerly called GeoSpark. It's a distributed framework for processing big spatial data. One of the main algorithms we used here is the KDB-tree, which is a geometric approach: basically, it subsequently divides an n-dimensional space and yields a tree structure, and points are partitioned based on [inaudible]
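To make the KDB-tree idea concrete, here is a minimal pure-Python sketch (a toy illustration, not the Sedona implementation): it recursively splits 2-D points at the median, alternating between axes, until each leaf partition holds at most a fixed capacity of points.

```python
def kdb_partition(points, capacity=4, axis=0):
    """Recursively split 2-D (x, y) points at the median of the
    current axis until each partition holds <= capacity points."""
    if len(points) <= capacity:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    next_axis = 1 - axis  # alternate between x and y splits
    return (kdb_partition(pts[:mid], capacity, next_axis)
            + kdb_partition(pts[mid:], capacity, next_axis))

# Some synthetic well coordinates:
wells = [(x * 0.7 % 5.0, x * 1.3 % 7.0) for x in range(40)]
leaves = kdb_partition(wells, capacity=4)
```

Sedona uses the resulting leaf boundaries as its spatial partitions, so nearby wells tend to land on the same executor and the distance join stays local.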
This is a snapshot of our code for the Sedona part. The first thing we need to do is set up a spatial Spark context and register the Spark context with the GeoSpark SQL engine; then we start loading our data. After loading the data, we need to create a geometry object column; here we use the latitude and longitude, so it's a point object. Then we convert this data set to a spatial RDD. In the meantime, we create a circle RDD; the circle RDD basically draws a circle around each point with a given radius. After we create both RDDs, we partition the data using the KDB-tree, then we perform a distance join. In this join, we overlay the circle RDD on the point RDD, and when a point falls inside a circle, we consider it a neighbor. At the end, we convert those RDDs back to data frames for easy processing.
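Semantically, the circle-RDD distance join computes the following (a brute-force pure-Python paraphrase for illustration only; Sedona performs this in parallel across the KDB-tree partitions rather than comparing every pair):

```python
from math import hypot

def distance_join(points, radius):
    """For each point, find all other points whose center falls
    inside a circle of the given radius around it."""
    pairs = []
    for i, (x1, y1) in enumerate(points):
        for j, (x2, y2) in enumerate(points):
            if i != j and hypot(x2 - x1, y2 - y1) <= radius:
                pairs.append((i, j))  # well i has neighbor j
    return pairs

wells = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
neighbors = distance_join(wells, radius=1.0)
# wells 0 and 1 are within 1.0 of each other; well 2 is isolated
```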
This is a sample of our results. On the left is the left-geometry, which represents one well; next to it you can see all the nearby wells, and also the distance [inaudible] between them. At this step, we leverage the neighboring information to start filling the null values. We tried average, weighted average, and regression models, and we found the weighted average gives very good results. The nulls we couldn't fill here were passed to the next step, which uses a collaborative AI model to fill them.
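The weighted-average fill can be sketched like this (a minimal illustration of one common weighting choice, inverse distance, so that closer wells count more; the exact weighting we used is not shown in the talk):

```python
def idw_fill(neighbors):
    """neighbors: list of (distance, value) pairs for one well.
    Returns the inverse-distance-weighted average of the values."""
    weights = [1.0 / d for d, _ in neighbors]
    total = sum(w * v for w, (_, v) in zip(weights, neighbors))
    return total / sum(weights)

# A missing vertical depth, estimated from three nearby wells
# at distances 1, 2, and 4 (units are illustrative):
estimate = idw_fill([(1.0, 1000.0), (2.0, 1200.0), (4.0, 1400.0)])
```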

Yanyan Wu: Thanks Chao for presenting that portion. As Chao mentioned, in the first step we use GeoSpark to discover the neighboring information and fill the nulls with it. Because there is a constraint on the distance you can use to search for neighbors, after the first step there are still some null values that are not filled. So we pass that data on to the second step, which uses collaborative AI to fill the rest of the nulls.
This is actually not a new methodology; we leverage the proven, widely used collaborative filtering AI behind movie recommendations and other recommender systems; people do that all the time. The model we leverage is called ALS, the Alternating Least Squares model from Spark MLlib. If you go through the link we provided, they have code samples showing how to use that library. Leveraging the existing Spark MLlib library saved us a lot of time and effort, but to use that model, we have to map our data to the format required by the model inputs.
In this model, there are three columns you have to shape your data into. One is the user ID, which corresponds to our object, the well ID. Then the item: in this case, we map the attribute names to the item names. The last one is the rating: basically, we convert the attribute value to the rating. And if everybody remembers their linear algebra class, the fundamental behind those recommender systems is singular value decomposition, which I've illustrated conceptually at the bottom of this page.
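Conceptually, ALS factorizes the sparse well-by-attribute matrix into two low-rank factors and reads missing entries off their product. Here is a tiny rank-1, pure-Python sketch of the alternating updates (a toy for intuition only, not Spark MLlib's implementation, which handles regularization, higher ranks, and distribution):

```python
def als_rank1(ratings, n_users, n_items, iters=30):
    """ratings: dict {(user, item): value} of observed entries.
    Alternately solve for user factors u and item factors v so
    that u[i] * v[j] approximates each observed value."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        for i in range(n_users):  # fix v, solve each u[i] by least squares
            obs = [(j, r) for (a, j), r in ratings.items() if a == i]
            u[i] = sum(v[j] * r for j, r in obs) / sum(v[j] ** 2 for j, _ in obs)
        for j in range(n_items):  # fix u, solve each v[j]
            obs = [(i, r) for (i, b), r in ratings.items() if b == j]
            v[j] = sum(u[i] * r for i, r in obs) / sum(u[i] ** 2 for i, _ in obs)
    return u, v

# Rank-1 data generated from u=[1,2,3], v=[1,2], with the
# (user 2, item 1) entry (true value 6.0) held out as "null":
obs = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
u, v = als_rank1(obs, n_users=3, n_items=2)
predicted = u[2] * v[1]  # estimate for the held-out entry
```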
This is a screenshot of the code we use for the collaborative AI, leveraging the Spark MLlib ALS model to fill the nulls. The first step is to set up the ML pipeline. In this case, we call the ALS model as the first stage, and we found that it doesn't matter whether or not you normalize the values. That's the reason we don't have other stages in this pipeline; it's a very simplified version.
The second step is to specify your evaluator; in this case, we use a regression evaluator. Then, in order to optimize the model, we use a grid search to search the parameter space; that's the next step. The last one is, in order to generalize the model and prevent overfitting, we use cross validation to optimize the parameters. Of course, you can use Hyperopt; those are newer libraries that conduct the same evaluation or cross-validation process, but in this case we use cross validation. One thing we didn't get a chance to put here, but actually utilized, is that it's very helpful to set the parallelism parameter in the cross validator; that will save you a significant amount of training time.
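The grid search with cross validation described above follows the usual pattern. Below is a minimal, library-free sketch of the loop, with the parallelism tip illustrated via a thread pool (Spark's CrossValidator exposes the same idea as its `parallelism` parameter). The ridge-slope model here is just a stand-in for ALS:

```python
from concurrent.futures import ThreadPoolExecutor

def fit(train, reg):
    """Toy model: ridge-regularized slope through the origin."""
    num = sum(x * y for x, y in train)
    den = sum(x * x for x, _ in train) + reg
    return num / den

def rmse(slope, test):
    return (sum((slope * x - y) ** 2 for x, y in test) / len(test)) ** 0.5

def cv_score(data, reg, k=3):
    folds = [data[i::k] for i in range(k)]
    def one_fold(i):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        return rmse(fit(train, reg), folds[i])
    with ThreadPoolExecutor(max_workers=k) as pool:  # folds in parallel
        return sum(pool.map(one_fold, range(k))) / k

data = [(x, 2.0 * x) for x in range(1, 10)]  # noise-free y = 2x
grid = [0.0, 1.0, 10.0]                      # parameter grid to search
best_reg = min(grid, key=lambda reg: cv_score(data, reg))
```

On this noise-free toy data the unregularized model fits exactly, so the grid search selects `reg = 0.0`; on real, noisy data the cross-validated error picks the regularization that generalizes best.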
As we mentioned before, when leveraging the existing MLlib ALS model, the required input data has to be shaped into three columns, which I show at the bottom right of this slide. If you take a look at the data frame at the bottom left, this is a sample of the data we use: each row represents a well ID and its parameters, such as longitude, latitude, lateral length, and vertical depth. This is the regular PySpark data frame format. The screenshot at the top shows the simple code we use to convert the bottom-left table into the bottom-right table. The code is actually very simple: you just use a for loop to convert the data from the left format to the right format required by the ALS model. After the conversion, you can see the object maps to the user ID, the attribute names become the item names, and the attribute values become the ratings.
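The reshaping loop described here amounts to "melting" the wide well table into (user, item, rating) triples, skipping the nulls that the model will later predict. A plain-Python sketch (the actual code operates on PySpark data frames; the sample values below are hypothetical):

```python
def melt(rows, id_col, attr_cols):
    """Convert wide rows into (user, item, rating) triples,
    dropping null attribute values."""
    triples = []
    for row in rows:
        for attr in attr_cols:
            if row[attr] is not None:  # nulls are what ALS will fill
                triples.append((row[id_col], attr, float(row[attr])))
    return triples

wells = [
    {"well_id": "w1", "latitude": 31.9, "longitude": -102.1, "lateral_length": 9800},
    {"well_id": "w2", "latitude": 32.0, "longitude": -102.0, "lateral_length": None},
]
ratings = melt(wells, "well_id", ["latitude", "longitude", "lateral_length"])
```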
In this work, we leverage MLflow, built into Databricks, for model management and revision control. Apologies for the low resolution, but if you log into Databricks, on the left menu bar you can find the model icon you can click on, where you can find all the models you have trained. Each model has different versions. In the example here, we have 11 versions of a model, and we've set version 10 as our production model; that's the stage we set for the model. Of course, there are other stages, like Staging, Archived, or [inaudible]. If you select one version of the model, the production model, then, as shown in the lower portion of the slide, you can just call a simple line of code to load your production model. So it's very easy to use, and it makes it easier to manage the versions of your models, as well as retrieve the model when you make predictions with it.
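The "simple line of code" uses the MLflow Model Registry URI scheme, `models:/<name>/<stage>`. A sketch (the model name below is hypothetical, and the load call itself is shown commented out since it needs a live Databricks/MLflow tracking environment):

```python
def registry_uri(name, stage="Production"):
    """Build an MLflow Model Registry URI for a named model stage."""
    return f"models:/{name}/{stage}"

uri = registry_uri("well_attribute_als")  # hypothetical model name
# In a Databricks notebook you would then load the production model with:
# import mlflow.pyfunc
# model = mlflow.pyfunc.load_model(uri)
```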
As Chao mentioned, we have the LENS platform, and that is what requires us to improve data completeness. The application on the right is showing wells in 3D; the improved data sets feed the system we call deep view, which displays the wells in 3D. As Chao mentioned before, we have 314,000 wells, and each of them has more than 20 attributes with missing values. The first step is the neighbor-discovery process, which we carry out with GeoSpark, and it takes less than 10 minutes to generate 144 million neighbor combinations, a massive number of combinations.
Without GeoSpark, a scenario this large would not be possible to implement. Once the neighbor discovery was finished, we used the neighboring information to fill the nulls. In our data set for this exercise, for experimental purposes, 36% of one attribute, which we call vertical depth, had null values. After this first step of neighbor-discovery filling, that null percentage was reduced to 9.5%.
For another parameter, we purposely started this data set at a 46% null percentage, and after the first step, the null percentage was reduced to 14%. Then we pass this data to the next step, collaborative AI, to fill the nulls. After we shape the data, it has 4.7 million records; we use 80% for training and 20% for testing. The model takes about four to five minutes to train with grid search and cross validation on Databricks. After this step, the null percentage is reduced to 0%. The error from this process, which we were able to measure for the second stage, is about 7% to 18% for key attributes, so these are very decent results.
So our approach, I think, is very helpful for filling nulls in big data with distributed computing, and we would like to share some tips we learned from our exercise. As always, the devil is in the details, so we would like to share the lessons we learned; hopefully they help if you want to try out our approach, and here are some tips you might want to keep in mind. First of all, for any AI model, if you want to get good results, make sure you remove outliers from your training data set; otherwise the model will not be generic enough to provide good predictions. Also, in our exercise we experimented with normalizing the values versus not normalizing, and it turns out it doesn't make any difference. So keep that in mind; it'll save you some time by not normalizing the data.
There is also always a lot of noise in the data; in oil data sets, there's a huge amount of noise. If you want collaborative AI to predict your null values and you treat each object individually, mapping the user ID one-to-one to the object ID, then what you predict is probably just that noise. To deal with the noise, a very effective way we found is to group the objects into user-ID groups. In this way you can effectively reduce the noise in the data, and the prediction results from the model are much better.
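One simple way to realize this grouping (an illustrative choice, not necessarily the exact scheme we used) is to bucket wells by coarsened coordinates, so the ALS "user" becomes a group of nearby wells rather than a single noisy well:

```python
def group_id(lat, lon, cell=0.5):
    """Snap coordinates to a coarse grid cell; wells in the same
    cell share one ALS user ID, averaging out per-well noise."""
    return (round(lat / cell) * cell, round(lon / cell) * cell)

a = group_id(31.92, -102.14)
b = group_id(31.89, -102.11)  # a nearby well: same group as a
c = group_id(35.00, -98.00)   # a distant well: different group
```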
We also found that having more attributes leads to higher model accuracy. The last tip is, due to the noise in the data, if you want to improve the accuracy of the model, you want to make sure the noise is handled properly, and even after you take measures to reduce it, some noise is still there. In particular, independent parameters, for example the lateral length or the height, are predicted by the model with higher accuracy than the others. Derived, dependent parameters, for example the sum of the total depths or the sum of the total lengths, carry more noise in the data, so their prediction results tend to be worse than for the independent attributes.
Yeah, we would like to hear from you if you have any questions. Before that, we would like to ask a favor. Hopefully everybody enjoyed the talk, and we really hope it can inspire a lot of new ideas and new work from you. Please provide feedback for us; we'd love to hear from you, and if you can, please give us the highest ratings for this session. Of course, the rating is just a number; most important, as I said, we would like to hear your thoughts, your ideas, and what you think we can do more with this work. And now, back to the question section: please feel free to share any questions you may have online. Thank you so much for your time and attention, thanks.

Chao Yang

Chao Yang is an avid big data professional, focusing on big data engineering and applying machine learning/deep learning technologies in solving business and engineering problems. He started his caree...

Yanyan Wu

Yanyan Wu is an innovative technology leader who had years of engineering design, R&D management, product portfolio management, and software development experience before shifting to data science worl...