Audience Modeling With Apache Spark ML Pipelines

This is a guest blog from Eugene Zhulenev on his experiences with Engineering Machine Learning and Audience Modeling at Collective.

At Collective, we heavily rely on machine learning and predictive modeling to run our digital advertising business. All decisions about what ad to show at this particular time to this particular user are made by machine learning models (some of them are in real-time while some of them are offline).

We have a lot of projects that uses machine learning, the common name for all of them can be Audience Modeling, as they all are trying to predict audience conversion (CTR, Viewability Rate, etc…) based on browsing history, behavioral segments, and other type of predictors.

For most of our new development, we use Apache Spark and MLLib. However, while it is an awesome project, we found that there are some widely used tools and libraries that are missing in Spark. To add those missing features that we would really like to have in Spark, we created Spark Ext (Spark Extensions Library).

I’m going to show a simple example of combining Spark Ext with Spark ML pipelines for predicting user conversions based on geo and browsing history data.

Predictors Data

I’m using a dataset with 2 classes, that will be used for solving classification problem (user converted or not). It’s created with dummy data generator so that these 2 classes can be easily separated. It’s pretty similar to real data that usually is available in digital advertising.

Browsing History Log

History of websites that were visited by a user.

Geo Location Log

Latitude/Longitude impression history.

Transforming Predictors Data

As you can see the predictors data (sites and geo) is in long format. Each cookie has multiple rows associated with it; in general, it is not a good fit for machine learning. We’d like cookie to be a primary key while all other data should form the feature vector.

Gather Transformer

Inspired by R tidyr and reshape2 packages, we convert a long DataFrame with values for each key into a wide DataFrame and apply an aggregation function if the single key has multiple values.

val gather = new Gather()
.setPrimaryKeyCols("cookie")
.setKeyCol("site")
.setValueCol("impressions")
.setValueAgg("sum") // sum impression by key
.setOutputCol("sites")
val gatheredSites = gather.transform(siteLog)

Google S2 Geometry Cell Id Transformer

The S2 Geometry Library is a spherical geometry library, very useful for manipulating regions on the sphere (commonly on Earth) and indexing geographic data. Basically, it assigns a unique cell id for each region on the earth.

For example, you can combine S2 transformer with Gather to convert lat/lon to K-Vpairs, where the key will be S2 cell id. Depending on a level you can assign all people in Greater New York area (level = 4) into one cell, or you can index them block by block (level = 12).

// Transform lat/lon into S2 Cell Id
val s2Transformer = new S2CellTransformer()
.setLevel(5)
.setCellCol("s2_cell")

// Gather S2 CellId log
val gatherS2Cells = new Gather()
.setPrimaryKeyCols("cookie")
.setKeyCol("s2_cell")
.setValueCol("impressions")
.setOutputCol("s2_cells")

val gatheredCells = gatherS2Cells.transform(s2Transformer.transform(geoDf))

Audience Modeling With Apache Spark ML Pipelines

Predictors Data

Browsing History Log

Geo Location Log

Transforming Predictors Data

Gather Transformer

Google S2 Geometry Cell Id Transformer

Assembling Feature Vector

Gather Encoder

Spark ML Pipelines

Put It All together with Spark ML Pipelines

Conclusion

What's next?

Innovators Unveiled: Announcing the Databricks Generative AI Startup Challenge Winners!

Introducing Databricks Generative AI Partner Accelerators and RAG Proof of Concepts

Predictors Data

Browsing History Log

Geo Location Log

Transforming Predictors Data

Gather Transformer

Google S2 Geometry Cell Id Transformer

Assembling Feature Vector

Gather Encoder

Spark ML Pipelines

Put It All together with Spark ML Pipelines

Conclusion

Never miss a Databricks post

Sign up

What's next?

Innovators Unveiled: Announcing the Databricks Generative AI Startup Challenge Winners!

Introducing Databricks Generative AI Partner Accelerators and RAG Proof of Concepts