Building a Scalable Record Linkage System with Apache Spark, Python 3, and Machine Learning

MassMutual has hundreds of millions of customer records scattered across many systems. There is no easy way to link a given customer’s information across all these systems to build a comprehensive customer profile. Building such a profile has important applications in many areas of MassMutual’s business, from marketing to underwriting.

To address this issue, MassMutual built Splinkr, an internal solution that links customer records across these disparate systems in a flexible and scalable way.

In this talk we will share our experience building Splinkr with Apache Spark, Python 3, and simple machine learning techniques. We’ll cover the good parts of our experience working with this stack as well as the bad, from working with clean APIs and readily available libraries to dealing with nasty Spark bugs, deployment difficulties, and bad training data.

Session hashtag: #Py6SAIS



About Nicholas Chammas

Nick has been working with (and occasionally contributing to) Apache Spark since 0.9.0. He currently works on the Data Engineering team at MassMutual and has had stints at Databricks, the Recurse Center, and Turbine/WB Games. Nick is also the author of Flintrock, a command-line tool for launching Apache Spark clusters on EC2.

About Edward Pantridge

Edward (Eddie) is a data scientist and artificial intelligence researcher specializing in genetic programming and neuroevolution. During his time with the data science team at MassMutual, he has developed predictive models and interactive data visualizations on behalf of business stakeholders. As an active member of the Hampshire College Computational Intelligence Laboratory, he has built software that uses machine learning for program synthesis.