Nicholas Chammas

Data Engineer, MassMutual

Nick has been working with (and occasionally contributing to) Apache Spark since 0.9.0. He currently works on the Data Engineering team at MassMutual and has had stints at Databricks, the Recurse Center, and Turbine/WB Games. Nick is also the author of Flintrock, a command-line tool for launching Apache Spark clusters on EC2.

Past sessions

MassMutual has hundreds of millions of customer records scattered across many systems. There is no easy way to link a given customer’s information across all these systems to build a comprehensive customer profile. Building such a profile has important applications in many areas of MassMutual’s business, from marketing to underwriting.

To address this issue, MassMutual built Splinkr, an internal solution that links customer records across these disparate systems in a flexible and scalable way.

In this talk, we will share our experience building Splinkr with Apache Spark, Python 3, and simple machine learning techniques. We’ll cover the good parts of our experience working with this stack as well as the bad, from working with clean APIs and readily available libraries to dealing with nasty Spark bugs, deployment difficulties, and bad training data.

Session hashtag: #Py6SAIS

Summit East 2016

Flintrock: A Faster, Better spark-ec2

February 17, 2016 04:00 PM PT

spark-ec2 is a handy little tool for spinning up Spark clusters on EC2, but it has a few frustrating problems that are difficult to solve within its current architecture. In this talk, Nick will give an overview of Flintrock, a single-purpose command-line tool for launching and interacting with Spark clusters on EC2. Flintrock is open source and aims to be the spiritual successor to spark-ec2.