Daniel van der Ende is currently a data engineer in the ING Wholesale Banking Advanced Analytics team. Here, he works on high performance distributed computation with Spark, empowering data scientists by helping them run their models on very large datasets performantly. He is an Apache Spark and Apache Airflow contributor. Before his work at ING, he studied Computer Science at Delft University of Technology where he obtained his MSc. in 2015.
June 5, 2018 05:00 PM PT
ING Bank is a Dutch multinational, multi-product bank that offers banking services to 33 million retail and commercial customers in over 40 countries. At this scale, ING naturally faces a multitude of data consolidation tasks across its disparate sources. A common consolidation problem is fuzzy name matching: given a name (streaming) or a list of names (batch), find the most similar name(s) in a different list.
Popular methods such as Levenshtein distance are not appropriate because of their time complexity and the sheer volume of names involved. In this talk, we will introduce how we use a Spark custom ML pipeline and Structured Streaming to build fuzzy name matching products in batch and streaming. The resulting system can match 8,000 names per second against a list of 10 million names, using a ten-node cluster. Firstly, we will give an introduction to the name matching problem.
Secondly, we will explain why the Levenshtein distance approach is limited, and demonstrate a faster approach: token-based cosine similarity matching. Next, we will show how an ML pipeline helps to build an elegant solution. Here, we will dive into the details of each stage, including customized preprocessing, tokenization, term frequency, customized inverse document frequency, customized cosine similarity with distributed sparse matrix multiplication, and a customized supervision stage.
Finally, we will show how we deploy the ML pipeline within a batch data pipeline, and additionally as a fuzzy search engine in a streaming manner. The main conclusions will be: (1) a Spark custom ML pipeline provides a powerful way to handle complicated data science problems; (2) a uniform ML pipeline can serve both batch and streaming products easily from the same codebase.
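To make the token-based cosine similarity idea from the abstract concrete, here is a minimal single-machine sketch in plain Python. It is not the pipeline presented in the talk: the function names, the whitespace tokenizer, the smoothed IDF formula, and the tiny reference list are all assumptions chosen for illustration, and the brute-force loop stands in for the distributed sparse matrix multiplication the talk describes.

```python
import math
from collections import Counter


def tokenize(name):
    # Assumed preprocessing: lowercase and split on whitespace.
    return name.lower().split()


def build_idf(names):
    # Smoothed inverse document frequency over the reference list
    # (an assumed formula, not the customized IDF from the talk).
    n = len(names)
    df = Counter()
    for name in names:
        df.update(set(tokenize(name)))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf_vector(name, idf):
    # Sparse, L2-normalized TF-IDF vector stored as a dict.
    # Tokens absent from the reference vocabulary are dropped.
    tf = Counter(tokenize(name))
    vec = {tok: count * idf[tok] for tok, count in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def best_match(query, names, idf):
    # Cosine similarity of normalized vectors is their sparse dot product.
    q = tfidf_vector(query, idf)
    scored = []
    for name in names:
        v = tfidf_vector(name, idf)
        score = sum(w * v.get(tok, 0.0) for tok, w in q.items())
        scored.append((score, name))
    return max(scored)


# Hypothetical reference list, for illustration only.
reference = [
    "ING Bank N.V.",
    "Delft University of Technology",
    "Apache Software Foundation",
]
idf = build_idf(reference)
score, match = best_match("technology university delft", reference, idf)
print(match, round(score, 3))
```

Because each name is reduced to a bag of weighted tokens, word order does not affect the score, and a query only needs to be compared on the tokens it shares with a candidate. That sparsity is what allows the comparison to be expressed as a sparse matrix product rather than a character-level edit distance computed for every pair of names.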
Session hashtag: #MLSAIS17