About Nube Technologies
Nube Technologies builds business applications that improve decision making through better data. Nube’s fuzzy matching product, Reifier, helps companies get a holistic view of their enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, strengthens security and risk management, and improves the consolidation and reporting of business data. We help our customers build better, more effective models by ensuring that their underlying master data is accurate.
Why Apache Spark
Data matching, whether within a single source or across sources, is a core problem faced by almost every enterprise, and we wanted to create a really smart way to solve it. Solving data matching problems is even more difficult because most data suffers from poor quality, the foremost reasons being errors and omissions during data collection, multi-field records and large data sizes.
The problem is inherently quadratic: in the worst case, every record has to be compared with every other record. Although there are techniques to reduce the number of comparisons and boost speed, applying them intelligently to unknown data is challenging. In building Reifier, our aim has been to deal with various kinds of data in different domains, be it customer information, product catalogs, organizations or any other variety of data.
We also wanted to build a system that was lightning fast as well as massively scalable with respect to the huge volumes of data seen by the modern-day enterprise. On the development side, our wishlist included a friendly API, a robust and scalable architecture, an easy-to-use and well-documented framework, and built-in job dependency management.
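To make the quadratic cost concrete, here is a minimal sketch in plain Python of one common way to cut comparisons, usually called blocking: records are grouped by a cheap key and only records that share a key are compared. This is an illustration under stated assumptions, not Reifier's actual method. The `blocking_key` function (first letter of the name) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocking_key(record):
    # Hypothetical key for illustration: first letter of the name, lowercased.
    return record["name"].strip().lower()[:1]


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up sample data.
records = [
    {"name": "Acme Corp"},
    {"name": "ACME Corporation"},
    {"name": "Beta Ltd"},
    {"name": "Beta Limited"},
    {"name": "Gamma Inc"},
    {"name": "acme corp."},
]

pairs = candidate_pairs(records)
print(len(pairs), "candidate pairs out of", all_pairs_count(len(records)))
```

The trade-off is visible even in this toy: a key that is too coarse saves little, while a key that is too fine (or sensitive to typos) can put true matches in different blocks, which is why choosing it well for unknown data is hard.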
How we use Spark
When we evaluated Spark, we were blown away by its speed, power and functionality. Spark’s support for machine learning helped us create a supervised learning product that learns combined similarity rules across the different fields of a record entirely from labeled positive and negative samples. We can therefore use the same product across different data types easily.
Our algorithms sit atop the base Spark framework, and by using Spark’s custom partitioning we often compare fewer than 0.5% of all possible pairs, which is a big performance boost. Our commitment to Spark was bolstered when Reifier got certified on Spark.
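As a rough illustration of learning a combined similarity rule from labeled pairs, the sketch below turns each record pair into one similarity score per field and fits a small logistic regression on those scores. It is a toy under stated assumptions, not Reifier's model, and it uses plain Python rather than Spark's machine learning library so that it runs on its own. The field names, the sample records, and the use of `difflib.SequenceMatcher` as the per-field similarity are all assumptions made for the example.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("name", "city")  # hypothetical record fields


def field_similarities(a, b):
    """One similarity score in [0, 1] per field."""
    return [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in FIELDS
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def match_probability(model, a, b):
    weights, bias = model
    x = field_similarities(a, b)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up labeled pairs: 1 = same entity, 0 = different entities.
labeled = [
    ({"name": "Acme Corp", "city": "Pune"},
     {"name": "ACME Corporation", "city": "Pune"}, 1),
    ({"name": "Beta Ltd", "city": "Mumbai"},
     {"name": "Beta Limited", "city": "Mumbai"}, 1),
    ({"name": "Acme Corp", "city": "Pune"},
     {"name": "Zenith Works", "city": "Delhi"}, 0),
    ({"name": "Beta Ltd", "city": "Mumbai"},
     {"name": "Orion Foods", "city": "Chennai"}, 0),
]

samples = [field_similarities(a, b) for a, b, _ in labeled]
labels = [y for _, _, y in labeled]
model = train(samples, labels)

unseen_a = {"name": "Gamma Inc", "city": "Pune"}
unseen_b = {"name": "Gamma Inc.", "city": "Pune"}
print(round(match_probability(model, unseen_a, unseen_b), 3))
```

Because the model only sees per-field similarity scores, the same code applies to other record types by changing `FIELDS` and supplying new labeled pairs, which is the property the paragraph above describes.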
Nube and Spark Going Forward
Using Spark has clearly been the best architecture decision we made, and we are very happy to be part of the thriving Spark community. We are now looking forward to exploiting other Spark functionality to provide real-time distributed fuzzy matching.