Jacques Doux is currently a data scientist at Elsevier working on entity recognition. He focuses on content deduplication, as well as copyright compliance for content sharing.
He has developed strong domain knowledge and technical expertise since joining Elsevier in 2011, first in the marketing team and then in the web analytics team.
Before joining Elsevier, Jacques trained as a structural biochemist with a special interest in protein-lipid interactions.
October 15, 2019 05:00 PM PT
A recommender story: improving backend data quality while reducing costs
Information overload is one of the biggest challenges academics face every day when trying to find the right knowledge to advance science. With around 7,000 research articles published every day, how do you find the right ones?
Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With so many data sources and signals available, data science and big data engineering offer an ideal opportunity to deliver more value to researchers.
Here we will focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information through features such as a crowdsourced collection of academic-related documents (Catalogue) and various personalized recommender systems. MendeleySuggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalized recommenders are powered by Mendeley Catalogue, which clusters 2 billion records into the correct canonical records, together with state-of-the-art algorithms and big data solutions (e.g. Spark).
In the past few years, we noticed that as our content grew, the quality of the canonical records started to drift because of scalability issues. As a result, we faced clustering accuracy problems that, in turn, also affected the recommenders. In this talk we will highlight how we re-architected the way Mendeley Catalogue is built to improve its scalability and accuracy. In addition, we will show how migrating from Hadoop MapReduce to Spark has helped us reduce costs and improve maintainability.