Learning to Rank Datasets for Search


Learning-to-rank methods can learn automatically from user interaction instead of relying on manually prepared labeled data. Learning to rank, also referred to as machine-learned ranking, is an application of machine learning (typically supervised, semi-supervised, or reinforcement learning) concerned with building ranking models for information retrieval. It has been successfully applied in building intelligent search engines, but has yet to show up in dataset search.

Dataset search is ripe for innovation with learning to rank, specifically by automating the process of index construction. Oscar will recap previous presentations on dataset search and introduce learning to rank as a way to automate relevance scoring of dataset search results. He will also demo a dataset search engine that uses an index constructed automatically with learning to rank on Elasticsearch and Spark.

Oscar will explain the motivation and use case for learning to rank in dataset search, focusing on why it is interesting to rank datasets through machine-learned relevance scoring and how to improve indexing efficiency by tapping into user interaction data from clicks. Dataset search and learning to rank are IR and ML topics that should interest Spark Summit attendees looking for use cases and new opportunities to organize and rank datasets in data lakes, making them searchable and relevant to users.

In preparation for this talk, it is recommended that attendees watch the two previous talks on dataset search from prior Spark Summit events, as they build up to the present talk:

[1] https://spark-summit.org/east-2017/events/building-a-dataset-search-engine-with-spark-and-elasticsearch/

[2] https://spark-summit.org/eu-2016/events/spark-cluster-with-elasticsearch-inside/

Session hashtag: #SAISDS8

About Oscar Castañeda-Villagrán

Oscar studied Computer Science at Delft University of Technology. He is now a Data Scientist at Xoom, a PayPal service. Oscar is interested in Data Management, Dataset Search, Online Learning to Rank, and Apache Spark.