Challenges of Multileaved Comparison in Practice: Lessons from NTCIR-13 OpenLiveQ Task

Makoto P. Kato (Kyoto Univ.), Tomohiro Manabe, Sumio Fujita, Akiomi Nishida and Takehiro Yamamoto (Kyoto Univ.)

The 27th ACM International Conference on Information and Knowledge Management (CIKM 2018) short paper, 2018/10


Information Retrieval

This paper discusses challenges of an online evaluation technique, multileaved comparison, based on the analysis of evaluation results in a community question-answering (cQA) search service. The NTCIR-13 OpenLiveQ task offered a shared task in which participants addressed an ad-hoc retrieval task in a cQA service and evaluated their rankers by multileaved comparison, which combines multiple rankings into a single search result page and simultaneously evaluates the different rankings based on users’ clicks on that page. Since the number of search result impressions during the evaluation period might not suffice to evaluate a hundred rankers, we conducted the online evaluation only for rankers that achieved high performance in offline evaluation. The analysis of evaluation results showed that offline and online evaluation results did not fully agree, and that a large number of users’ clicks were necessary to find a statistically significant difference for every ranker pair. To cope with these problems in large-scale multileaved comparison, we propose a new experimental design that evaluates all the rankers online but intensively tests only the top-k rankers. Simulation-based experiments demonstrated that the Copeland counting algorithm could achieve high top-k recall in the top-k identification problem for multileaved comparison.
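The top-k identification step described above can be sketched with Copeland counting: each ranker's score is the number of other rankers it beats in pairwise comparisons, and the k highest-scoring rankers are selected for intensive testing. The following is a minimal illustrative sketch, not the paper's implementation; the function name and the pairwise win-count input format (aggregated, e.g., from click outcomes of interleaved impressions) are assumptions.

```python
from itertools import combinations

def copeland_top_k(wins, k):
    """Select the top-k rankers by Copeland score.

    wins[(i, j)] is how many interleaved impressions ranker i
    beat ranker j on (e.g., by receiving more clicks).
    """
    rankers = sorted({r for pair in wins for r in pair})
    scores = {r: 0.0 for r in rankers}
    for i, j in combinations(rankers, 2):
        w_ij = wins.get((i, j), 0)
        w_ji = wins.get((j, i), 0)
        if w_ij > w_ji:
            scores[i] += 1          # i wins the pairwise comparison
        elif w_ji > w_ij:
            scores[j] += 1          # j wins the pairwise comparison
        else:
            scores[i] += 0.5        # tie: split the point
            scores[j] += 0.5
    # Highest Copeland score first; ties broken by ranker id.
    return sorted(rankers, key=lambda r: (-scores[r], r))[:k]

# Toy example: A beats B and C, B beats C.
wins = {("A", "B"): 7, ("B", "A"): 3,
        ("A", "C"): 8, ("C", "A"): 2,
        ("B", "C"): 6, ("C", "B"): 4}
print(copeland_top_k(wins, 2))  # → ['A', 'B']
```

In this design, all rankers keep collecting click data online, but only the k rankers with the highest Copeland counts receive the bulk of impressions, which addresses the insufficient-impressions problem noted in the abstract.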
