Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content - Yahoo! JAPAN Research and Development


Conference (International): Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content

Kosuke Kurihara*, Yoshiyuki Shoji*, Sumio Fujita, Martin J. Dürst* (* Aoyama Gakuin University)

The 21st International Conference on Information Integration and Web-based Applications & Services (iiWAS2019)

2019.12.2

This paper proposes a new method of supplementing the context of short sentences during the training phase of Doc2Vec. As CGM (Consumer Generated Media) and SNS sites have become widespread, calculating the similarity between a given query and a short sentence has become increasingly important. For example, on a movie review site, a search for the query “sad” should find actual expressions such as “I needed a handkerchief”. Doc2Vec is one of the most widely used methods for vectorizing queries and sentences. However, Doc2Vec often exhibits low accuracy when the training data consists of short sentences, because they lack context. We modified Doc2Vec based on the hypothesis that other posts on the same topic (e.g., reviews of the same movie on an online movie review site) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of Doc2Vec with the PV-DM model; this model estimates the next term from a few previous terms and the context. A model trained with target-topic IDs (item IDs) vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. The experimental results demonstrate that our new method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.
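The difference between sentence-ID context and target-topic-ID context can be illustrated with a minimal sketch of how the training data might be tagged. This is not the authors' implementation: the toy reviews, the `movie_1`/`movie_2` identifiers, and the helper names `tag_by_sentence` and `tag_by_topic` are hypothetical, and the `TaggedDocument` tuple is defined locally, only mirroring the (words, tags) shape that common Doc2Vec implementations accept.

```python
from collections import namedtuple

# Local stand-in for the (words, tags) pair that Doc2Vec trainers typically
# consume; defined here so the sketch needs only the standard library.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical toy data: (movie ID, one short review sentence).
reviews = [
    ("movie_1", "I needed a handkerchief"),
    ("movie_1", "cried through the whole ending"),
    ("movie_2", "laughed until my sides hurt"),
]


def tag_by_sentence(reviews):
    """Baseline tagging: every sentence gets its own unique context ID."""
    return [
        TaggedDocument(text.lower().split(), [f"sent_{i}"])
        for i, (_, text) in enumerate(reviews)
    ]


def tag_by_topic(reviews):
    """Topic-aware tagging: sentences about the same item share one context ID."""
    return [
        TaggedDocument(text.lower().split(), [topic_id])
        for topic_id, text in reviews
    ]


sentence_docs = tag_by_sentence(reviews)
topic_docs = tag_by_topic(reviews)

for doc in topic_docs:
    print(doc.tags, doc.words)
```

With topic-aware tagging, the two short reviews of `movie_1` contribute to the same context vector, so each sentence is trained against a richer shared background than it could supply alone. Either list could then be passed to a PV-DM trainer (for example, gensim's `Doc2Vec` with `dm=1`); that choice of library is an assumption, since the abstract does not name one.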

Paper: Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content (external site)