Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content
Kosuke Kurihara*, Yoshiyuki Shoji*, Sumio Fujita, Martin J. Dürst* (* Aoyama Gakuin University)
The 21st International Conference on Information Integration and Web-based Applications & Services, 2019/12
Natural Language Processing, Information Retrieval, Data Science
- This paper proposes a new method for supplementing the context of short sentences in the training phase of Doc2Vec. As CGM (Consumer Generated Media) and SNS sites have become widespread, calculating the similarity between a given query and a short sentence has become increasingly important. For example, on a movie review site, a search for the query “sad” should find actual expressions such as “I needed a handkerchief”. Doc2Vec is one of the most widely used methods for vectorizing queries and sentences. However, Doc2Vec often exhibits low accuracy when the training data consists of short sentences, because they lack context. We modified Doc2Vec based on the hypothesis that other posts on the same topic (i.e., reviews of the same movie on an online movie review site) may share the same background. Our method uses target-topic IDs instead of sentence IDs as the context in the training phase of Doc2Vec with the PV-DM model, which estimates the next term from a few previous terms and the context. The model trained with target-topic IDs vectorizes a sentence more accurately than a model trained with sentence IDs. We conducted a large-scale experiment using 1.2 million movie review posts and a crowdsourcing-based evaluation. The experimental results demonstrate that our new method achieves higher precision and nDCG than previous Doc2Vec variants and traditional topic modeling methods.
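The core idea in the abstract, using a shared target-topic ID rather than a unique sentence ID as the Doc2Vec context, can be illustrated with a minimal sketch of the data preparation step. This is not the authors' implementation: the `TaggedPost` container, the function names, the whitespace tokenizer, and the sample posts are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical container with the (words, tags) shape that Doc2Vec
# toolkits typically expect for one training document.
TaggedPost = namedtuple("TaggedPost", ["words", "tags"])


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real pipeline would
    # likely use a proper tokenizer or morphological analyzer).
    return text.lower().split()


def tag_by_sentence(posts):
    """Baseline: each post gets its own unique sentence ID as context."""
    return [
        TaggedPost(tokenize(text), [f"sent_{i}"])
        for i, (_topic_id, text) in enumerate(posts)
    ]


def tag_by_topic(posts):
    """Topic-aware variant: posts about the same item share one context ID."""
    return [
        TaggedPost(tokenize(text), [f"topic_{topic_id}"])
        for topic_id, text in posts
    ]


# Made-up (topic_id, text) pairs standing in for movie review posts.
posts = [
    ("movie_a", "I needed a handkerchief"),
    ("movie_a", "Cried all the way home"),
    ("movie_b", "Laughed from start to finish"),
]

baseline = tag_by_sentence(posts)
proposed = tag_by_topic(posts)

# Number of distinct context IDs under each tagging scheme.
n_baseline_contexts = len({p.tags[0] for p in baseline})
n_proposed_contexts = len({p.tags[0] for p in proposed})
```

With sentence IDs, every short post is an isolated context; with topic IDs, the two `movie_a` posts contribute to the same context vector. Assuming the gensim library is used, pairs of this shape correspond to `TaggedDocument(words, tags)` and could be passed to `Doc2Vec` with `dm=1` to select the PV-DM model.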
Target-Topic Aware Doc2Vec for Short Sentence Retrieval from User Generated Content (External Site Link)