Text similarity function based on word embeddings for short text analysis - Yahoo! JAPAN Research and Development

Publications

Conference (International): Text similarity function based on word embeddings for short text analysis

Adrián Jiménez Pascual (Tokyo Univ.), Sumio Fujita

18th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2017)

2017.4.18

We present the Contextual Specificity Similarity (CSS) measure, a new document similarity measure based on word embeddings and inverse document frequency. The idea behind the CSS measure is to assign higher scores to documents that contain words with close embeddings and similar frequency of usage. We compare CSS with several text classification methods to demonstrate its accuracy and utility in k-nearest neighbour classification of short texts. We experimentally confirmed that CSS performs excellently in short text classification, as intended, outperforming traditional methods as well as Word Mover's Distance (WMD), the most recently proposed method.
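
The abstract does not state the CSS formula, so the following is only a minimal sketch of the general idea it describes: combining word-embedding closeness with inverse document frequency and using the resulting score in k-nearest neighbour classification. It is not the CSS measure itself. The toy two-dimensional vectors, the smoothed IDF variant, the symmetric best-match averaging, and the names `EMBEDDINGS`, `idf_table`, `similarity`, and `knn_classify` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy 2-d word vectors (hypothetical values, not trained embeddings).
EMBEDDINGS = {
    "cat": (1.0, 0.1),
    "kitten": (0.9, 0.2),
    "dog": (0.8, 0.4),
    "stock": (0.1, 1.0),
    "market": (0.2, 0.9),
    "the": (0.5, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def idf_table(corpus):
    """Smoothed inverse document frequency per word (an assumed variant)."""
    n = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}


def similarity(doc_a, doc_b, idf):
    """IDF-weighted mean of each word's best cosine match, averaged both ways."""

    def directed(src, dst):
        targets = [EMBEDDINGS[w] for w in dst if w in EMBEDDINGS]
        num = den = 0.0
        for word in src:
            if word not in EMBEDDINGS:
                continue  # skip out-of-vocabulary words
            weight = idf.get(word, 1.0)
            best = max((cosine(EMBEDDINGS[word], t) for t in targets), default=0.0)
            num += weight * best
            den += weight
        return num / den if den else 0.0

    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))


def knn_classify(query, labelled, idf, k=3):
    """Majority vote among the k labelled documents most similar to the query."""
    ranked = sorted(labelled, key=lambda item: similarity(query, item[0], idf), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Tiny labelled set of tokenised short texts (illustrative only).
TRAIN = [
    (["the", "cat"], "animals"),
    (["dog", "kitten"], "animals"),
    (["the", "stock", "market"], "finance"),
    (["stock"], "finance"),
]
IDF = idf_table([doc for doc, _ in TRAIN])

print(knn_classify(["kitten"], TRAIN, IDF, k=1))
print(knn_classify(["market"], TRAIN, IDF, k=1))
```

In this sketch, rarer words receive larger IDF weights, so a close embedding match on a specific word contributes more to the score than a match on a common word such as "the".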