Text similarity function based on word embeddings for short text analysis
Adrián Jiménez Pascual (Tokyo Univ.), Sumio Fujita
CICLing 2017, 2017/4
Natural Language Processing, Information Retrieval, Machine Learning
- We present the Contextual Specificity Similarity (CSS) measure, a new document similarity measure based on word embeddings and inverse document frequency. The idea behind CSS is to assign higher scores to documents whose words have close embeddings and similar frequency of usage. The paper compares CSS with several text classification methods to demonstrate its accuracy and utility in k-nearest neighbour classification of short texts. Experiments confirm that CSS performs very well on short text classification, as intended, outperforming traditional methods as well as Word Mover's Distance (WMD), the most recently proposed method.
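The abstract does not state the CSS formula, so the following is only an illustrative sketch of the general idea it describes: combining embedding closeness with inverse document frequency and using the resulting score for k-nearest neighbour classification. It is not the authors' definition of CSS. The scoring rule (an IDF-weighted average of each word's best cosine match), the function names, the toy two-dimensional embeddings, and the labelled examples are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every word in a tokenized corpus."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def directed_score(src, dst, emb, idf):
    """IDF-weighted mean of each src word's best cosine match among dst words."""
    num = den = 0.0
    for w in src:
        if w not in emb:
            continue  # skip out-of-vocabulary words
        best = max((cosine(emb[w], emb[x]) for x in dst if x in emb), default=0.0)
        weight = idf.get(w, 1.0)
        num += weight * best
        den += weight
    return num / den if den else 0.0


def similarity(a, b, emb, idf):
    """Symmetric score: average of the two directed scores (an assumed rule, not CSS)."""
    return 0.5 * (directed_score(a, b, emb, idf) + directed_score(b, a, emb, idf))


def knn_predict(query, train, emb, idf, k=1):
    """Majority label among the k training documents most similar to the query."""
    ranked = sorted(train, key=lambda item: similarity(query, item[0], emb, idf),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy embeddings and labelled short texts, for illustration only.
EMB = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [0.8, 0.3],
    "stock": [0.1, 1.0], "market": [0.2, 0.9], "price": [0.3, 0.8],
}
TRAIN = [
    (["cat", "pet"], "animals"),
    (["dog", "pet"], "animals"),
    (["stock", "market"], "finance"),
    (["market", "price"], "finance"),
]
IDF = idf_table([doc for doc, _ in TRAIN])

print(knn_predict(["dog", "cat"], TRAIN, EMB, IDF, k=3))
```

In practice the toy vectors would be replaced by pretrained word embeddings and the IDF table would be computed over the full training corpus; the weighting and matching rule would follow the definition given in the paper itself.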