Perplexity on Reduced Corpora - Yahoo! JAPAN R&D


CONFERENCE (INTERNATIONAL) Perplexity on Reduced Corpora

Hayato Kobayashi

the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014)

June 01, 2014

This paper studies, from a theoretical standpoint, the common practice of removing low-frequency words from a corpus to reduce computational costs. Based on the assumption that a corpus follows Zipf’s law, we derive trade-off formulae for the perplexity of k-gram models and topic models with respect to the size of the reduced vocabulary. In addition, we show the approximate behavior of each formula under certain conditions. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
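To make the trade-off concrete, the following is a minimal sketch, not the paper's derivation: it samples a synthetic corpus whose word frequencies follow Zipf's law, keeps only the most frequent words, and measures unigram perplexity as the vocabulary shrinks. The function names (`zipf_corpus`, `reduced_unigram_perplexity`), the corpus sizes, and the choice to spread the probability mass of removed words uniformly over them are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def zipf_corpus(vocab_size, num_tokens, seed=0):
    """Sample word ids 0..vocab_size-1 with P(rank r) proportional to 1/r."""
    rng = random.Random(seed)
    weights = [1.0 / r for r in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=num_tokens)


def reduced_unigram_perplexity(corpus, keep):
    """Unigram perplexity on `corpus` when only the `keep` most frequent
    words retain their own probability.

    Assumption of this sketch: the total probability mass of the removed
    words is shared uniformly among them, so that results for different
    vocabulary sizes are evaluated on the same set of tokens.
    """
    counts = Counter(corpus)
    total = len(corpus)
    ranked = [w for w, _ in counts.most_common()]
    kept = set(ranked[:keep])
    num_removed = len(ranked) - len(kept)
    removed_mass = sum(counts[w] for w in ranked[keep:]) / total

    log_sum = 0.0
    for w, c in counts.items():
        # Kept words use their empirical probability; removed words share one.
        p = c / total if w in kept else removed_mass / num_removed
        log_sum += c * math.log(p)
    return math.exp(-log_sum / total)


if __name__ == "__main__":
    corpus = zipf_corpus(vocab_size=2000, num_tokens=50000)
    for keep in (2000, 1000, 500, 100, 10):
        ppl = reduced_unigram_perplexity(corpus, keep)
        print(f"kept vocabulary = {keep:5d}  perplexity = {ppl:.1f}")
```

In this toy setting the model is fitted and evaluated on the same corpus, so perplexity can only stay equal or grow as more words are removed; the size of that growth relative to the vocabulary saved is the kind of trade-off the paper analyzes formally.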


Natural Language Processing
Machine Learning

PDF : Perplexity on Reduced Corpora