# Classifying Documents with Poisson Mixtures

### Abstract

Although the Poisson distribution and two well-known Poisson mixtures (the negative binomial and K-mixture distributions) have been used to model texts for more than 15 years, their application to building generative probabilistic text classifiers has rarely been reported, and the available information on applying such models to classification therefore remains fragmentary and even contradictory. In this study, we construct generative probabilistic text classifiers with these three distributions and perform classification experiments on three standard datasets in a uniform manner to examine the performance of the classifiers. The results show that these classifiers perform much better than the standard multinomial naive Bayes classifier, provided that document length normalization is appropriately taken into account. Furthermore, contrary to our intuitive expectation, the classifier with the Poisson distribution performs best among all the examined classifiers, even though the Poisson model gives a cruder description of term occurrences in real texts than the K-mixture and negative binomial models do. A possible interpretation of the superiority of the Poisson model is given in terms of a trade-off between fit and model complexity.
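To make the idea of a generative Poisson text classifier with document length normalization concrete, the following is a minimal sketch and not the implementation used in this study. It assumes one simple normalization (rescaling each document's term counts to a fixed length) and a small additive smoothing constant. The names `normalize`, `train`, `predict`, `target_len`, and `smoothing` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def normalize(counts, target_len=100.0):
    """Rescale raw term counts so that they sum to target_len (an assumed scheme)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c * target_len / total for t, c in counts.items()}


def train(docs, labels, target_len=100.0, smoothing=0.01):
    """Estimate log priors and per-class Poisson rates from tokenized documents."""
    vocab = sorted({t for tokens in docs for t in tokens})
    n_docs = Counter(labels)
    sums = defaultdict(Counter)  # class -> term -> summed normalized frequency
    for tokens, y in zip(docs, labels):
        for t, f in normalize(Counter(tokens), target_len).items():
            sums[y][t] += f
    # Rate = smoothed mean normalized frequency of the term within the class.
    rates = {
        y: {t: (sums[y][t] + smoothing) / n_docs[y] for t in vocab}
        for y in n_docs
    }
    log_priors = {y: math.log(n / len(labels)) for y, n in n_docs.items()}
    return {"vocab": set(vocab), "rates": rates,
            "log_priors": log_priors, "target_len": target_len}


def predict(model, tokens):
    """Return the class with the highest posterior score under the Poisson model."""
    known = Counter(t for t in tokens if t in model["vocab"])  # ignore unseen terms
    x = normalize(known, model["target_len"])
    scores = {}
    for y, lam in model["rates"].items():
        # Poisson log-likelihood summed over the vocabulary; the log-factorial
        # term is dropped because it does not depend on the class.
        loglik = sum(x.get(t, 0.0) * math.log(rate) - rate
                     for t, rate in lam.items())
        scores[y] = model["log_priors"][y] + loglik
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
toy_docs = [["ball", "goal", "ball"], ["goal", "team"],
            ["vote", "law"], ["law", "vote", "vote", "tax"]]
toy_labels = ["sport", "sport", "politics", "politics"]
toy_model = train(toy_docs, toy_labels)
print(predict(toy_model, ["ball", "team", "goal"]))
```

Each term is treated as an independent Poisson variable with a single rate per class, which is what makes the model cheap to estimate. Replacing the Poisson log-likelihood with a negative binomial or K-mixture one would add parameters per term, which is the trade-off between fit and model complexity mentioned above.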
