Tuesday, August 28, 2012

N-gram features for text classification

Traditionally, text classification has relied on bag-of-words count features. For some experiments, I was wondering whether n-gram counts could make a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, I got about 20 million features for a trigram model. A quick look at the literature turned up this paper, which finds that n-gram features don't help much:
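To see why the feature count explodes, here is a minimal sketch of n-gram count extraction (a hypothetical helper, not tied to any particular toolkit): every distinct unigram, bigram, and trigram in the corpus becomes its own feature dimension.

```python
from collections import Counter

def ngram_counts(tokens, n_max=3):
    """Count all n-grams of length 1..n_max in a token sequence.

    Each distinct n-gram tuple becomes one feature, which is why
    the feature space grows so quickly on a large corpus.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

doc = "the quick brown fox jumps over the lazy dog".split()
feats = ngram_counts(doc)
```

Even this nine-token sentence yields 23 distinct features (8 unigrams, 8 bigrams, 7 trigrams); over a corpus the size of the WSJ, the union of all such tuples easily reaches tens of millions.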

A Study Using n-gram Features for Text Categorization, Johannes Fürnkranz

Bigram and trigram features may give modest gains, but feature selection is clearly required. A simple approach would be to select features based on document frequency or term frequency.
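A document-frequency cutoff is about the simplest such filter. Below is a sketch (the function name and threshold are my own, for illustration): compute how many documents each feature occurs in, then drop features below a minimum document frequency.

```python
from collections import Counter

def select_by_df(doc_features, min_df=2):
    """Prune features that occur in fewer than min_df documents.

    doc_features: list of per-document feature->count dicts.
    Returns the same documents restricted to the surviving features.
    """
    df = Counter()
    for feats in doc_features:
        df.update(set(feats))  # count each feature once per document
    keep = {f for f, c in df.items() if c >= min_df}
    return [{f: v for f, v in feats.items() if f in keep}
            for feats in doc_features]

# Toy example: only "the" appears in at least two documents.
docs = [{"the": 2, "cat": 1}, {"the": 1, "dog": 1}, {"the": 3, "fish": 1}]
pruned = select_by_df(docs, min_df=2)
```

Because n-gram frequencies are heavily skewed, even a low cutoff like min_df=2 typically discards the bulk of the 20 million features.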
