topicexplorer prep

Prepares a corpus for modeling with stoplist management tools.

Stoplisting

What is stoplisting?

Extremely common words do little to discriminate between documents and can produce uninterpretable topics. Terms like "the", "of", "is", "and", "but", and "or" are therefore excluded from modeling. These terms are called stop words, and the process of removing them is stoplisting.
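As a minimal illustration of the idea (not topicexplorer's own code), the sketch below drops a fixed set of stop words from a token list. The function name `remove_stop_words`, the case-insensitive matching, and the sample sentence are assumptions made for this example.

```python
# Illustrative stop word set, taken from the examples in the text above.
STOP_WORDS = {"the", "of", "is", "and", "but", "or"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "The history of the theory of evolution is long and contested".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['history', 'theory', 'evolution', 'long', 'contested']
```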

How topicexplorer prep generates stoplists

topicexplorer prep generates stoplists from the frequencies of words in the collection being modeled, rather than using the same list of words across different collections. Arbitrary lists can still be excluded with the --stopword-file argument.

While most natural language processing (NLP) tools exclude common words, topicexplorer prep also provides functionality to remove low-frequency words.

The contribution of low-frequency words to the probability distribution is negligible. If a word occurs only once in a 1 million word corpus (a size easily reached with only 25-50 volumes), it has a 0.000001 probability of occurring. The runtime improvements gained from excluding these low-frequency words from the word-topic matrix far outweigh the marginal improvement in model fit that keeping them would provide.

Another benefit of removing low-frequency words is the removal of spurious tokens introduced by optical character recognition (OCR) in scanned documents.
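The arithmetic behind these points can be sketched in a few lines. The counts below are invented for illustration, "whnle" and "shlp" stand in for hypothetical OCR misreadings, and the helper names are not part of topicexplorer. The sketch treats the word-topic matrix as having one row per vocabulary word and one column per topic, which is an assumption about its layout.

```python
from collections import Counter


def relative_frequency(word, counts):
    """Probability of drawing `word` at random from the corpus."""
    return counts[word] / sum(counts.values())


def word_topic_cells(counts, num_topics, min_count=1):
    """Cells in a (vocabulary x topics) matrix after dropping rare words."""
    vocab = [w for w, c in counts.items() if c >= min_count]
    return len(vocab) * num_topics


# Invented 1,000,000-token corpus with two singleton OCR-style errors.
counts = Counter({"whale": 600_000, "ship": 399_998, "whnle": 1, "shlp": 1})

p = relative_frequency("whnle", counts)
full = word_topic_cells(counts, num_topics=20)
pruned = word_topic_cells(counts, num_topics=20, min_count=2)

print(p)             # 1e-06
print(full, pruned)  # 80 40
```

In this toy corpus the singletons are half of the vocabulary but only two of a million tokens, so removing them halves the matrix while discarding almost no probability mass.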

Finally, very short words can be excluded with the --min-word-len argument. These short words often appear when a text contains mathematical formulas (e.g., y = mx + b would introduce y, mx, and b). They are usually caught by the low-frequency filters, but this argument ensures they are left out.
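A length filter of this kind might look like the sketch below. It assumes that words with fewer than the minimum number of characters are dropped and that words of exactly that length are kept; the source does not state which side of the boundary is inclusive, and `drop_short_words` is a hypothetical name.

```python
def drop_short_words(tokens, min_word_len):
    """Keep tokens with at least `min_word_len` characters (assumed inclusive)."""
    return [t for t in tokens if len(t) >= min_word_len]


# Hypothetical sentence containing the formula from the text above.
tokens = "the slope intercept form y = mx + b describes a line".split()
kept = drop_short_words(tokens, min_word_len=3)
print(kept)  # ['the', 'slope', 'intercept', 'form', 'describes', 'line']
```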

See also

Introduction to Information Retrieval – stop words
Stanford textbook on stop words.

Command Line Arguments

High-probability words (--high-percent)

Remove the most common words from the corpus: those that together account for up to HIGH_PERCENT of the total word occurrences in the corpus.

Recommended, but not default: --high-percent 50
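One possible reading of this filter, written as a sketch rather than as topicexplorer's actual implementation: walk the vocabulary from most to least frequent and keep adding words to the stoplist while their combined occurrences stay within HIGH_PERCENT of all tokens. The function name and the toy corpus are assumptions.

```python
from collections import Counter


def high_frequency_stoplist(tokens, high_percent):
    """Most frequent words whose combined count stays within high_percent."""
    counts = Counter(tokens)
    budget = sum(counts.values()) * high_percent / 100.0
    stoplist, covered = set(), 0
    # most_common() yields (word, count) pairs from most to least frequent.
    for word, count in counts.most_common():
        if covered + count > budget:
            break
        stoplist.add(word)
        covered += count
    return stoplist


# Toy corpus of 10 tokens: "the" alone accounts for 50% of occurrences.
tokens = ["the"] * 5 + ["of"] * 3 + ["whale"] + ["ship"]
print(sorted(high_frequency_stoplist(tokens, 50)))  # ['the']
print(sorted(high_frequency_stoplist(tokens, 80)))  # ['of', 'the']
```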

Low-probability words (--low-percent)

Remove the least common words from the corpus: those that together account for up to LOW_PERCENT of the total word occurrences in the corpus.

Recommended, but not default: --low-percent 10
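The low-frequency filter can be sketched the same way, this time walking the vocabulary from least to most frequent. This is an illustration of the description above under stated assumptions, not topicexplorer's code: the function name is hypothetical, and ties between equally rare words are broken alphabetically, which is an arbitrary choice made here.

```python
from collections import Counter


def low_frequency_stoplist(tokens, low_percent):
    """Least frequent words whose combined count stays within low_percent."""
    counts = Counter(tokens)
    budget = sum(counts.values()) * low_percent / 100.0
    stoplist, covered = set(), 0
    # Sort by ascending count, then alphabetically to make ties deterministic.
    for word, count in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        if covered + count > budget:
            break
        stoplist.add(word)
        covered += count
    return stoplist


# Toy corpus of 10 tokens: the two singletons make up 20% of occurrences.
tokens = ["the"] * 5 + ["of"] * 3 + ["whale"] + ["ship"]
print(sorted(low_frequency_stoplist(tokens, 20)))  # ['ship', 'whale']
```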

Small words (--min-word-len)

Remove words with few characters from the corpus. These often include fragments of mathematical notation and OCR errors.

Recommended, but not default: --min-word-len 3

Custom stopwords (--stopword-file)

Remove a custom list of words, supplied in a file, from the corpus.
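The sketch below shows how a custom stopword list could be applied to a token stream. It assumes a plain-text file with one word per line and case-insensitive matching; the format topicexplorer actually expects is not stated here, so treat both as assumptions. An in-memory `io.StringIO` stands in for the file, and the function names are hypothetical.

```python
import io


def load_stopwords(handle):
    """Read one stopword per line, skipping blank lines (assumed format)."""
    return {line.strip().lower() for line in handle if line.strip()}


def apply_stopwords(tokens, stopwords):
    """Drop tokens that appear in the custom stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# In-memory stand-in for a stopword file; contents are invented.
stopword_file = io.StringIO("chapter\nillustration\n\nProject\n")
custom = load_stopwords(stopword_file)

tokens = ["Chapter", "one", "whale", "illustration", "project", "ship"]
kept = apply_stopwords(tokens, custom)
print(kept)  # ['one', 'whale', 'ship']
```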

Quiet mode (-q)

Suppresses all user input requests. Uses default values unless otherwise specified by other argument flags. Very useful for scripting automated pipelines.
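For an automated pipeline, the flags documented above can be assembled programmatically. The sketch below only builds and prints the command line; it does not run anything. The positional `corpus.ini` argument and the helper name `build_prep_command` are assumptions, so check the tool's own help output for the exact invocation.

```python
import shlex


def build_prep_command(config_path, high_percent=None, low_percent=None,
                       min_word_len=None, stopword_file=None, quiet=True):
    """Assemble an argument list from the flags documented above."""
    # The positional config path is an assumption, not taken from this page.
    cmd = ["topicexplorer", "prep", config_path]
    if high_percent is not None:
        cmd += ["--high-percent", str(high_percent)]
    if low_percent is not None:
        cmd += ["--low-percent", str(low_percent)]
    if min_word_len is not None:
        cmd += ["--min-word-len", str(min_word_len)]
    if stopword_file is not None:
        cmd += ["--stopword-file", stopword_file]
    if quiet:
        cmd.append("-q")
    return cmd


# Uses the recommended (non-default) values listed on this page.
cmd = build_prep_command("corpus.ini", high_percent=50, low_percent=10,
                         min_word_len=3)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where topicexplorer is installed.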