`topicexplorer prep`¶

Prepares a corpus for modeling with stoplist management tools.

Stoplisting¶

What is stoplisting?¶

Extremely common words do little to discriminate between documents and can produce uninterpretable topics. Terms like “the”, “of”, “is”, “and”, “but”, and “or” are therefore excluded from modeling. These terms are called stop words, and the process of removing them is stoplisting.

How `topicexplorer prep` generates stoplists¶

topicexplorer prep generates stoplists from the frequencies of words in the collection being modeled, rather than using the same list of words across different collections. A custom list of words can still be excluded with the --stopword-file argument.

While most natural language processing (NLP) tools exclude common words, topicexplorer prep also provides functionality to remove low-frequency words.

The contribution of low-frequency words to the probability distribution is negligible: if a word occurs only once in a 1-million-word corpus (a size easily reached with only 25-50 volumes), it has a 0.000001 probability of occurring. The runtime saved by excluding these low-frequency words from the word-topic matrix far outweighs the marginal improvement in model fit that keeping them would provide.
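The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the toy corpus and the `relative_frequencies` helper are made up for this example and are not part of topicexplorer.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all token occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus: 1,000,000 tokens in which "zymurgy" occurs once.
tokens = ["the"] * 999_999 + ["zymurgy"]
freqs = relative_frequencies(tokens)

print(freqs["zymurgy"])  # 1e-06, i.e. 0.000001
```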

Another benefit of removing low-frequency words is the removal of spurious tokens introduced by optical character recognition (OCR) in scanned documents.

Finally, very short words can be excluded with the --min-word-len argument. These short words often appear when mathematical formulas are in a text (e.g., y = mx + b would introduce y, mx, and b). Usually, they will be caught by the low-frequency filters, but --min-word-len ensures they are left out.
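As a rough illustration of the length filter, the sketch below tokenizes a sentence containing the formula from the example and drops short tokens. It assumes that a minimum length of 3 keeps words of three or more characters. The tokenizer and the `drop_short_words` helper are hypothetical and are not topicexplorer's own code.

```python
import re


def drop_short_words(tokens, min_word_len=3):
    """Keep only tokens with at least min_word_len characters (assumed semantics)."""
    return [token for token in tokens if len(token) >= min_word_len]


text = "the line y = mx + b has slope m"
# Naive tokenizer for illustration: lowercase runs of ASCII letters.
tokens = re.findall(r"[a-z]+", text.lower())

print(drop_short_words(tokens))  # ['the', 'line', 'has', 'slope']
```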

Recommended Settings¶

Each argument has a suggested value. A quick start, assuming your corpus is in a folder called “workset”, is:

topicexplorer prep workset --high-percent 50 --low-percent 10 --min-word-len 3 -q

These parameters work well for English-language text. For languages without articles (the counterparts of English “a” and “the”), we recommend lowering the value to --high-percent 25.

Command Line Arguments¶

High-probability words (`--high-percent`)¶

Remove the most common words from the corpus: those that together account for up to HIGH_PERCENT percent of the total word occurrences in the corpus.

Recommended, but not default: --high-percent 50
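One plausible reading of this filter is a cumulative-frequency cutoff: walk the vocabulary from most to least frequent and stoplist words until their combined count would exceed HIGH_PERCENT percent of all occurrences. The sketch below implements that reading only. The `high_frequency_stoplist` function and the toy counts are hypothetical, and topicexplorer's actual cutoff rule may differ.

```python
from collections import Counter


def high_frequency_stoplist(tokens, high_percent):
    """Most common words whose combined count stays within high_percent
    of all token occurrences (an assumed reading of --high-percent)."""
    counts = Counter(tokens)
    budget = sum(counts.values()) * high_percent / 100
    stoplist, covered = [], 0
    for word, count in counts.most_common():  # most frequent first
        if covered + count > budget:
            break
        stoplist.append(word)
        covered += count
    return stoplist


# Hypothetical toy corpus of 100 tokens.
tokens = (["the"] * 30 + ["of"] * 20 + ["topic"] * 18
          + ["model"] * 17 + ["corpus"] * 15)

print(high_frequency_stoplist(tokens, 50))  # ['the', 'of']
```

In this toy corpus “the” and “of” together make up exactly 50 percent of the tokens, so they are stoplisted and the content words survive.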

Low-probability words (`--low-percent`)¶

Remove the least common words from the corpus: those that together account for up to LOW_PERCENT percent of the total word occurrences in the corpus.

Recommended, but not default: --low-percent 10
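The low-frequency filter can be pictured as the mirror image of a cumulative cutoff at the top of the frequency list: start from the rarest words and stoplist them until their combined count would exceed LOW_PERCENT percent of all occurrences. The sketch below is an assumption about the behavior, not topicexplorer's implementation. The `low_frequency_stoplist` function, the alphabetical tie-breaking, and the toy corpus (with misspellings standing in for OCR errors) are all hypothetical.

```python
from collections import Counter


def low_frequency_stoplist(tokens, low_percent):
    """Rarest words whose combined count stays within low_percent
    of all token occurrences (an assumed reading of --low-percent)."""
    counts = Counter(tokens)
    budget = sum(counts.values()) * low_percent / 100
    stoplist, covered = [], 0
    # Rarest first; ties are broken alphabetically so the result is deterministic.
    for word, count in sorted(counts.items(), key=lambda item: (item[1], item[0])):
        if covered + count > budget:
            break
        stoplist.append(word)
        covered += count
    return stoplist


# Hypothetical toy corpus of 100 tokens; "tpoic" and "modle" mimic OCR errors.
tokens = (["the"] * 59 + ["topic"] * 30 + ["model"] * 6
          + ["zymurgy"] * 2 + ["modle"] * 2 + ["tpoic"] * 1)

print(low_frequency_stoplist(tokens, 10))  # ['tpoic', 'modle', 'zymurgy']
```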

Small words (`--min-word-len`)¶

Remove words with few characters from the corpus. Such words are often fragments of mathematical notation or OCR errors.

Recommended, but not default: --min-word-len 3

Custom stopwords (`--stopword-file`)¶

Remove the words listed in a user-supplied file from the corpus.

Quiet mode (`-q`)¶

Suppresses all user input requests. Uses default values unless otherwise specified by other argument flags. Very useful for scripting automated pipelines.
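For an automated pipeline, the quick-start command can be assembled in a script. The sketch below only builds and prints the command line. The `build_prep_command` helper is hypothetical, and its keyword defaults are the recommended values from this page rather than the tool's own defaults.

```python
import shlex


def build_prep_command(corpus_dir, high_percent=50, low_percent=10, min_word_len=3):
    """Assemble the argument list for a non-interactive `topicexplorer prep` run."""
    return [
        "topicexplorer", "prep", corpus_dir,
        "--high-percent", str(high_percent),
        "--low-percent", str(low_percent),
        "--min-word-len", str(min_word_len),
        "-q",  # quiet mode: never prompt for input
    ]


command = build_prep_command("workset")
print(shlex.join(command))
# topicexplorer prep workset --high-percent 50 --low-percent 10 --min-word-len 3 -q

# To actually run it (requires topicexplorer to be installed and on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if the corpus path contains spaces.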
