topicexplorer prep
Prepares a corpus for modeling with stoplist management tools.
Stoplisting
What is stoplisting?
Extremely common words can be of little value in discriminating between
documents and can create uninterpretable topics. Terms like “the”, “of”,
“is”, “and”, “but”, and “or” are thus excluded from modeling. These
terms are called stop words. The process of removing stop words is
stoplisting.
How topicexplorer prep generates stoplists
topicexplorer prep generates stoplists from the frequencies of words in the
collection being modeled, rather than using the same list of words across
different collections. Arbitrary lists can still be excluded with the
--stopword-file argument.
While most natural language processing (NLP) tools exclude common words,
topicexplorer prep also provides functionality to remove low-frequency
words.
The contribution of low-frequency words to the probability distribution is negligible: if a word occurs only once in a 1 million word corpus (a size that can easily be reached with only 25-50 volumes), then it has a 0.000001 probability of occurring. The runtime improvements gained from excluding these low-frequency words from the word-topic matrix far outweigh the marginal improvement in model fit that keeping them would provide.
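The arithmetic behind these figures can be checked with a short sketch. The variable names are illustrative only, and the per-volume word counts are simply what the 25-50 volume estimate implies, not measured values.

```python
# Relative frequency of a word that occurs once in a 1 million word corpus.
corpus_size = 1_000_000
occurrences = 1
probability = occurrences / corpus_size
print(probability)  # 1e-06

# Average volume length implied by reaching 1 million words in 25-50 volumes.
words_per_volume = [corpus_size // volumes for volumes in (25, 50)]
print(words_per_volume)  # [40000, 20000]
```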
Another benefit of removing low-frequency words is the removal of spurious tokens introduced by optical character recognition (OCR) in scanned documents.
Finally, very small words can be excluded with the --min-word-len argument.
These small words often appear when mathematical formulas are in a text (e.g.,
y = mx + b would introduce y, mx, and b). Usually, they will be
caught by the low-frequency filters, but this ensures they are left out.
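A minimal sketch of a length filter applied to the formula example above. The function `drop_short_words` and the regex tokenizer are hypothetical and are not part of topicexplorer; the sketch also assumes the argument value is the shortest length that is kept, which this page does not state explicitly.

```python
import re

def drop_short_words(tokens, min_word_len):
    """Keep only tokens with at least min_word_len characters (hypothetical helper)."""
    return [token for token in tokens if len(token) >= min_word_len]

text = "the line y = mx + b has slope m"
# Naive tokenizer for illustration: runs of lowercase ASCII letters.
tokens = re.findall(r"[a-z]+", text.lower())
kept = drop_short_words(tokens, 3)
print(kept)  # ['the', 'line', 'has', 'slope']
```

The formula fragments y, mx, b, and m are dropped regardless of how often they occur, which is the guarantee the frequency filters alone cannot give.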
See also
- Introduction to Information Retrieval – stop words
- Stanford textbook on stop words.
Recommended Settings
Each argument has a suggested value. A quick start, assuming your corpus is in a folder called “workset”, is:
topicexplorer prep workset --high-percent 50 --low-percent 10 --min-word-len 3 -q
These parameters work well for English-language text. For languages without
articles (e.g., “a”, “the”), we recommend reducing the --high-percent
argument to --high-percent 25.
Command Line Arguments
High-probability words (--high-percent)
Remove common words from the corpus, accounting for up to HIGH_PERCENT of
the total occurrences in the corpus.
Recommended, but not default: --high-percent 50
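One way to read "accounting for up to HIGH_PERCENT of the total occurrences" is as a cumulative budget spent on the most frequent words first. The sketch below illustrates that reading only; `high_frequency_stoplist` is a hypothetical helper, and the actual topicexplorer implementation may treat ties and boundary cases differently.

```python
from collections import Counter

def high_frequency_stoplist(tokens, high_percent):
    """Most frequent words whose combined count stays within high_percent
    of all token occurrences (illustrative, not topicexplorer's code)."""
    counts = Counter(tokens)
    budget = len(tokens) * high_percent / 100.0
    stoplist, covered = [], 0
    # Walk words from most to least frequent; ties are broken alphabetically.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered + count > budget:
            break
        stoplist.append(word)
        covered += count
    return stoplist

tokens = "the cat and the dog and the bird saw the fish".split()
print(high_frequency_stoplist(tokens, 50))  # ['the']
print(high_frequency_stoplist(tokens, 60))  # ['the', 'and']
```

In this toy corpus of 11 tokens, "the" alone covers 4 occurrences, so a 50 percent budget (5.5 occurrences) has no room left for "and".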
Low-probability words (--low-percent)
Remove uncommon words from the corpus, accounting for up to LOW_PERCENT of
the total occurrences in the corpus.
Recommended, but not default: --low-percent 10
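The low-frequency filter can be sketched the same way from the rare end of the distribution. Here the cutoff is chosen per frequency tier, so that words with equal counts are kept or removed together. This is an assumption made for illustration: `low_frequency_stoplist` is a hypothetical helper, not the topicexplorer implementation.

```python
from collections import Counter

def low_frequency_stoplist(tokens, low_percent):
    """Words at or below the highest frequency tier whose cumulative
    occurrences stay within low_percent of the corpus (illustrative only)."""
    counts = Counter(tokens)
    budget = len(tokens) * low_percent / 100.0
    # Total occurrences contributed by all words sharing each frequency.
    mass_by_freq = Counter()
    for count in counts.values():
        mass_by_freq[count] += count
    covered, cutoff = 0, 0
    for freq in sorted(mass_by_freq):
        if covered + mass_by_freq[freq] > budget:
            break
        covered += mass_by_freq[freq]
        cutoff = freq
    return sorted(word for word, count in counts.items() if count <= cutoff)

tokens = "the cat and the dog and the bird saw the fish".split()
print(low_frequency_stoplist(tokens, 10))  # []
print(low_frequency_stoplist(tokens, 50))  # ['bird', 'cat', 'dog', 'fish', 'saw']
```

In a corpus this small the five single-occurrence words already make up 5 of 11 tokens, so a 10 percent budget removes nothing; in a realistic corpus the rare tail is a much smaller share of the total.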
Small words (--min-word-len)
Remove words with few characters from the corpus. Such words often include mathematical notation and OCR errors.
Recommended, but not default: --min-word-len 3
Custom stopwords (--stopword-file)
Remove the words listed in a custom stopword file from the corpus.
Quiet mode (-q)
Suppresses all user input requests. Uses default values unless otherwise specified by other argument flags. Very useful for scripting automated pipelines.
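For an automated pipeline, the quick-start command can be assembled programmatically. The sketch below only builds the argument list, using the flags and recommended values documented on this page; `build_prep_command` is a hypothetical helper, not part of topicexplorer.

```python
def build_prep_command(corpus_path, high_percent=50, low_percent=10, min_word_len=3):
    """Build the argument list for a non-interactive `topicexplorer prep` run."""
    return [
        "topicexplorer", "prep", corpus_path,
        "--high-percent", str(high_percent),
        "--low-percent", str(low_percent),
        "--min-word-len", str(min_word_len),
        "-q",  # quiet mode: suppress user input requests
    ]

command = build_prep_command("workset")
print(" ".join(command))
```

With topicexplorer installed, the list could be passed to `subprocess.run(command, check=True)`; keeping the builder separate makes it easy to vary the thresholds per corpus, for example `build_prep_command("workset", high_percent=25)` for a language without articles.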