topicexplorer init
This module initializes the topicexplorer. It performs tokenization of the corpus. It creates a .ini configuration file with the settings for this instance. It also creates a vsm.Corpus object saved in a .npz file. The corpus object stores basic information about documents and the words they contain.
Tokenization and Normalization
Tokenization is the process of segmenting text into discrete units, or tokens. Normalization is the process of folding equivalent tokens together. For example, The and the are normalized to a single, lower-case the token. The token Lower-case is normalized to lowercase with the default tokenizer.
Note
The token the would commonly be “stopped”, or removed from the corpus. This occurs during the topicexplorer prep stage.
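The two steps above can be sketched in a few lines of Python. This is a minimal illustration of whitespace tokenization and lower-case folding, not the tokenizer topicexplorer actually ships:

```python
import re

def tokenize(text):
    """Tokenization: segment text into discrete units on whitespace."""
    return text.split()

def normalize(token):
    """Normalization: fold equivalent tokens together by
    lower-casing and stripping punctuation."""
    return re.sub(r"[^\w]", "", token.lower())

tokens = [normalize(t) for t in tokenize("The cat sat on the Lower-case mat.")]
# "The" and "the" fold to the same token; "Lower-case" becomes "lowercase":
# ['the', 'cat', 'sat', 'on', 'the', 'lowercase', 'mat']
```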
Tokenization is language-dependent. While many languages (including English) can be tokenized by splitting on whitespace characters, other languages (such as Chinese) require more advanced techniques.
topicexplorer init includes several tokenizers, including two Chinese-language functions, selected with the --tokenizer argument.
See also
- Introduction to Information Retrieval – Tokenization: Stanford Information Retrieval book section introducing tokenization.
- Introduction to Information Retrieval – Normalization: Stanford Information Retrieval book section introducing normalization.
Input Formats
topicexplorer init can generate corpora from several types of data:
- Plain-text files (.txt)
- PDF files (.pdf)
- BibTeX files (.bib) with file fields pointing to PDF files
Other types of files will first need to be converted to plain text. We recommend pandoc as a way to convert other file types.
PDFs
PDF files will first be converted to plain text using the pdfminer library. This process creates a new directory with the suffix -txt in the same location as the corpus path.
Documents
|-- example
|-- example-txt
When serving the visualization with topicexplorer launch example --fulltext, the original PDFs will be served from the example folder. If example is removed, the plain text will be served from the example-txt folder.
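The conversion step can be sketched as follows. This is a minimal illustration assuming the pdfminer.six package for text extraction, not topicexplorer's actual implementation:

```python
from pathlib import Path

def txt_dir_for(corpus_path):
    """Return the sibling '-txt' directory for a corpus path,
    e.g. 'Documents/example' -> 'Documents/example-txt'."""
    corpus = Path(corpus_path)
    return corpus.with_name(corpus.name + "-txt")

def convert_pdfs(corpus_path):
    """Extract plain text from every PDF in corpus_path into
    the sibling '-txt' directory."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    out_dir = txt_dir_for(corpus_path)
    out_dir.mkdir(exist_ok=True)
    for pdf in Path(corpus_path).glob("*.pdf"):
        (out_dir / (pdf.stem + ".txt")).write_text(extract_text(pdf))
```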
BibTeX
topicexplorer init can build a corpus from appropriately marked-up BibTeX files, such as those generated by Mendeley. Each BibTeX entry will require a file = {} directive that locates the PDF file on the file system.
@inproceedings{Murdock2015,
author = {Murdock, Jaimie and Allen, Colin},
title = {{Visualization Techniques for Topic Model Checking}},
year = {2015},
booktitle = {2015 Association for the Advancement of Artificial Intelligence},
pages = {3},
address = {Austin, TX},
publisher = {AAAI Press},
file = {:home/jaimie/Downloads/10007-45032-1-PB.pdf:pdf},
keywords = {applications,nlp,topic modeling,visualization},
mendeley-groups = {Publications,Topic Modeling},
url = {http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/10007},
}
Plain text will be extracted using the method described in PDFs above. Metadata is added to the visualization using the .bib file.
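Reading the file directive can be sketched with a simple regular expression over Mendeley-style file = {:path:pdf} values. This is an illustration, not topicexplorer's actual BibTeX parser:

```python
import re

def pdf_path_from_entry(entry):
    """Extract the PDF path from a Mendeley-style
    file = {:path:pdf} directive, or return None if absent."""
    m = re.search(r"file\s*=\s*\{:(.*):pdf\}", entry)
    return m.group(1) if m else None

line = "file = {:home/jaimie/Downloads/10007-45032-1-PB.pdf:pdf},"
pdf_path_from_entry(line)  # 'home/jaimie/Downloads/10007-45032-1-PB.pdf'
```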
Command Line Arguments
Corpus Name (--name)
The name of the corpus displayed in the visualizations.
Note
Make sure to surround the corpus name with quotes. For example:
topicexplorer init --name "The Example Corpus" example
Model Path (--model-path)
The path to store the model files. Defaults to a models directory at the same level as the corpus directory.
topicexplorer init example
Documents
|-- example
|-- models
Tokenizer Selection (--tokenizer)
topicexplorer includes several tokenizers, selected with the --tokenizer flag. They are:
- default – the default tokenizer. Normalizes by removing punctuation and digits; splits words joined by multi-character dashes by inserting spaces.
- simple – normalizes by removing digits and a limited set of punctuation.
- ltc – late classical Chinese tokenizer.
- zh – modern Chinese tokenizer.
Unidecode (--unidecode)
This flag adds an additional normalization step, transliterating accented characters using the Unidecode library. For example, naïveté becomes naivete.
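topicexplorer performs this step with the Unidecode library, which handles far more than accents. A rough standard-library approximation of the same folding for accented Latin characters (NFKD decomposition, then dropping non-ASCII combining marks) looks like:

```python
import unicodedata

def ascii_fold(text):
    """Strip accents by decomposing characters (NFKD) and
    discarding the non-ASCII combining marks."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

ascii_fold("naïveté")  # 'naivete'
```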
Rebuild (--rebuild)
Re-tokenizes the corpus and recreates the configuration file.
HathiTrust Integration (--htrc)
For use with a list of HathiTrust volumes. See HTRC Integrations - Working with Extracted Features.
Quiet Mode (-q)
Suppresses all user input requests. Uses default values unless otherwise specified by other argument flags. Very useful for scripting automated pipelines.