topicexplorer init

This module initializes a topicexplorer instance. It performs tokenization of the corpus.

It creates an .ini configuration file with the settings for this instance. It also creates a vsm.Corpus object saved in an .npz file. The corpus object stores basic information about the documents and the words they contain.
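The .ini file is an ordinary INI-format configuration and can be read with Python's standard configparser. The sketch below shows the kind of per-instance settings it might hold; the section and key names here are illustrative assumptions, not the actual schema.

```python
from configparser import ConfigParser

# Illustrative sketch of an instance configuration file.
# Section and key names are hypothetical, not topicexplorer's real schema.
config = ConfigParser()
config["main"] = {
    "corpus_file": "example.npz",  # path to the saved vsm.Corpus object
    "model_path": "models",        # where trained models will be stored
}

with open("example.ini", "w") as f:
    config.write(f)
```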

Tokenization and Normalization

Tokenization is the process of segmenting text into discrete units, or tokens.

Normalization is the process of folding equivalent tokens together.

For example, The and the are normalized to a single, lower-case the token. The token Lower-case is normalized to lowercase with the default tokenizer.
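The folding described above can be sketched in a few lines of Python. This is a simplified illustration of default-style normalization (lowercasing, then stripping punctuation and digits), not topicexplorer's actual implementation.

```python
import re

def normalize(token):
    # Fold equivalent tokens together: lowercase, then strip
    # punctuation and digits. A sketch of default-style behavior,
    # not the actual topicexplorer tokenizer.
    return re.sub(r"[^a-z]", "", token.lower())

tokens = ["The", "Lower-case", "the", "cat."]
print([normalize(t) for t in tokens])  # ['the', 'lowercase', 'the', 'cat']
```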

Note

Note that the would commonly be “stopped” or removed from the corpus. This occurs during the topicexplorer prep stage.

Tokenization is language-dependent. While many languages (including English) can be tokenized by splitting on whitespace characters, other languages (such as Chinese) require more advanced techniques.
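A quick illustration of why whitespace splitting is not enough for every language: Chinese is written without spaces between words, so a whitespace split returns the whole sentence as one undivided "token". (The Chinese sentence below means roughly "machine learning is interesting".)

```python
# Whitespace splitting works for English, but Chinese is written
# without spaces between words, so the same approach fails.
english = "the quick brown fox"
chinese = "机器学习很有趣"  # roughly: "machine learning is interesting"

print(english.split())  # ['the', 'quick', 'brown', 'fox']
print(chinese.split())  # ['机器学习很有趣'] -- one undivided "token"
```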

topicexplorer init includes several tokenizers, including two Chinese-language functions, selected with the --tokenizer argument.

See also

Introduction to Information Retrieval – Tokenization
Stanford Information Retrieval book section introducing tokenization.
Introduction to Information Retrieval – Normalization
Stanford Information Retrieval book section introducing normalization.

Input Formats

topicexplorer init can generate corpora from several types of data:

  • Plain-text Files (.txt)
  • PDF files (.pdf)
  • BibTeX files (.bib) with file fields pointing to PDF files

Other types of files must first be converted to plain text. We recommend pandoc for converting other filetypes.

PDFs

PDF files will first be converted to plain text using the pdfminer library. This process creates a new directory with the suffix -txt in the same location as the corpus path.

Documents
|-- example
|-- example-txt

When serving the visualization with topicexplorer launch example --fulltext, the original PDFs will be served from the example folder. If example is removed, the plain-text will be served from the example-txt folder.

BibTeX

topicexplorer init can build a corpus from appropriately marked-up BibTeX files, such as those generated by Mendeley.

Each BibTeX entry requires a file = {} field that locates the PDF file on the file system.

@inproceedings{Murdock2015,
    author = {Murdock, Jaimie and Allen, Colin},
    title = {{Visualization Techniques for Topic Model Checking}},
    year = {2015},
    booktitle = {2015 Association for the Advancement of Artificial Intelligence},
    pages = {3},
    address = {Austin, TX},
    publisher = {AAAI Press},
    file = {:home/jaimie/Downloads/10007-45032-1-PB.pdf:pdf},
    keywords = {applications,nlp,topic modeling,visualization},
    mendeley-groups = {Publications,Topic Modeling},
    url = {http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/10007},
}

Plain text will be extracted using the method described in the PDFs section above. Metadata is added to the visualization using the .bib file.

Command Line Arguments

Corpus Name (--name)

The name of the corpus displayed in the visualizations.

Note

Make sure to surround the corpus name with quotes. For example:

topicexplorer init --name "The Example Corpus" example

Model Path (--model-path)

The path to store the model files.

Defaults to a models directory at the same level as the corpus directory.

topicexplorer init example

Documents
|-- example
|-- models

Tokenizer Selection (--tokenizer)

topicexplorer includes several tokenizers, selected with the --tokenizer flag. They are:

  • default – the default tokenizer. Normalizes by lowercasing and removing punctuation and digits. Splits words joined by multi-character dashes by inserting spaces.
  • simple – normalizes by removing digits and a limited set of punctuation.
  • ltc – Late Classical Chinese tokenizer.
  • zh – modern Chinese tokenizer.

Unidecode (--unidecode)

This flag adds an additional normalization step, transliterating accented characters using the Unidecode library.

For example, naïveté becomes naivete.
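For simple Latin accents, this transliteration can be approximated with Python's standard unicodedata module: decompose each character (NFKD) and drop the combining marks. Note this is only a rough sketch; the Unidecode library that topicexplorer actually uses handles a much wider range of scripts.

```python
import unicodedata

def ascii_fold(text):
    # Approximates Unidecode for Latin accents: decompose characters
    # (NFKD), then drop the non-ASCII combining marks. The Unidecode
    # library itself covers far more scripts than this sketch.
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii"))

print(ascii_fold("naïveté"))  # naivete
```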

Rebuild (--rebuild)

Re-tokenizes the corpus and recreates the configuration file.

HathiTrust Integration (--htrc)

For use with a list of HathiTrust volumes. See HTRC Integrations - Working with Extracted Features.

Quiet Mode (-q)

Suppresses all user input requests, using default values unless other argument flags specify otherwise. Useful for scripting automated pipelines.