vsm.corpus

[General documentation about the corpus submodule]

Classes

Corpus(corpus[, context_types, ...]) The goal of the Corpus class is to provide an efficient representation of a textual corpus.
class vsm.corpus.Corpus(corpus, context_types=[], context_data=[], remove_empty=True)

The goal of the Corpus class is to provide an efficient representation of a textual corpus.

A Corpus object contains an integer representation of the text and maps to permit conversion between integer and string representations of a given word.

As a BaseCorpus object, it includes a dictionary of tokenizations of the corpus and a method for viewing (without copying) these tokenizations. This dictionary also stores metadata (e.g., document names) associated with the available tokenizations.
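The relationship between the integer representation and the two word maps can be illustrated with a minimal sketch in plain NumPy. This is not the vsm implementation: the variable names mirror the attributes documented below, and assigning integers in order of first appearance is an assumption chosen to match the example on this page.

```python
import numpy as np

# Illustrative stand-in for the encoding step; not the vsm implementation.
text = ['I', 'came', 'I', 'saw', 'I', 'conquered']

# word -> integer, assigned on first appearance (an assumption)
words_int = {}
for w in text:
    if w not in words_int:
        words_int[w] = len(words_int)

# integer -> word; dicts preserve insertion order, so index i holds word i
words = np.array(list(words_int))

# the corpus as a compact 32-bit integer array
corpus = np.array([words_int[w] for w in text], dtype=np.int32)

# indexing `words` with `corpus` converts back to strings
decoded = words[corpus]
```

Storing each distinct word once and the corpus as integers keeps the representation compact, and conversion in either direction is a single lookup.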

Parameters:
  • corpus (array-like) – A string array representing the corpus as a sequence of atomic words.
  • context_data (list-like with 1-D integer array-like elements, optional) – Each element in context_data is an array of indices marking context boundaries, and is intended for use as the value of the indices_or_sections parameter of numpy.split. Elements of context_data may also be 1-D arrays whose elements are pairs, where the first element is a context boundary and the second element is metadata associated with the context preceding that boundary. For example, (250, ‘dogs’) might indicate that the ‘article’ context ending at the 250th word of the corpus is named ‘dogs’. Default is [].
  • context_types (array-like, optional) – Each element in context_types is a type of a context in context_data.
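How an element of context_data interacts with numpy.split can be sketched as follows. This is an illustration only and does not call vsm. The field names 'idx' and 'sent_label' are taken from the example on this page, and a unicode dtype is assumed for the labels so that they come back as str on Python 3.

```python
import numpy as np

corpus = np.array([0, 1, 0, 2, 0, 3], dtype=np.int32)

# Boundary/metadata pairs stored as a structured array.
# 'U6' is an assumption made for this sketch (labels as str, not bytes).
sentences = np.array([(2, 'Veni'), (4, 'Vidi'), (6, 'Vici')],
                     dtype=[('idx', '<i8'), ('sent_label', 'U6')])

# The boundary column is passed to numpy.split as indices_or_sections.
parts = np.split(corpus, sentences['idx'])

# A boundary equal to len(corpus) leaves an empty trailing piece; drop it.
contexts = [p for p in parts if p.size > 0]

# Metadata for the context preceding each boundary.
labels = sentences['sent_label'].tolist()
```

Note that numpy.split returns one more piece than there are boundaries, so a final boundary at the end of the corpus produces an empty array that has to be discarded in this sketch.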
Attributes:
  • corpus (1-D 32-bit integer array)

    The integer representation of the string array-like value passed as the corpus parameter.

  • words (1-D string array)

    The indexed set of strings occurring in corpus. It is a string-typed array.

  • words_int (dictionary with 32-bit integer values)

    A dictionary whose keys are words and whose values are their corresponding integers (i.e., indices in words).

Methods:
  • view_metadata

    Takes a type of tokenization and returns a view of the metadata of the tokenization.

  • view_contexts

    Takes a type of tokenization and returns a view of the corpus tokenized accordingly. The optional parameter as_strings takes a Boolean value: True to view string representations of words; False to view integer representations of words. Default is False.

  • save

    Takes a filename and saves the data contained in a Corpus object to an npz file using numpy.savez.

  • load

    Static method. Takes a filename, loads the file data into a Corpus object and returns the object.

  • apply_stoplist

    Takes a list of stopwords and returns a copy of the corpus with the stopwords removed.

  • tolist

    Returns Corpus object as a list of lists of either integers or strings, according to as_strings.

See Also:

BaseCorpus

Examples

>>> import numpy as np
>>> text = ['I', 'came', 'I', 'saw', 'I', 'conquered']
>>> context_types = ['sentences']
>>> context_data = [np.array([(2, 'Veni'), (4, 'Vidi'), (6, 'Vici')],
...                          dtype=[('idx', '<i8'), ('sent_label', '|S6')])]
>>> from vsm.corpus import Corpus
>>> c = Corpus(text, context_types=context_types, context_data=context_data)
>>> c.corpus
array([0, 1, 0, 2, 0, 3], dtype=int32)
>>> c.words
array(['I', 'came', 'saw', 'conquered'],
      dtype='|S9')
>>> c.words_int['saw']
2
>>> c.view_contexts('sentences')
[array([0, 1], dtype=int32), array([0, 2], dtype=int32),
 array([0, 3], dtype=int32)]
>>> c.view_contexts('sentences', as_strings=True)
[array(['I', 'came'],
      dtype='|S9'),
 array(['I', 'saw'],
      dtype='|S9'),
 array(['I', 'conquered'],
      dtype='|S9')]
>>> c.view_metadata('sentences')[1]['sent_label']
'Vidi'
>>> c = c.apply_stoplist(['saw'])
>>> c.words
array(['I', 'came', 'conquered'],
      dtype='|S9')
apply_stoplist(stoplist=[], freq=0)

Takes a Corpus object and returns a copy of it with words in the stoplist removed and with words of frequency <= freq removed.

Parameters:
  • stoplist (list) – The list of words to be removed.
  • freq (integer, optional) – A threshold where words of frequency <= ‘freq’ are removed. Default is 0.
Returns:

Copy of corpus with words in the stoplist and words of frequency <= ‘freq’ removed.

See Also:

Corpus

static load(file)

Loads data that has been stored using save into a Corpus object.

Parameters:
  • file (str-like or file-like object) – Designates the file to read. If file is a string ending in .gz, the file is first gunzipped. See numpy.load for further details.
Returns:

A Corpus object storing the data found in file.

See Also:

Corpus, Corpus.save(), numpy.load()
save(file)

Saves data from a Corpus object as an npz file.

Parameters:
  • file (str-like or file-like object) – Designates the file to which to save data. See numpy.savez for further details.
Returns:

None

See Also:

Corpus, Corpus.load(), numpy.savez()
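Since save and load are documented in terms of numpy.savez and numpy.load, the underlying round trip can be sketched directly with NumPy. This sketch does not use vsm. The key names 'corpus' and 'words' are illustrative assumptions about what might be stored, and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

corpus = np.array([0, 1, 0, 2, 0, 3], dtype=np.int32)
words = np.array(['I', 'came', 'saw', 'conquered'])

# A file-like object stands in for a filename here.
buf = io.BytesIO()

# numpy.savez stores each keyword argument as a named array in one npz archive.
np.savez(buf, corpus=corpus, words=words)   # key names are illustrative
buf.seek(0)

# numpy.load on an npz archive returns a dict-like object keyed by array name.
with np.load(buf) as data:
    keys = sorted(data.files)
    restored_corpus = data['corpus']
    restored_words = data['words']
```

Both arrays come back with their original dtypes, which is what allows an object to be rebuilt from the archive.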
tolist(context_type, as_strings=False)

Returns Corpus object as a list of lists of either integers or strings, according to as_strings.

Parameters:
  • context_type (string) – The type of tokenization.
  • as_strings (Boolean, optional) – If True, string representations of words are returned. Otherwise, integer representations are returned. Default is False.
Returns:

List of lists
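The shape of the returned value can be illustrated without vsm. In this sketch the list of arrays stands in for a tokenized corpus, and the two comprehensions correspond to as_strings=False and as_strings=True respectively. The variable names are assumptions made for illustration.

```python
import numpy as np

words = np.array(['I', 'came', 'saw', 'conquered'])

# Stand-in for a corpus tokenized into three contexts.
contexts = [np.array([0, 1]), np.array([0, 2]), np.array([0, 3])]

# Integer representation: one plain Python list per context.
as_ints = [ctx.tolist() for ctx in contexts]

# String representation: look each integer up in `words` first.
as_strs = [words[ctx].tolist() for ctx in contexts]
```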

view_contexts(ctx_type, as_strings=False, as_slices=False, as_indices=False)

Displays a tokenization of the corpus.

Parameters:
  • ctx_type (string-like) – The type of a tokenization.
  • as_strings (Boolean, optional) – If True, string representations of words are returned. Otherwise, integer representations are returned. Default is False.
  • as_slices (Boolean, optional) – If True, a list of slices corresponding to ‘ctx_type’ is returned. Otherwise, integer representations are returned. Default is False.
Returns:

A tokenized view of corpus.

See Also:

Corpus, BaseCorpus
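The idea of viewing a tokenization as slices, without copying the corpus, can be sketched in plain NumPy. This is an illustration under assumed boundary values taken from the example on this page, not the vsm implementation.

```python
import numpy as np

corpus = np.array([0, 1, 0, 2, 0, 3], dtype=np.int32)
boundaries = [2, 4, 6]   # end index of each context

# Each context starts where the previous one ended.
starts = [0] + boundaries[:-1]
slices = [slice(a, b) for a, b in zip(starts, boundaries)]

# Basic slicing returns views onto the same buffer rather than copies.
views = [corpus[s] for s in slices]
shares = all(np.shares_memory(v, corpus) for v in views)
```

Because the views share memory with the original array, building a tokenization this way costs no additional storage for the token data itself.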
