vsm.corpus
[General documentation about the corpus submodule]
Classes
| Corpus(corpus[, context_types, ...]) | The goal of the Corpus class is to provide an efficient representation of a textual corpus. |
-
class vsm.corpus.Corpus(corpus, context_types=[], context_data=[], remove_empty=True)
The goal of the Corpus class is to provide an efficient representation of a textual corpus.
A Corpus object contains an integer representation of the text and
mappings that permit conversion between the integer and string
representations of a given word.
As a BaseCorpus object, it includes a dictionary of tokenizations
of the corpus and a method for viewing (without copying) these
tokenizations. This dictionary also stores metadata (e.g.,
document names) associated with the available tokenizations.
| Parameters: |
- corpus (array-like) – A string array representing the corpus as a sequence of
atomic words.
- context_data (list-like with 1-D integer array-like elements, optional) – Each element in context_data is an array containing
the indices that mark the token boundaries. An element of context_data is
intended for use as a value for the indices_or_sections
parameter of numpy.split (see the sketch after this parameter list).
Elements of context_data may also be
1-D arrays whose elements are pairs, where the first element
is a context boundary and the second element is metadata
associated with the context that ends at that boundary. For
example, (250, ‘dogs’) might indicate that the ‘article’ context
ending at the 250th word of the corpus is named ‘dogs’.
Default is an empty list.
- context_types (array-like, optional) – Each element of context_types names the type of the context
delimited by the corresponding element of context_data (e.g., ‘sentences’).
|
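To illustrate how the boundary indices in context_data are interpreted, the following plain-NumPy sketch (independent of Corpus itself) splits the six-word corpus from the Examples section at the sentence boundaries 2, 4, and 6; numpy.split also returns a trailing empty segment after the last boundary, which is filtered out here:
>>> import numpy as np
>>> words = np.array(['I', 'came', 'I', 'saw', 'I', 'conquered'])
>>> [seg.tolist() for seg in np.split(words, [2, 4, 6]) if seg.size]
[['I', 'came'], ['I', 'saw'], ['I', 'conquered']]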
| Attributes : |
- corpus (1-D 32-bit integer array)
The integer representation of the input corpus parameter: each word in the
input string array is replaced by its index in words.
- words (1-D string array)
The indexed set of strings occurring in corpus.
- words_int (dictionary mapping words to 32-bit integers)
A dictionary whose keys are words and whose values are their
corresponding integers (i.e., indices in words); see the round-trip
sketch after this list.
|
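To illustrate how these attributes relate, indexing words with the integer corpus recovers the original string sequence, and the word-to-integer dictionary maps a string back to its index. This is a minimal sketch using the corpus c as first constructed in the Examples section below; the attribute name words_int is assumed here and should be checked against the installed version:
>>> c.words[c.corpus].tolist()   # integer corpus back to strings
['I', 'came', 'I', 'saw', 'I', 'conquered']
>>> c.words_int['saw']           # string to its integer (attribute name assumed)
2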
| Methods : |
- view_metadata
Takes a type of tokenization and returns a view of the metadata
of the tokenization.
- view_contexts
Takes a type of tokenization and returns a view of the corpus tokenized
accordingly. The optional parameter as_strings takes a boolean value:
True to view string representations of words; False to view integer
representations of words. Default is False.
- save
Takes a filename and saves the data contained in a Corpus object to
an .npz file using numpy.savez.
- load
Static method. Takes a filename, loads the file data into a Corpus
object and returns the object.
- apply_stoplist
Takes a list of stopwords and returns a copy of the corpus with
the stopwords removed.
- tolist
Returns Corpus object as a list of lists of either integers or strings,
according to as_strings.
|
| See Also: | BaseCorpus
|
Examples
>>> text = ['I', 'came', 'I', 'saw', 'I', 'conquered']
>>> context_types = ['sentences']
>>> import numpy as np
>>> context_data = [np.array([(2, 'Veni'), (4, 'Vidi'), (6, 'Vici')],
...                          dtype=[('idx', '<i8'), ('sent_label', '|S6')])]
>>> from vsm.corpus import Corpus
>>> c = Corpus(text, context_types=context_types, context_data=context_data)
>>> c.corpus
array([0, 1, 0, 2, 0, 3], dtype=int32)
>>> c.words
array(['I', 'came', 'saw', 'conquered'],
dtype='|S9')
>>> c.view_contexts('sentences')
[array([0, 1], dtype=int32), array([0, 2], dtype=int32),
 array([0, 3], dtype=int32)]
>>> c.view_contexts('sentences', as_strings=True)
[array(['I', 'came'],
dtype='|S9'),
array(['I', 'saw'],
dtype='|S9'),
array(['I', 'conquered'],
dtype='|S9')]
>>> c.view_metadata('sentences')[1]['sent_label']
'Vidi'
>>> c = c.apply_stoplist(['saw'])
>>> c.words
array(['I', 'came', 'conquered'],
dtype='|S9')
-
apply_stoplist(stoplist=[], freq=0)
Takes a Corpus object and returns a copy of it with words in the
stoplist removed and with words of frequency <= freq removed.
| Parameters: |
- stoplist (list) – The list of words to be removed.
- freq (integer, optional) – A threshold where words of frequency <= ‘freq’ are
removed. Default is 0.
|
| Returns: | Copy of the corpus with words in the stoplist and words
of frequency <= ‘freq’ removed.
|
| See Also: | Corpus
|
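A brief sketch of the freq parameter, continuing with the corpus c from the Examples section after the stoplist ['saw'] has been applied: with freq=1, the words that occur only once ('came' and 'conquered') are removed, leaving only 'I'. The dtype shown in the output is assumed to match the earlier examples:
>>> c3 = c.apply_stoplist(freq=1)
>>> c3.words
array(['I'], dtype='|S9')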
-
static load(file)
Loads data into a Corpus object that has been stored using
save.
| Parameters: | file (str-like or file-like object) – Designates the file to read. If file is a string ending
in .gz, the file is first gunzipped. See numpy.load
for further details. |
| Returns: | A Corpus object storing the data found in file. |
| See Also: | Corpus, Corpus.save(), numpy.load() |
-
save(file)
Saves data from a Corpus object as an npz file.
| Parameters: | file (str-like or file-like object) – Designates the file to which to save data. See
numpy.savez for further details. |
| Returns: | None |
| See Also: | Corpus, Corpus.load(), numpy.savez() |
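A minimal save/load round trip, continuing with the example corpus; the filename corpus.npz is arbitrary and write access to the working directory is assumed:
>>> c.save('corpus.npz')
>>> c2 = Corpus.load('corpus.npz')
>>> bool((c2.corpus == c.corpus).all())
True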
-
tolist(context_type, as_strings=False)
Returns Corpus object as a list of lists of either integers or
strings, according to as_strings.
| Parameters: |
- context_type (string) – The type of tokenization.
- as_strings (Boolean, optional) – If True, string representations of words are returned.
Otherwise, integer representations are returned. Default
is False.
|
| Returns: | List of lists
|
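A hedged sketch with the example corpus as first constructed in the Examples section (before the stoplist is applied); the nesting mirrors view_contexts with as_strings=True, and the exact element types (plain strings are assumed here) may differ:
>>> c.tolist('sentences', as_strings=True)
[['I', 'came'], ['I', 'saw'], ['I', 'conquered']]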
-
view_contexts(ctx_type, as_strings=False, as_slices=False, as_indices=False)
Displays a tokenization of the corpus.
| Parameters: |
- ctx_type (string-like) – The type of a tokenization.
- as_strings (Boolean, optional) – If True, string representations of words are returned.
Otherwise, integer representations are returned. Default
is False.
- as_slices (Boolean, optional) – If True, a list of slices corresponding to ‘ctx_type’
is returned. Otherwise, integer representations are returned.
Default is False.
|
| Returns: | A tokenized view of corpus.
|
| See Also: | Corpus, BaseCorpus
|
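A sketch of the as_slices option with the example corpus as first constructed in the Examples section, assuming the returned slices can be used to index corpus directly (their exact repr is not shown):
>>> slices = c.view_contexts('sentences', as_slices=True)
>>> [c.corpus[s].tolist() for s in slices]   # each slice spans one sentence
[[0, 1], [0, 2], [0, 3]]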