[General documentation about the corpus submodule]
Classes
Corpus(corpus[, context_types, ...]) | The goal of the Corpus class is to provide an efficient representation of a textual corpus. |
The goal of the Corpus class is to provide an efficient representation of a textual corpus.
A Corpus object contains an integer representation of the text and maps to permit conversion between integer and string representations of a given word.
As a BaseCorpus object, it includes a dictionary of tokenizations of the corpus and a method for viewing (without copying) these tokenizations. This dictionary also stores metadata (e.g., document names) associated with the available tokenizations.
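The integer encoding described above can be illustrated without the library. What follows is a minimal pure-Python sketch, not the implementation used by Corpus: the helper `encode_tokens` is hypothetical, and it assumes that integers are assigned in order of first occurrence.

```python
def encode_tokens(tokens):
    """Hypothetical helper: map each distinct word to an integer.

    Integers are assigned in order of first occurrence (an assumption
    made for this sketch).
    """
    words = []        # integer -> string
    words_int = {}    # string -> integer
    corpus = []       # integer representation of the text
    for token in tokens:
        if token not in words_int:
            words_int[token] = len(words)
            words.append(token)
        corpus.append(words_int[token])
    return corpus, words, words_int


text = ['I', 'came', 'I', 'saw', 'I', 'conquered']
corpus, words, words_int = encode_tokens(text)

# Decoding recovers the original tokens from the integer form.
decoded = [words[i] for i in corpus]
```

Storing the text once as integers and keeping the two maps alongside it means each distinct string is held only once, while conversion in either direction stays a simple lookup.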
Parameters: |
|
---|---|
Attributes: |
|
Methods: |
|
See Also: | BaseCorpus |
Examples
>>> import numpy as np
>>> text = ['I', 'came', 'I', 'saw', 'I', 'conquered']
>>> context_types = ['sentences']
>>> context_data = [np.array([(2, 'Veni'), (4, 'Vidi'), (6, 'Vici')],
...                          dtype=[('idx', '<i8'), ('sent_label', '|S6')])]
>>> from vsm.corpus import Corpus
>>> c = Corpus(text, context_types=context_types, context_data=context_data)
>>> c.corpus
array([0, 1, 0, 2, 0, 3], dtype=int32)
>>> c.words
array(['I', 'came', 'saw', 'conquered'],
      dtype='|S9')
>>> c.words_int['saw']
2
>>> c.view_contexts('sentences')
[array([0, 1], dtype=int32), array([0, 2], dtype=int32),
 array([0, 3], dtype=int32)]
>>> c.view_contexts('sentences', as_strings=True)
[array(['I', 'came'],
      dtype='|S9'),
 array(['I', 'saw'],
      dtype='|S9'),
 array(['I', 'conquered'],
      dtype='|S9')]
>>> c.view_metadata('sentences')[1]['sent_label']
'Vidi'
>>> c = c.apply_stoplist(['saw'])
>>> c.words
array(['I', 'came', 'conquered'],
      dtype='|S9')
Returns a copy of the Corpus object from which the words in the stoplist and the words of frequency <= freq have been removed.
Parameters: |
|
---|---|
Returns: | Copy of corpus with words in the stoplist and words of frequency <= ‘freq’ removed. |
See Also: |
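As an illustration of the filtering rule described above, the sketch below removes stoplisted words and words whose frequency is at most `freq` from a plain list of tokens. The function `filter_tokens` and its default values are hypothetical. It works on Python lists rather than on a Corpus object and, in particular, does not adjust context boundaries or metadata, which the real method presumably also has to handle.

```python
from collections import Counter


def filter_tokens(tokens, stoplist=(), freq=0):
    """Hypothetical helper: drop stoplisted and low-frequency words.

    A word is removed if it appears in `stoplist` or if it occurs
    `freq` times or fewer in `tokens`.
    """
    counts = Counter(tokens)
    stop = set(stoplist)
    return [t for t in tokens if t not in stop and counts[t] > freq]


text = ['I', 'came', 'I', 'saw', 'I', 'conquered']

# Remove one stoplisted word and keep everything else.
no_saw = filter_tokens(text, stoplist=['saw'])

# Remove every word that occurs only once.
frequent_only = filter_tokens(text, freq=1)
```

Like the method it illustrates, the helper returns a new sequence and leaves its input untouched.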
Loads data into a Corpus object that has been stored using save.
Parameters: | file (str-like or file-like object) – Designates the file to read. If file is a string ending in .gz, the file is first gunzipped. See numpy.load for further details. |
---|---|
Returns: | A Corpus object storing the data found in file. |
See Also: | Corpus, Corpus.save(), numpy.load() |
Saves data from a Corpus object as an npz file.
Parameters: | file (str-like or file-like object) – Designates the file to which to save data. See numpy.savez for further details. |
---|---|
Returns: | None |
See Also: | Corpus, Corpus.load(), numpy.savez() |
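Since both the save and load descriptions defer to numpy.savez and numpy.load, a round trip through the npz format can be sketched with NumPy alone. The archive keys `corpus` and `words` are assumptions chosen for illustration; the keys that Corpus.save() actually writes are not listed on this page. An in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

corpus = np.array([0, 1, 0, 2, 0, 3], dtype=np.int32)
words = np.array(['I', 'came', 'saw', 'conquered'])

# numpy.savez accepts a file name or a file-like object and stores
# each keyword argument as a named array in the archive.
buffer = io.BytesIO()
np.savez(buffer, corpus=corpus, words=words)

# Rewind the buffer and read the arrays back by name.
buffer.seek(0)
with np.load(buffer) as archive:
    loaded_corpus = archive['corpus']
    loaded_words = archive['words']
```

When a file name is passed instead of a buffer, numpy.savez appends the `.npz` extension if it is missing.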
Returns the Corpus object as a list of lists of either integers or strings, according to as_strings.
Parameters: |
|
---|---|
Returns: | List of lists |
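To make the shape of the return value concrete, the sketch below builds such a list of lists from an integer corpus and a list of context end positions. `contexts_to_list` and its positional parameters are hypothetical names for this example; only the `as_strings` switch comes from the description above.

```python
def contexts_to_list(corpus, words, ends, as_strings=False):
    """Hypothetical helper: split `corpus` at the given end positions.

    `ends` holds the (exclusive) end index of each context, in order.
    """
    result = []
    start = 0
    for end in ends:
        chunk = corpus[start:end]
        if as_strings:
            # Map each integer back to its string form.
            chunk = [words[i] for i in chunk]
        result.append(list(chunk))
        start = end
    return result


corpus = [0, 1, 0, 2, 0, 3]
words = ['I', 'came', 'saw', 'conquered']
ends = [2, 4, 6]

as_ints = contexts_to_list(corpus, words, ends)
as_text = contexts_to_list(corpus, words, ends, as_strings=True)
```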
Displays a tokenization of the corpus.
Parameters: |
|
---|---|
Returns: | A tokenized view of the corpus. |
See Also: | Corpus, BaseCorpus |
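The class description mentions viewing tokenizations without copying. One way to obtain such views with NumPy is numpy.split, which returns sub-arrays that share memory with the original array. The following is a sketch of that idea, not necessarily how view_contexts is implemented; the variable names are illustrative.

```python
import numpy as np

corpus = np.array([0, 1, 0, 2, 0, 3], dtype=np.int32)
ends = np.array([2, 4, 6])  # end position of each sentence

# Split at every boundary except the last one, which is the end of
# the corpus and would otherwise produce a trailing empty array.
views = np.split(corpus, ends[:-1])

# Each piece is a view: it shares memory with `corpus`.
shares = [np.shares_memory(v, corpus) for v in views]
```

Because the pieces are views, producing a tokenization costs little extra memory, but writing to a piece would also modify the underlying corpus array.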