Transforms a term-frequency model into a term-frequency inverse-document-frequency model.
A TF-IDF model is a term-frequency model whose rows, corresponding to word types, are scaled by IDF values. The idea is that a word type that occurs in most of the contexts (i.e., documents) does less to distinguish the contexts semantically than a word type that occurs in few of them. The document frequency of a word type is the number of documents in which it occurs divided by the total number of documents. The IDF is the log of the inverse of the document frequency.
As with a term-frequency model, word types correspond to matrix rows and contexts correspond to matrix columns.
The data structure is a sparse float matrix.
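The definition above can be illustrated with a minimal NumPy sketch. It uses a small dense array for readability rather than the sparse matrix the model stores, and it assumes the natural logarithm; the variable names and the sample counts are hypothetical and not part of vsm.

```python
import numpy as np

# Hypothetical term-frequency matrix: rows are word types, columns are documents.
tf = np.array([
    [2.0, 1.0, 3.0, 1.0],  # occurs in all 4 documents
    [0.0, 4.0, 0.0, 0.0],  # occurs in 1 document
    [1.0, 0.0, 2.0, 0.0],  # occurs in 2 documents
])

n_docs = tf.shape[1]

# Document frequency: fraction of documents in which each word type occurs.
df = np.count_nonzero(tf, axis=1) / n_docs

# IDF: log of the inverse of the document frequency (natural log assumed).
idf = np.log(1.0 / df)

# Scale each row (word type) by its IDF value.
tfidf = tf * idf[:, np.newaxis]
```

In this sketch the first word type occurs in every document, so its IDF is 0 and its whole row is scaled to zero, while the rarest word type receives the largest weight.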
See Also: | vsm.model.TfSeq, vsm.model.base, scipy.sparse.coo_matrix |
---|---|
Notes: | A zero in the matrix might arise in two ways: (1) the word type occurs in every document, in which case the document frequency is 1 and the IDF value is log 1 = 0; (2) the word type occurs in no document at all, in which case the IDF value is undefined. |
Methods
__init__([tf_matrix, dtype, context_type]) | Initialize TfIdf. |
load(f) | Takes a filename or file object and loads it as an npz archive into a BaseModel object. |
save(f) | Takes a filename or file object and saves self.matrix in an npz archive. |
train() | Computes the IDF values for the input term-frequency matrix, scales the rows by these values and stores the results in self.matrix. |
Initialize TfIdf.
Takes a filename or file object and loads it as an npz archive into a BaseModel object.
Parameters: | file (str-like or file-like object) – Designates the file to read. If file is a string ending in .gz, the file is first gunzipped. See numpy.load for further details. |
---|---|
Returns: | A dictionary storing the data found in file. |
See Also: | numpy.load() |
Takes a filename or file object and saves self.matrix in an npz archive.
Parameters: | file (str-like or file-like object) – Designates the file to which to save data. See numpy.savez for further details. |
---|---|
Returns: | None |
See Also: | numpy.savez() |
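As a rough illustration of the npz round trip that save and load describe, the sketch below writes the components of a small sparse matrix with numpy.savez and reads them back with numpy.load. The archive keys (row, col, data, shape) are assumptions made for this example; the layout that vsm actually writes is not documented here.

```python
import io

import numpy as np

# Hypothetical COO components of a 3 x 4 sparse float matrix.
row = np.array([0, 1, 2])
col = np.array([0, 1, 3])
data = np.array([0.5, 1.5, 2.5])
shape = np.array([3, 4])

# An in-memory buffer stands in for a filename or file object.
buf = io.BytesIO()
np.savez(buf, row=row, col=col, data=data, shape=shape)

# Read the archive back into a plain dictionary of arrays.
buf.seek(0)
with np.load(buf) as archive:
    loaded = {key: archive[key] for key in archive.files}
```

Copying the arrays into a dictionary before the archive is closed keeps them usable after the file handle is released.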
Computes the IDF values for the input term-frequency matrix, scales the rows by these values and stores the results in self.matrix.
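The row scaling described here can be sketched with scipy.sparse. This is not the vsm implementation: the helper name tfidf_sketch is hypothetical, the natural logarithm is assumed, and assigning an IDF of 0 to word types that occur in no document is a choice made for this example so that the undefined case does not produce NaN or infinity.

```python
import numpy as np
from scipy import sparse


def tfidf_sketch(tf_matrix):
    """Hypothetical helper: return a row-scaled COO copy of a sparse TF matrix."""
    tf = sparse.csr_matrix(tf_matrix, dtype=np.float64)
    tf.eliminate_zeros()  # make sure stored entries are true nonzeros
    n_docs = tf.shape[1]

    # Number of documents (columns) in which each word type (row) occurs.
    doc_counts = tf.getnnz(axis=1)

    # IDF = log(n_docs / doc_counts); rows with no occurrences keep IDF 0
    # (an assumption of this sketch, since the value is undefined).
    idf = np.zeros(tf.shape[0])
    seen = doc_counts > 0
    idf[seen] = np.log(n_docs / doc_counts[seen])

    # Left-multiplying by a diagonal matrix scales each row by its IDF value.
    return (sparse.diags(idf) @ tf).tocoo()


# Usage: 4 word types, 3 documents; the last word type never occurs.
rows = np.array([0, 0, 0, 1, 2, 2])
cols = np.array([0, 1, 2, 1, 0, 2])
vals = np.array([2.0, 1.0, 3.0, 4.0, 1.0, 2.0])
tf = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 3))
tfidf = tfidf_sketch(tf)
```

The first row (present in every document) and the last row (present in none) both come out as all zeros, which matches the two cases in which a zero can arise.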