vsm.model.TfIdf

class vsm.model.TfIdf(tf_matrix=array([], dtype=float64), context_type=None)

Transforms a term-frequency model into a term-frequency inverse-document-frequency model.

A TF-IDF model is a term-frequency model whose rows, corresponding to word types, are scaled by IDF values. The idea is that a word type that occurs in most of the contexts (i.e., documents) does less to distinguish the contexts semantically than a word type that occurs in few of them. The document frequency of a word type is the number of documents in which it occurs divided by the total number of documents. The IDF is the logarithm of the inverse of the document frequency.
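In symbols, the definitions above can be written as follows. The names N (total number of documents), n_w (number of documents in which word type w occurs), and tf(w, d) (term frequency of w in document d) are introduced here for illustration only, and the base of the logarithm is not specified on this page.

```latex
\mathrm{df}(w) = \frac{n_w}{N}, \qquad
\mathrm{idf}(w) = \log\frac{1}{\mathrm{df}(w)} = \log\frac{N}{n_w}, \qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d)\,\mathrm{idf}(w)
```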

As with a term-frequency model, word types correspond to matrix rows and contexts correspond to matrix columns.

The data structure is a sparse float matrix.

See Also: vsm.model.TfSeq, vsm.model.base, scipy.sparse.coo_matrix
Notes: A zero in the matrix might arise in two ways: (1) the word type occurs in every document, in which case the IDF value is 0; (2) the word type occurs in no document at all, in which case the IDF value is undefined.
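The two cases in the note can be checked with a small pure-Python sketch. It is not the vsm implementation: the toy words, the counts, the helper name idf, the natural logarithm, and the choice to return None for the undefined case are all assumptions made for illustration.

```python
import math

# Toy term-frequency data: one row per word type, one column per document.
# The words and counts are made up for illustration.
tf_rows = {
    "the":   [3, 1, 2, 4],  # occurs in every document
    "zebra": [0, 0, 0, 0],  # occurs in no document
    "model": [0, 2, 0, 0],  # occurs in one document
}
n_docs = 4


def idf(row, n_docs):
    """Return log(n_docs / n_occurrences), or None when the word occurs nowhere."""
    n_occurrences = sum(1 for count in row if count > 0)
    if n_occurrences == 0:
        return None  # document frequency is 0, so its inverse is undefined
    return math.log(n_docs / n_occurrences)


idf_values = {word: idf(row, n_docs) for word, row in tf_rows.items()}
```

Here "the" gets an IDF of 0 (case 1), so its whole row is scaled to zero, while "zebra" has no defined IDF (case 2) and its row is zero to begin with.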

Methods

__init__([tf_matrix, context_type]) Initialize TfIdf.
load(f) Takes a filename or file object and loads it as an npz archive
save(f) Takes a filename or file object and saves self.matrix in an npz archive.
train() Computes the IDF values for the input term-frequency matrix, scales the rows by these values and stores the results in self.matrix.
__init__(tf_matrix=array([], dtype=float64), context_type=None)

Initialize TfIdf.

Parameters:
  • tf_matrix (scipy.sparse matrix) – A matrix containing the term-frequency data.
  • context_type (string) – A string specifying the type of context over which the model trainer is applied.
static load(f)

Takes a filename or file object and loads it as an npz archive into a BaseModel object.

Parameters: f (str-like or file-like object) – Designates the file to read. If f is a string ending in .gz, the file is first gunzipped. See numpy.load for further details.
Returns: A dictionary storing the data found in f.
See Also: numpy.load()
save(f)

Takes a filename or file object and saves self.matrix in an npz archive.

Parameters: f (str-like or file-like object) – Designates the file to which to save data. See numpy.savez for further details.
Returns: None
See Also: numpy.savez()
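The npz round trip that load and save build on can be sketched with numpy alone. This does not reproduce the archive layout vsm actually writes: the array names row, col, data, and shape are hypothetical, chosen to resemble the components of a COO sparse matrix, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# COO-style components of a small 2 x 3 sparse matrix (made-up values).
row = np.array([0, 1, 1])
col = np.array([0, 1, 2])
data = np.array([0.5, 1.25, 2.0])
shape = np.array([2, 3])

# numpy.savez accepts a filename or a file-like object; keyword names become
# the array names inside the archive. These key names are hypothetical.
buffer = io.BytesIO()
np.savez(buffer, row=row, col=col, data=data, shape=shape)

# numpy.load returns a dictionary-like object keyed by array name.
buffer.seek(0)
with np.load(buffer) as archive:
    loaded = {name: archive[name] for name in archive.files}
```

Copying the arrays into a plain dict before the archive is closed keeps them usable after the with block ends.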
train()

Computes the IDF values for the input term-frequency matrix, scales the rows by these values and stores the results in self.matrix.
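The row scaling described above can be sketched in pure Python over (row, column, value) triples, which loosely mirror a COO layout. This is an illustration under stated assumptions, not the code train() runs: the helper scale_rows_by_idf is hypothetical, the natural logarithm is assumed, and each (row, column) pair is assumed to appear at most once.

```python
import math

# A sparse term-frequency matrix as (row, column, value) triples, in the
# spirit of a COO layout. Rows are word types, columns are documents.
tf_entries = [
    (0, 0, 2.0), (0, 1, 1.0), (0, 2, 3.0),  # word 0 occurs in all 3 documents
    (1, 1, 4.0),                            # word 1 occurs in 1 document
    (2, 0, 1.0), (2, 2, 1.0),               # word 2 occurs in 2 documents
]
n_docs = 3


def scale_rows_by_idf(entries, n_docs):
    """Scale each stored entry by the IDF of its row (hypothetical helper)."""
    # Count, per row, the documents with a nonzero term frequency.
    doc_counts = {}
    for r, _, value in entries:
        if value != 0:
            doc_counts[r] = doc_counts.get(r, 0) + 1
    # IDF is the log of the inverse of the document frequency.
    idf = {r: math.log(n_docs / n) for r, n in doc_counts.items()}
    return [(r, c, value * idf[r]) for r, c, value in entries if value != 0]


tfidf_entries = scale_rows_by_idf(tf_entries, n_docs)
```

Row 0 is scaled by log(3/3) = 0, so all of its entries become zero, while rows 1 and 2 are scaled by log(3) and log(3/2) respectively.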
