vsm.viewer.LdaCgsViewer

class vsm.viewer.LdaCgsViewer(corpus, model)

A class for viewing a topic model estimated by one of vsm’s LDA classes using CGS.

Methods

__init__(corpus, model) Initialize LdaCgsViewer.
dismat_doc([docs, dist_fn]) Calculates the distance matrix for a given list of documents.
dismat_top([topics, dist_fn]) Calculates the distance matrix for a given list of topics.
dist_doc_doc(doc_or_docs[, print_len, ...]) Computes and sorts the distances between a document or list of documents and every document in the topic space.
dist_top_doc(topic_or_topics[, weights, ...]) Takes a topic or list of topics (by integer index) and returns a list of documents sorted by distance.
dist_top_top(topic_or_topics[, weights, ...]) Takes a topic or list of topics (by integer index) and returns a list of topics sorted by the distances between a given topic and every topic.
dist_word_top(word_or_words[, weights, ...]) Sorts topics according to their distance to the query word_or_words.
doc_topics(doc_or_docs[, sort_by_entropy, ...]) Returns the distribution over topics for the given documents.
logp_plot([range, step, show, grid]) Returns a plot of log probabilities for the specified range of the MCMC chain used to fit a topic model by LDAGibbs.
topic_entropies([print_len]) Returns the entropies of the topics of the model as an array sorted by entropy.
topic_hist([topic_indices, d_indices, show]) Draws a histogram showing the proportion of topics within a set of documents specified by d_indices.
topics([print_len, topic_indices, ...]) Returns a list of topics estimated by the model.
word_topics(word[, as_strings]) Searches for every occurrence of word in the entire corpus and returns the document, position, and assigned topic of each occurrence.
__init__(corpus, model)

Initialize LdaCgsViewer.

Parameters:
  • corpus (Corpus) – Source of observed data.
  • model (LdaCgsSeq) – An LDA model estimated by a CGS.
dismat_doc(docs=[], dist_fn=JS_dist)

Calculates the distance matrix for a given list of documents.

Parameters:
  • docs (list, optional) – A list of documents whose distance matrix is to be computed. Default is all the documents in the model.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
Returns:

an instance of IndexedSymmArray. n x n matrix containing floats where n is the number of documents.

See Also:

vsm.viewer.wrapper.dismat_documents()
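
The computation described above can be sketched in plain Python. This is an illustration only, not vsm's implementation: js_dist and distance_matrix are hypothetical helpers, the toy doc_topic rows stand in for the model's document-topic distributions, and the use of base-2 logarithms and the square root of the Jensen-Shannon divergence is an assumption that may differ from vsm.spatial.JS_dist.

```python
import math


def kl_div(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_dist(p, q):
    """Square root of the Jensen-Shannon divergence (assumed convention)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


def distance_matrix(rows, dist_fn=js_dist):
    """Symmetric n x n matrix of pairwise distances between rows."""
    n = len(rows)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = mat[j][i] = dist_fn(rows[i], rows[j])
    return mat


# Toy document-topic distributions (hypothetical data, 3 documents x 3 topics).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
dm = distance_matrix(doc_topic)
```

Because the matrix is symmetric with a zero diagonal, only the upper triangle needs to be computed, which is presumably why the method returns a symmetric-array type.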

dismat_top(topics=[], dist_fn=JS_dist)

Calculates the distance matrix for a given list of topics.

Parameters:
  • topics (list, optional) – A list of topics whose distance matrix is to be computed. Default is all topics in the model.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
Returns:

an instance of IndexedSymmArray. n x n matrix containing floats where n is the number of topics considered.

See Also:

vsm.viewer.wrapper.dismat_top()

dist_doc_doc(doc_or_docs, print_len=10, filter_nan=True, label_fn=def_label_fn, as_strings=True, dist_fn=JS_dist, order='i')

Computes and sorts the distances between a document or list of documents and every document in the topic space.

Parameters:
  • doc_or_docs (string/integer or list of strings/integers) – Query document(s) relative to which distances are computed.
  • print_len (int, optional) – Number of documents printed by the pretty-printing function. Default is 10.
  • filter_nan (boolean, optional) – If True, NaN (not-a-number) entries are filtered out. Default is True.
  • label_fn (function, optional) – A function that defines how documents are represented. Default is def_label_fn, which retrieves the labels from corpus metadata.
  • as_strings (boolean, optional) – If True, returns a list of documents as strings rather than their integer representations. Default is True.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
  • order (string, optional) – Order of sorting. ‘i’ for increasing and ‘d’ for decreasing order. Default is ‘i’.
Returns:

an instance of LabeledColumn. A 2-dim array containing documents and their distances to doc_or_docs.

See Also:

vsm.viewer.wrapper.dist_doc_doc()
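
The roles of filter_nan and order can be illustrated with a small, self-contained sketch. It is not taken from vsm: sort_distances is a hypothetical helper, and the labels and distances are made-up values standing in for the output of a distance function.

```python
import math


def sort_distances(labels, distances, filter_nan=True, order='i'):
    """Pair labels with distances, optionally drop NaN entries, then sort.

    order='i' sorts in increasing order, order='d' in decreasing order.
    """
    pairs = list(zip(labels, distances))
    if filter_nan:
        pairs = [(lab, d) for lab, d in pairs if not math.isnan(d)]
    pairs.sort(key=lambda pair: pair[1], reverse=(order == 'd'))
    return pairs


# Hypothetical document labels and distances; 'b' has an undefined distance.
labels = ['a', 'b', 'c', 'd']
distances = [0.3, float('nan'), 0.1, 0.2]

nearest_first = sort_distances(labels, distances)
farthest_first = sort_distances(labels, distances, order='d')
```

Filtering before sorting matters because NaN compares as neither smaller nor larger than any number, so leaving such entries in would make the resulting order unreliable.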

dist_top_doc(topic_or_topics, weights=[], filter_words=[], print_len=10, as_strings=True, label_fn=def_label_fn, filter_nan=True, dist_fn=JS_dist, order='i')

Takes a topic or list of topics (by integer index) and returns a list of documents sorted by distance.

Parameters:
  • topic_or_topics (integer or list of integers) – Query topic(s) relative to which distances are computed.
  • weights (list of floating point, optional) – Specify weights for each topic in topic_or_topics. Default uses equal weights (i.e. arithmetic mean).
  • filter_words (list of words, optional) – The topics that include these words are considered. If not provided, by default all topics are considered.
  • print_len (int, optional) – Number of documents printed by the pretty-printing function. Default is 10.
  • as_strings (boolean, optional) – If True, returns a list of documents as strings rather than their integer representations. Default is True.
  • label_fn (function, optional) – A function that defines how documents are represented. Default is def_label_fn, which retrieves the labels from corpus metadata.
  • filter_nan (boolean, optional) – If True, NaN (not-a-number) entries are filtered out. Default is True.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
  • order (string, optional) – Order of sorting. ‘i’ for increasing and ‘d’ for decreasing order. Default is ‘i’.
Returns:

an instance of LabeledColumn. A 2-dim array containing documents and their distances to topic_or_topics.

See Also:

def_label_fn(), vsm.viewer.wrapper.dist_top_doc()

dist_top_top(topic_or_topics, weights=[], dist_fn=JS_dist, order='i', show_topics=True, print_len=10, filter_nan=True, as_strings=True, compact_view=True)

Takes a topic or list of topics (by integer index) and returns a list of topics sorted by the distances between a given topic and every topic.

Parameters:
  • topic_or_topics (integer or list of integers) – Query topic(s) to which distances are calculated.
  • weights (list of floating point, optional) – Specify weights for each topic in topic_or_topics. Default uses equal weights (i.e. arithmetic mean).
  • show_topics (boolean, optional) – If True, topics are represented by their number and distribution over words. Otherwise only topic numbers are shown. Default is True.
  • print_len (int, optional) – Number of topics to be shown. Default is 10.
  • filter_nan (boolean, optional) – If True, NaN (not-a-number) entries are filtered out. Default is True.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
  • order (string, optional) – Order of sorting. ‘i’ for increasing and ‘d’ for decreasing order. Default is ‘i’.
  • as_strings (boolean, optional) – If True, words of each topic are represented as strings. Otherwise they are represented by their integer representation. Default is True.
  • compact_view (boolean, optional) – If True, topics are simply represented as their top print_len number of words. Otherwise, topics are shown as words and their probabilities. Default is True.
Returns:

an instance of LabeledColumn. A 2-dim array containing topics and their distances to topic_or_topics.

See Also:

vsm.viewer.wrapper.dist_top_top()
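
One way to read the weights parameter is that several query topics are combined into a single weighted-mean distribution before distances are computed. The sketch below follows that reading and is an assumption, not vsm's code: weighted_mean and rank_topics are hypothetical helpers, weights are normalised by their sum, and a simple L1 distance stands in for JS_dist to keep the example short.

```python
def weighted_mean(rows, weights=None):
    """Combine distributions; with no weights this is the arithmetic mean."""
    if not weights:
        weights = [1.0] * len(rows)
    total = sum(weights)
    width = len(rows[0])
    return [sum(w * r[k] for w, r in zip(weights, rows)) / total
            for k in range(width)]


def l1_dist(p, q):
    """Stand-in distance function (vsm's default is JS_dist)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_topics(query, topics, dist_fn=l1_dist, order='i'):
    """Return (topic index, distance) pairs sorted by distance to query."""
    scored = [(k, dist_fn(query, t)) for k, t in enumerate(topics)]
    scored.sort(key=lambda pair: pair[1], reverse=(order == 'd'))
    return scored


# Toy topic-word distributions (hypothetical data, 4 topics x 4 words).
topics = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]

equal_query = weighted_mean([topics[0], topics[1]])
skewed_query = weighted_mean([topics[0], topics[1]], weights=[3.0, 1.0])
ranking = rank_topics(skewed_query, topics)
```

With weights [3.0, 1.0] the query leans toward topic 0, so topic 0 ranks first and topic 2, which shares the least mass with the query, ranks last.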

dist_word_top(word_or_words, weights=[], filter_nan=True, show_topics=True, print_len=10, as_strings=True, compact_view=True, dist_fn=JS_dist, order='i')

Sorts topics according to their distance to the query word_or_words.

A pseudo-topic is constructed from word_or_words as follows. If weights are not provided, the word list is represented in the space of topics as a topic which assigns equal non-zero probability to each word in word_or_words and 0 to every other word in the corpus. Otherwise, each word in word_or_words is assigned the provided weight.

Parameters:
  • word_or_words (string or list of strings) – word(s) to which distances are calculated
  • weights (list of floating point, optional) – Specify weights for each query word in word_or_words. Default uses equal weights.
  • filter_nan (boolean, optional) – If True, NaN (not-a-number) entries are filtered out. Default is True.
  • show_topics (boolean, optional) – If True, topics are represented by their number and distribution over words. Otherwise, only topic numbers are shown. Default is True.
  • print_len (int, optional) – Number of words printed by pretty-printing function. Default is 10.
  • as_strings (boolean, optional) – If True, words of each topic are represented as strings. Otherwise they are represented by their integer representation. Default is True.
  • compact_view (boolean, optional) – If True, topics are simply represented as their top print_len number of words. Otherwise, topics are shown as words and their probabilities. Default is True.
  • dist_fn (function, optional) – A distance function from vsm.spatial. Default is JS_dist().
  • order (string, optional) – Order of sorting. ‘i’ for increasing and ‘d’ for decreasing order. Default is ‘i’.
Returns:

an instance of LabeledColumn. A structured array of topics sorted by their distances to word_or_words.

See Also:

vsm.viewer.wrapper.dist_word_top()
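
The pseudo-topic construction described for this method can be sketched as follows. This is a minimal illustration under stated assumptions rather than vsm's implementation: pseudo_topic is a hypothetical helper and vocab is a made-up four-word vocabulary standing in for the corpus vocabulary.

```python
def pseudo_topic(word_or_words, vocab, weights=None):
    """Build a distribution over vocab from one or more query words.

    Without weights, each query word gets equal probability and every
    other word gets 0. With weights, each query word gets its weight.
    """
    words = [word_or_words] if isinstance(word_or_words, str) else list(word_or_words)
    if not weights:
        weights = [1.0 / len(words)] * len(words)
    index = {w: i for i, w in enumerate(vocab)}
    topic = [0.0] * len(vocab)
    for word, weight in zip(words, weights):
        topic[index[word]] = weight
    return topic


# Hypothetical vocabulary.
vocab = ['cat', 'dog', 'fish', 'bird']

single = pseudo_topic('fish', vocab)
uniform = pseudo_topic(['cat', 'dog'], vocab)
weighted = pseudo_topic(['cat', 'dog'], vocab, weights=[0.9, 0.1])
```

The resulting vector lives in the same space as the model's topic-word distributions, so it can be compared with each topic using the chosen distance function.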

doc_topics(doc_or_docs, sort_by_entropy=False, compact_view=False, aggregate=False, print_len=10)

Returns the distribution over topics for the given documents.

Parameters:
  • doc_or_docs (int or string, or a list of these) – Specifies the document(s) whose distribution over topics is returned. Each document can be given either by its ID number (integer) or by its name (string).
  • print_len (int, optional) – Number of topics to be printed. Default is 10.
Returns:

an instance of LabeledColumn or of DataTable. A structured array of topics (represented by their number) and their corresponding probabilities, or a list of such arrays.

logp_plot(range=[], step=1, show=True, grid=True)

Returns a plot of log probabilities for the specified range of the MCMC chain used to fit a topic model by LDAGibbs. The function requires the matplotlib package.

Parameters:
  • range (list of integers, optional) – Specifies the range of the MCMC chain whose log probabilities are to be plotted. For example, range = [0, 999] plots log probabilities from the 1st to the 1000th iteration. The list must contain exactly two elements, and the first must be smaller than the second, which cannot exceed the total length of the MCMC chain. Default produces the plot for the entire chain.
  • step (int, optional) – Steps by which points are plotted. Default is 1.
  • show (boolean, optional) – If True, the function actually draws the plot in addition to returning a plot object. Default is True.
  • grid (boolean, optional) – If True, draws a grid. Default is True.
Returns:

an instance of matplotlib.pyplot object. Contains the log probability plot.
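
How range and step select points from the chain can be sketched without any plotting. The helper below is hypothetical and rests on one assumption drawn from the example in the parameter description: both endpoints are zero-based and inclusive, so [0, 999] covers 1000 iterations. The chain values are made up.

```python
def select_iterations(log_probs, rng=None, step=1):
    """Return (iteration indices, log probabilities) chosen by rng and step.

    rng is [first, last] with both ends inclusive (an assumption); an empty
    or missing rng selects the entire chain.
    """
    if not rng:
        first, last = 0, len(log_probs) - 1
    else:
        if len(rng) != 2:
            raise ValueError("range must contain exactly two elements")
        first, last = rng
        if not 0 <= first < last <= len(log_probs) - 1:
            raise ValueError("range must be increasing and within the chain")
    xs = list(range(first, last + 1, step))
    ys = [log_probs[i] for i in xs]
    return xs, ys


# Hypothetical log probabilities for a 10-iteration chain.
chain = [-100.0, -90.0, -85.0, -82.0, -80.5, -80.0, -79.8, -79.7, -79.6, -79.6]

all_x, all_y = select_iterations(chain)
sub_x, sub_y = select_iterations(chain, rng=[2, 6], step=2)
```

The two returned lists are what a plotting call would take as x and y values; a larger step thins the points, which helps with long chains.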

topic_entropies(print_len=10)

Returns the entropies of the topics of the model as an array sorted by entropy.
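
As a rough illustration of what is being computed, the sketch below calculates the Shannon entropy of each topic's word distribution and sorts the topics by it. It is not vsm's code: the helper names and toy topics are hypothetical, and the base-2 logarithm and ascending sort order are assumptions.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def sorted_topic_entropies(topics):
    """Return (topic index, entropy) pairs sorted by increasing entropy."""
    pairs = [(k, entropy(t)) for k, t in enumerate(topics)]
    return sorted(pairs, key=lambda pair: pair[1])


# Toy topic-word distributions (hypothetical data).
topics = [
    [0.25, 0.25, 0.25, 0.25],  # spread evenly over four words
    [1.0, 0.0, 0.0, 0.0],      # concentrated on one word
    [0.5, 0.5, 0.0, 0.0],      # split between two words
]
ranked = sorted_topic_entropies(topics)
```

Low-entropy topics concentrate their probability on a few words, while high-entropy topics spread it widely, so the sorted list gives a quick view of how focused each topic is.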

topic_hist(topic_indices=None, d_indices=[], show=True)

Draws a histogram showing the proportion of topics within a set of documents specified by d_indices.

Parameters:
  • topic_indices – Specifies the topics for which proportions are calculated. Default is all topics.
  • d_indices (list, optional) – Specifies the documents for which topic proportions are calculated. Default is all documents.
  • show (boolean, optional) – If True, shows the plot. Default is True.
Returns:

an instance of matplotlib.pyplot object. Contains the topic proportion histogram.
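
The quantity behind the histogram can be sketched without matplotlib. The helper below is hypothetical and assumes that the proportion of a topic within a set of documents is the unweighted mean of that topic's probability across those documents; vsm may weight documents differently, for example by length. The doc_topic rows are made-up data.

```python
def topic_proportions(doc_topic, topic_indices=None, d_indices=None):
    """Mean probability of each selected topic over the selected documents."""
    docs = list(d_indices) if d_indices else list(range(len(doc_topic)))
    if topic_indices is None:
        topic_indices = range(len(doc_topic[0]))
    return {k: sum(doc_topic[d][k] for d in docs) / len(docs)
            for k in topic_indices}


# Toy document-topic distributions (hypothetical data, 3 documents x 3 topics).
doc_topic = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
    [0.0, 0.0, 1.0],
]

overall = topic_proportions(doc_topic)
subset = topic_proportions(doc_topic, topic_indices=[0, 2], d_indices=[0, 1])
```

Each value in the returned mapping would correspond to the height of one bar in the histogram.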

topics(print_len=10, topic_indices=None, sort_by_entropy=False, as_strings=True, compact_view=False)

Returns a list of topics estimated by the model. Each topic is represented by a list of words and the corresponding probabilities.

Parameters:
  • topic_indices (list of integers) – List of indices of topics to be displayed. Default is all topics.
  • sort_by_entropy (boolean, optional) – Sorts topics by entropies. Default is False.
  • print_len (int, optional) – Number of words shown for each topic. Default is 10.
  • as_strings (boolean, optional) – If True, each topic displays words rather than their integer representations. Default is True.
  • compact_view (boolean, optional) – If True, topics are simply represented as their top print_len number of words. Otherwise, topics are shown as words and their probabilities. Default is False.
Returns:

an instance of DataTable. A structured array of topics.

word_topics(word, as_strings=True)

Searches for every occurrence of word in the entire corpus and returns a list in which each row contains the name or ID number of the document, the relative position in the document, and the assigned topic number for that occurrence of word.

Parameters:
  • word (string) – The word for which the search is performed.
  • as_strings (boolean, optional) – If True, returns document names rather than ID numbers. Default is True.
Returns:

an instance of LabeledColumn. A structured array consisting of three columns. Each column is a list of: (1) name/ID of document containing word (2) relative position of word in the document (3) Topic number assigned to the token.
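
The search described above can be sketched on toy data. This is an illustration under stated assumptions, not vsm's implementation: find_word_topics is a hypothetical helper, documents are plain token lists, and assignments holds one made-up topic number per token in the same layout.

```python
def find_word_topics(word, docs, assignments, labels=None):
    """Return (document, position, topic) for every occurrence of word.

    docs        -- list of token lists, one per document
    assignments -- list of topic-number lists, parallel to docs
    labels      -- optional document names; integer IDs are used otherwise
    """
    rows = []
    for doc_id, (tokens, topics) in enumerate(zip(docs, assignments)):
        name = labels[doc_id] if labels else doc_id
        for position, (token, topic) in enumerate(zip(tokens, topics)):
            if token == word:
                rows.append((name, position, topic))
    return rows


# Hypothetical corpus of two documents with per-token topic assignments.
docs = [
    ['the', 'bank', 'of', 'the', 'river'],
    ['the', 'bank', 'raised', 'rates'],
]
assignments = [
    [0, 2, 0, 0, 2],
    [0, 1, 1, 1],
]

by_id = find_word_topics('bank', docs, assignments)
by_name = find_word_topics('bank', docs, assignments, labels=['geo.txt', 'fin.txt'])
```

In this toy corpus the same word receives different topic numbers in the two documents, which is the kind of contrast this method makes visible.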
