#Text_mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality #information from #text.


@text_mining tasks include #text_categorization, #text_clustering, #concept_extraction, production of #granular_taxonomies, #sentiment_analysis, and #document_summarization.


@sentiment_analysis #wordnet #conceptnet


#text_mining #nlp


#lsa is a technique in #nlp


#lsa assumes that #words that are close in meaning will occur in similar pieces of #text


#lsa can use a #term_document_matrix which describes the occurrences of #terms in #documents; it is a #sparse_matrix whose rows correspond to #terms and whose columns correspond to #documents.


A typical example of the weighting of the elements of the #matrix in #lsa is #tf_idf (term #frequency–inverse document #frequency): the weight of an element of the #matrix is proportional to the number of times each #term appears in a given document, while rare #terms are upweighted to reflect their relative importance.
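
A minimal sketch of building such a matrix, assuming a tiny hypothetical corpus and scikit-learn (corpus and variable names are illustrative, not from the source):

```python
# TfidfVectorizer builds the sparse tf-idf-weighted matrix with documents
# as rows and terms as columns; the transpose gives the term-document
# orientation described above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]  # toy corpus
vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # sparse (documents x terms)

term_doc = X.T                         # terms as rows, documents as columns
print(term_doc.shape)
print(vec.get_feature_names_out())     # the terms behind the rows
```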


#lsa can be used to #analyze #word_association in #text_corpus


#lsa has been used to assist in performing #prior_art searches for #patents.


The use of #lsa has been prevalent in the study of human #memory, especially in areas of #free_recall and #memory_search.


#lsi is an #indexing and #retrieval #method


#lsi is #lsa


explicit semantic analysis (@esa) is a #vectoral_representation of #text (individual words or entire #documents) that uses a #document_corpus as a knowledge base. Specifically, in #esa, a word is represented as a column vector in the #tf_idf matrix of the #text corpus and a #document (string of words) is represented as the centroid of the vectors representing its words.
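
A minimal sketch of the @esa construction under the same assumptions (a tiny hypothetical corpus stands in for the knowledge base):

```python
# Each word is its tf-idf column (one weight per corpus document/concept);
# a document is the centroid of its words' vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus).toarray()   # (documents/concepts x terms)
vocab = vec.vocabulary_                   # term -> column index

def word_vector(word):
    return X[:, vocab[word]]              # tf-idf weights across documents

def document_vector(text):
    words = [w for w in text.lower().split() if w in vocab]
    return np.mean([word_vector(w) for w in words], axis=0)

print(document_vector("cats eat cheese"))
```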


#document_corpus #document


#documents #document


@topic_model is a type of #statistical_model for discovering the abstract "#topics" that occur in a collection of #documents.


An early #topic_model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.[2] Another one, called probabilistic latent semantic analysis (#plsa), was created by Thomas Hofmann in 1999.[3] Latent Dirichlet allocation (#lda), perhaps the most common #topic_model currently in use, is a generalization of #plsa


#plsa #lsa


#lda introduces sparse #dirichlet prior distributions over #document_topic and #topic_word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words.
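
A minimal scikit-learn sketch, assuming a toy corpus; doc_topic_prior and topic_word_prior are the Dirichlet concentration parameters, and the small values used here (illustrative, not canonical) encourage the sparsity described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats chase mice", "dogs chase cats",
        "stocks rose sharply", "markets fell on the news"]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,    # sparse Dirichlet prior over document-topic mix
    topic_word_prior=0.1,   # sparse Dirichlet prior over topic-word dist
    random_state=0,
).fit(counts)

print(lda.transform(counts))   # per-document topic proportions
```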


#topic_models are generally extensions of #lda, such as Pachinko allocation, which improves on #lda by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics


#tf_idf or #tfidf, short for term #frequency–inverse #document #frequency, is a numerical #statistic that is intended to reflect how important a #word is to a #document in a collection or #corpus.
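
A minimal hand-rolled sketch of one common #tf_idf variant (libraries such as scikit-learn add smoothing and normalization, so exact numbers differ):

```python
import math

def tf_idf(term, doc, docs):
    tf = doc.count(term)                   # raw count of the term in the doc
    df = sum(term in d for d in docs)      # number of docs containing it
    return tf * math.log(len(docs) / df)   # rare terms get higher weight

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("cat", docs[0], docs))   # > 0: "cat" is somewhat distinctive
print(tf_idf("the", docs[0], docs))   # 0.0: "the" occurs in every document
```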


83% of text-based #recommender systems in digital libraries use #tf_idf


latent Dirichlet allocation (#lda) is a generative statistical #model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.


In #lda, each #document may be viewed as a mixture of various #topics where each #document is considered to have a set of #topics that are assigned to it via #lda. This is identical to probabilistic latent semantic analysis (#plsa), except that in #lda the #topic distribution is assumed to have a sparse #dirichlet prior.
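
A minimal gensim sketch (toy corpus, illustrative names) that reads off the inferred topic mixture of one document:

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["cats", "chase", "mice"], ["dogs", "chase", "cats"],
         ["stocks", "rose"], ["markets", "fell"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
print(lda.get_document_topics(bow[0]))   # the document's topic mixture
```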


#lda is a generalization of the #plsa model, which is equivalent to #lda under a uniform #dirichlet prior #distribution.


#lda and #lsa study commonalities between different #words (how often they are used together) to identify #topics that go together. This data can then be used to extract the right set of #documents for a #search #query or to see which #words are related to a #search #query
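
A minimal retrieval sketch under the same toy-corpus assumptions: documents and a query are projected into a shared #lsa topic space and ranked by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["lsa maps words to latent topics", "dogs and cats are pets",
        "topic models find themes in documents"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)               # documents in topic space

query = svd.transform(vec.transform(["which model finds topics"]))
scores = cosine_similarity(query, doc_topics)[0]
print(scores.argsort()[::-1])                   # document ranking for query
```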


pachinko allocation model (#pam) is a #topic model. Topic #models are a suite of algorithms to uncover the hidden thematic structure of a #collection of #documents. [1] The algorithm improves upon earlier #topic #models such as latent Dirichlet allocation (#lda) by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics.


#dirichlet process is a #probability #distribution whose range is itself a set of #probability #distributions
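
A minimal stick-breaking sketch (truncated to finitely many atoms for illustration) showing that one draw from a #dirichlet process is itself a discrete #probability #distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_atoms = 1.0, 20     # concentration parameter; truncation level

betas = rng.beta(1, alpha, size=n_atoms)          # stick-breaking fractions
remaining = np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
weights = betas * remaining                       # mixture weights
atoms = rng.normal(0, 1, size=n_atoms)            # atoms from a N(0, 1) base

print(weights.round(3))   # a random discrete distribution over the atoms
```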


#n_gram is a contiguous #sequence of n items from a given sample of #text or #speech.


An #n_gram #model is a type of probabilistic language #model for predicting the next item in such a #sequence in the form of an (n − 1)-order #markov #model
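
A minimal bigram (n = 2) sketch on a toy text: a first-order #markov #model that predicts the next word from the previous one:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()
bigrams = zip(corpus, corpus[1:])        # contiguous 2-item sequences

counts = defaultdict(Counter)
for prev, nxt in bigrams:
    counts[prev][nxt] += 1

def predict_next(word):
    # Most frequent follower of `word` in the training text.
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat' (follows 'the' twice, 'mat' once)
```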


#rst is used in #text #summarization and other applications. #rst addresses #text organization by means of #relations that hold between parts of #text. It explains coherence by postulating a hierarchical, connected #structure of #texts


#rhetorical_structure_theory #rst


#semantic_folding theory describes a procedure for encoding the #semantics of natural #language #text in a semantically grounded #binary #representation.


#semantic_compression is a process of compacting a #lexicon used to build a textual #document


#tf_idf with #k_means #clustering


#k_means returns #cluster centroids as "#topics" and #lda assigns #words to the different #topics
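
A minimal sketch of this pairing, assuming a toy corpus: #k_means over #tf_idf vectors, reading each centroid's top-weighted terms as that cluster's "topic":

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats chase mice", "dogs chase cats",
        "stocks rose sharply", "markets fell on the news"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]          # highest-weight terms
    print(f"cluster {i}:", [terms[j] for j in top])
```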


#topic_modeling through 4 of the most popular techniques today: #lsa, #plsa, #lda, and the newer, #deep_learning based #lda2vec. https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05


for #lsa we need a #corpus of #documents to analyze which #words belong to which #documents, and also to down-weight the #words that appear too frequently across all the #texts


This #dimensionality_reduction can be performed using truncated #svd, which factors the #document #term #matrix into a #document #topic #matrix and a #term #topic #matrix
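
A minimal shape-level sketch of that factorization with scikit-learn's TruncatedSVD (toy corpus; names illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "dogs chase cats",
        "stocks rose sharply", "markets fell on the news"]
X = TfidfVectorizer().fit_transform(docs)     # (4 documents x n terms)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd.fit_transform(X)              # (4 documents x 2 topics)
term_topic = svd.components_.T                # (n terms x 2 topics)
print(doc_topic.shape, term_topic.shape)
```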


#lda is a #bayesian version of #plsa.


At the #word level, we typically use something like #word2vec to obtain #vector representations.
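
A minimal gensim #word2vec sketch (gensim ≥ 4 API; the corpus is a tiny toy, so the vectors are illustrative rather than meaningful):

```python
from gensim.models import Word2Vec

sentences = [["cats", "chase", "mice"],
             ["dogs", "chase", "cats"],
             ["mice", "eat", "cheese"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)

print(model.wv["cats"][:5])              # first few dimensions of a vector
print(model.wv.most_similar("cats"))     # nearest neighbours in the space
```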


#lda2vec is an extension of #word2vec and #lda that jointly learns #word, #document, and #topic_vectors
