#Text_mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality #information from #text.

15318316694200000

@text_mining tasks include #text_categorization, #text_clustering, #concept_extraction, production of #granular_taxonomies, #sentiment_analysis, and #document_summarization.

15318365723260000

@sentiment_analysis #wordnet #conceptnet

15318366634190000

#text_mining #nlp

15318367076060000

#lsa is a technique in #nlp

15318367236120000

#LSA assumes that #words that are close in meaning will occur in similar pieces of #text

15318367507260000

#lsa can use a #term_document_matrix which describes the occurrences of #terms in #documents; it is a #sparse_matrix whose rows correspond to #terms and whose columns correspond to #documents.

15320042189630000
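
A minimal sketch of a #term_document_matrix, assuming whitespace tokenization and raw counts as the cell values. The function name `term_document_matrix` and the sample sentences are made up for illustration; a real pipeline would use a sparse format, since most cells are zero.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    # Rows correspond to terms, columns correspond to documents.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:>7} {row}")
```

Each printed row is one term and each column one document, matching the orientation described in the note.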

A typical example of the weighting of the elements of the #matrix in #lsa is #tf_idf (term #frequency–inverse document #frequency): the weight of an element of the #matrix is proportional to the number of times the #terms appear in each document, where rare #terms are upweighted to reflect their relative importance.

15320043157840000
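
A rough sketch of #tf_idf weighting, assuming one common variant: raw term count times log(N / document frequency). Other variants (smoothed idf, normalized tf) exist, and the function name `tf_idf` is only illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        # Assumed variant: raw count * log(N / df). Rare terms get a larger idf.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
weights = tf_idf(docs)
print(weights[2])
```

Under this variant a term that occurs in every document (here "the") gets weight zero, while a term unique to one document (here "chased") gets the largest idf.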

#lsa can be used to #analyze #word_association in #text_corpus

15320043860300000

#lsa has been used to assist in performing #prior_art searches for #patents.

15320045590930000

The use of #lsa has been prevalent in the study of human #memory, especially in areas of #free_recall and #memory_search.

15320050137720000

#lsi is an #indexing and #retrieval #method

15320050868150000

#lsi is #lsa

15320051333970000

explicit semantic analysis (@esa) is a #vectoral_representation of #text (individual words or entire #documents) that uses a #document_corpus as a knowledge base. Specifically, in esa, a word is represented as a column vector in the #tf_idf matrix of the #text corpus and a #document (string of words) is represented as the centroid of the vectors representing its words.

15320056184080000
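
A small sketch of the centroid idea in @esa. The `word_vectors` table is hypothetical: in a real setup each vector would be a word's column of the #tf_idf matrix built over the knowledge-base corpus, with one entry per document. Cosine similarity is used here as an assumed way to compare the resulting vectors.

```python
import math

# Hypothetical word vectors: weights of each word over three knowledge-base
# documents. The numbers are invented for illustration only.
word_vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "dog":    [0.8, 0.2, 0.0],
    "stock":  [0.0, 0.1, 0.9],
    "market": [0.0, 0.3, 0.7],
}

def centroid(words):
    """Represent a text as the mean of the vectors of its known words."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no known words in text")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

pets = centroid(["cat", "dog"])
finance = centroid(["stock", "market"])
print(cosine(pets, centroid(["cat"])), cosine(pets, finance))
```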

#document_corpus #document

15320056424310000

#documents #document

15320056482640000

@topic_model is a type of #statistical_model for discovering the abstract "#topics" that occur in a collection of #documents.

15320057777540000

An early #topic_model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.[2] Another one, called probabilistic latent semantic analysis (#plsa), was created by Thomas Hofmann in 1999.[3] Latent Dirichlet allocation (#lda), perhaps the most common #topic_model currently in use, is a generalization of #plsa.

15320058896010000

#plsa #lsa

15320059222910000

#lda introduces sparse #dirichlet prior distributions over #document_topic and #topic_word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words.

15320063707110000
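
To get a feel for what a sparse #dirichlet prior does, here is a sketch that draws from a symmetric Dirichlet by normalizing independent Gamma draws (a standard construction). The helper name `sample_dirichlet`, the 10 topics, and the concentration values 0.1 and 10.0 are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
# Concentration below 1 tends to put most of the mass on a few components.
sparse = sample_dirichlet([0.1] * 10, rng)
# Concentration well above 1 tends to spread the mass evenly.
flat = sample_dirichlet([10.0] * 10, rng)

print([round(p, 3) for p in sparse])
print([round(p, 3) for p in flat])
```

With a small concentration a sampled #document_topic vector typically has only a few non-negligible entries, which is the "documents cover a small number of topics" intuition.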

Other #topic_models are generally extensions of #lda, such as Pachinko allocation, which improves on #lda by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics.

15320064037900000

#tf_idf or #tfidf, short for term #frequency–inverse #document #frequency, is a numerical #statistic that is intended to reflect how important a #word is to a #document in a collection or #corpus.

15320064666170000

83% of text-based #recommender systems in digital libraries use #tf_idf

15320067053150000

latent Dirichlet allocation (#lda) is a generative statistical #model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

15320068504510000

In #lda, each #document may be viewed as a mixture of various #topics where each #document is considered to have a set of #topics that are assigned to it via #lda. This is identical to probabilistic latent semantic analysis (#plsa), except that in #lda the #topic distribution is assumed to have a sparse #dirichlet prior.

15320068884790000

#lda is a generalization of the #plsa model, which is equivalent to #lda under a uniform #dirichlet prior #distribution.

15320069331320000

#lda and #lsa study commonalities between different #words (how often they are used together) to identify #topics that go together. This data can then be used to extract the right set of #documents for a #search #query or to see which #words are related to a #search #query.

15320070877630000

Pachinko allocation model (#pam) is a #topic model. Topic #models are a suite of algorithms to uncover the hidden thematic structure of a #collection of #documents.[1] The algorithm improves upon earlier #topic #models such as latent Dirichlet allocation (#lda) by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics.

15320071752810000

#dirichlet process is a #probability #distribution whose range is itself a set of #probability #distributions.

15320094708400000

#n_gram is a contiguous #sequence of n items from a given sample of #text or #speech.

15320145536110000
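
A short sketch of extracting #n_gram sequences with a sliding window. The helper name `ngrams` is illustrative; the items can be words (a token list) or characters (a string).

```python
def ngrams(items, n):
    """Return all contiguous length-n windows over a sequence, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)       # word-level 2-grams
char_trigrams = ngrams("text", 3)  # character-level 3-grams
print(bigrams)
print(char_trigrams)
```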

An #n_gram #model is a type of probabilistic language #model for predicting the next item in such a #sequence in the form of an (n − 1)-order #markov #model.

15320148440240000
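
As a sketch of the simplest case, n = 2 gives a first-order #markov #model: the next word depends only on the previous one. The probabilities below are plain maximum-likelihood estimates (count of the pair divided by count of the preceding word) with no smoothing, which is an assumption; the function name and the toy sentence are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Estimate P(next | previous) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        # Unsmoothed estimate: unseen pairs simply have no entry.
        model[prev] = {word: c / total for word, c in followers.items()}
    return model

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
print(model["the"])
print(model["cat"])
```

In practice some smoothing is usually added so that unseen pairs do not get zero probability.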

#rst is used in #text #summarization and other applications. #rst addresses #text organization by means of #relations that hold between parts of #text. It explains coherence by postulating a hierarchical, connected #structure of #texts.

15320151967670000

#rhetorical_structure_theory #rst

15320152106830000

#semantic_folding theory describes a procedure for encoding the #semantics of natural #language #text in a semantically grounded #binary #representation.

15320153750780000

#semantic_compression is a process of compacting a #lexicon used to build a textual #document

15320154290740000

#tf_idf with #k_means #clustering

15320170457310000

#k_means returns #cluster centroids as "#topics" and #lda assigns #words to the different #topics

15320170632450000

#topic_modeling through 4 of the most popular techniques today: #lsa, #plsa, #lda, and the newer, #deep_learning based #lda2vec. https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

15320172047250000

for #lsa we need a #corpus of #documents to analyze which #words belong to which #documents and to down-weight the ones that appear too frequently across all the #texts

15320173127440000

This #dimensionality_reduction can be performed using truncated #svd, so that the #document #term #matrix is factorized into a #document #topic #matrix and a #term #topic #matrix

15320174941580000
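
A rough sketch of the idea behind truncated #svd, reduced to the extreme case of keeping a single component. It uses power iteration to approximate the largest singular value and its two singular vectors; this is a swapped-in teaching method, not what a library SVD routine does internally. The function name, the toy count matrix (rows as documents, columns as terms), and the fixed iteration count are assumptions. The matrix is assumed to be non-zero with non-negative entries, as a count matrix is.

```python
import math

def top_singular_triplet(matrix, iterations=100):
    """Approximate (u, sigma, v) for the largest singular value of matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Start from a uniform unit vector over the columns (terms).
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v, then w = A^T u; normalizing w gives the next v.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Toy counts: documents 0 and 1 share terms 0 and 1; document 2 uses term 2.
A = [[2, 1, 0],
     [1, 2, 0],
     [0, 0, 1]]
u, sigma, v = top_singular_triplet(A)
print(sigma)
print([round(x, 3) for x in u])  # document loadings on the leading "topic"
print([round(x, 3) for x in v])  # term loadings on the leading "topic"
```

Here `u` plays the role of one column of the #document #topic #matrix and `v` one column of the #term #topic #matrix; keeping t such components instead of one gives the usual rank-t truncation.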

#lda is a #bayesian version of #plsa.

15320176231750000

At the #word level, we typically use something like #word2vec to obtain #vector representations.

15320182832800000

#lda2vec is an extension of #word2vec and #lda that jointly learns #word, #document, and #topic_vectors

15320183084260000