#Text_mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality #information from #text.

15318316694200000

@text_mining tasks include #text_categorization, #text_clustering, #concept_extraction, production of #granular_taxonomies, #sentiment_analysis, and #document_summarization.

15318365723260000

@sentiment_analysis #wordnet #conceptnet

15318366634190000

#text_mining #nlp

15318367076060000

#lsa is a technique in #nlp

15318367236120000

#LSA assumes that #words that are close in meaning will occur in similar pieces of #text

15318367507260000

#lsa can use a #term_document_matrix which describes the occurrences of #terms in #documents; it is a #sparse_matrix whose rows correspond to #terms and whose columns correspond to #documents.
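
A minimal sketch of such a term-document matrix in Python. It assumes naive whitespace tokenization and a three-document toy corpus invented for illustration; the helper name `term_document_matrix` is hypothetical and not taken from any library. Only non-zero counts are stored, which mirrors the sparse layout described above.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[term] maps document index -> count.

    Only non-zero counts are stored, so the structure is sparse:
    rows correspond to terms, columns to documents.
    """
    matrix = {}
    for doc_index, text in enumerate(documents):
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for term, count in Counter(text.lower().split()).items():
            matrix.setdefault(term, {})[doc_index] = count
    terms = sorted(matrix)
    return terms, matrix

# Toy corpus, invented for illustration.
docs = [
    "text mining finds patterns in text",
    "topic models find topics in documents",
    "mining documents for topics",
]
terms, tdm = term_document_matrix(docs)

# Dense view of one row: occurrences of "text" in each document.
row_text = [tdm["text"].get(j, 0) for j in range(len(docs))]
print(row_text)
```

Each row of `tdm` holds entries only for the documents that contain the term, so most of the conceptual matrix is never materialized.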

15320042189630000

A typical example of the weighting of the elements of the #matrix in #lsa is #tf_idf (term #frequency–inverse document #frequency): the weight of an element of the #matrix is proportional to the number of times the #terms appear in each document, where rare #terms are upweighted to reflect their relative importance.
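
The weighting can be sketched as follows. This assumes one common variant (raw term count multiplied by log(N / df)); smoothed or length-normalized variants are also widely used, and the function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw term count * log(N / df), where N is the
    number of documents and df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

# Toy corpus, invented for illustration.
docs = [
    "text mining finds patterns in text",
    "topic models find topics in documents",
    "mining documents for topics",
]
weights = tf_idf(docs)
```

In this toy corpus "text" occurs only in the first document, so it receives a higher weight there than "in", which occurs in two of the three documents.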

15320043157840000

#lsa can be used to #analyze #word_association in #text_corpus

15320043860300000

#lsa has been used to assist in performing #prior_art searches for #patents.

15320045590930000

The use of #lsa has been prevalent in the study of human #memory, especially in areas of #free_recall and #memory_search.

15320050137720000

#lsi is an #indexing and #retrieval #method

15320050868150000

#lsi (latent semantic indexing) is #lsa applied to #indexing and #retrieval

15320051333970000

explicit semantic analysis (@esa) is a #vectoral_representation of #text (individual words or entire #documents) that uses a #document_corpus as a knowledge base. Specifically, in esa, a word is represented as a column vector in the #tf_idf matrix of the #text corpus and a #document (string of words) is represented as the centroid of the vectors representing its words.
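
The centroid step can be illustrated with a small sketch. The word vectors below are made-up numbers standing in for tf-idf weights against a three-document knowledge base, and the helper names `centroid` and `represent` are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

# Hypothetical tf-idf weights of three words against a three-document
# knowledge base (one component per corpus document); values are made up.
word_vectors = {
    "topic":  [0.0, 0.9, 0.3],
    "model":  [0.0, 0.6, 0.0],
    "mining": [0.6, 0.0, 0.3],
}

def represent(text):
    """Represent a text as the centroid of the vectors of its known words."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError("no known words in text")
    return centroid(known)

doc_vector = represent("topic model")
```

Words missing from the knowledge base are simply skipped here; that is a design choice of this sketch, not something the description above prescribes.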

15320056184080000

#document_corpus #document

15320056424310000

#documents #document

15320056482640000

@topic_model is a type of #statistical_model for discovering the abstract "#topics" that occur in a collection of #documents.

15320057777540000

An early #topic_model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.[2] Another one, called probabilistic latent semantic analysis (#plsa), was created by Thomas Hofmann in 1999.[3] Latent Dirichlet allocation (#lda), perhaps the most common #topic_model currently in use, is a generalization of #plsa

15320058896010000

#plsa #lsa

15320059222910000

#lda introduces sparse #dirichlet prior distributions over #document_topic and #topic_word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words.

15320063707110000

Other #topic_models are generally extensions of #lda, such as Pachinko allocation, which improves on #lda by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics

15320064037900000

#tf_idf or #tfidf, short for term #frequency–inverse #document #frequency, is a numerical #statistic that is intended to reflect how important a #word is to a #document in a collection or #corpus.

15320064666170000

83% of text-based #recommender systems in digital libraries use #tf_idf

15320067053150000

latent Dirichlet allocation (#lda) is a generative statistical #model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

15320068504510000

In #lda, each #document may be viewed as a mixture of various #topics where each #document is considered to have a set of #topics that are assigned to it via #lda. This is identical to probabilistic latent semantic analysis (#plsa), except that in #lda the #topic distribution is assumed to have a sparse #dirichlet prior.
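
The generative view can be written out as below. The notation is the conventional one and is an assumption here, since the notes above introduce no symbols: K topics, D documents, N_d words in document d, Dirichlet parameters α and β, per-document topic distribution θ_d, per-topic word distribution φ_k, topic assignment z_{d,n}, and observed word w_{d,n}.

```latex
\begin{align*}
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} &\sim \operatorname{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

A sparse prior corresponds to concentration parameters below 1, which favors documents that concentrate on few topics and topics that concentrate on few words.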

15320068884790000

#lda is a generalization of the #plsa model, which is equivalent to #lda under a uniform #dirichlet prior #distribution.

15320069331320000

#lda and #lsa study commonalities between different #words (how often they are used together) to identify #topics that go together. This data can then be used to extract the right set of #documents for a #search #query or to see which #words are related to a #search #query.

15320070877630000

pachinko allocation model (#pam) is a #topic model. Topic #models are a suite of algorithms to uncover the hidden thematic structure of a #collection of #documents. [1] The algorithm improves upon earlier #topic #models such as latent Dirichlet allocation (#lda) by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics.

15320071752810000

A #dirichlet process is a #probability #distribution whose range is itself a set of #probability #distributions

15320094708400000

#n_gram is a contiguous #sequence of n items from a given sample of #text or #speech.
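
A short sketch of n-gram extraction over a token list. The whitespace tokenization and the function name `ngrams` are assumptions made for illustration.

```python
def ngrams(items, n):
    """Return all contiguous n-item subsequences of `items` as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = "the process of deriving information from text".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The same function works on characters or any other sequence of items, for example `ngrams(list("text"), 2)`.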

15320145536110000

An #n_gram #model is a type of probabilistic language #model for predicting the next item in such a #sequence in the form of an (n − 1)-order #markov #model
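
For n = 2 this is a first-order Markov model: the next word depends only on the previous one. The sketch below estimates those probabilities by plain counting (maximum likelihood, no smoothing); the function names and the example sentence are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for each word, how often each other word directly follows it."""
    followers = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        followers[previous][current] += 1
    return followers

def next_word_probabilities(model, previous):
    """Maximum-likelihood estimate of P(next | previous), no smoothing."""
    counts = model.get(previous)
    if not counts:
        return {}
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)
probs = next_word_probabilities(model, "the")
```

Unseen contexts return an empty dictionary here; a practical model would typically add smoothing or back off to shorter contexts.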

15320148440240000

#rst is used in #text #summarization and other applications. #rst addresses #text organization by means of #relations that hold between parts of #text. It explains coherence by postulating a hierarchical, connected #structure of #texts

15320151967670000

#rhetorical_structure_theory #rst

15320152106830000

#semantic_folding theory describes a procedure for encoding the #semantics of natural #language #text in a semantically grounded #binary #representation.

15320153750780000

#semantic_compression is a process of compacting a #lexicon used to build a textual #document

15320154290740000

#tf_idf with #k_means #clustering

15320170457310000

#k_means returns #cluster centroids as "#topics" and #lda assigns #words to the different #topics

15320170632450000

#topic_modeling through 4 of the most popular techniques today: #lsa, #plsa, #lda, and the newer, #deep_learning based #lda2vec. https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

15320172047250000

for #lsa we need a #corpus of #documents to analyze which #words belong to which #documents and to down-weight the #words that appear too frequently across all the #texts

15320173127440000

This #dimensionality_reduction can be performed using truncated #svd, which factorizes the #term #document #matrix into a #term #topic #matrix and a #topic #document #matrix
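
One way to write the truncated factorization, using symbols that are assumptions of this sketch rather than notation from the notes: A is the (weighted) term-document matrix with m terms as rows and n documents as columns, and k is the chosen number of topics.

```latex
\[
A \;\approx\; U_k \, \Sigma_k \, V_k^{\top},
\qquad
A \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k^{\top} \in \mathbb{R}^{k \times n}
\]
```

Here the rows of U_k relate terms to topics, the columns of V_k^T relate topics to documents, and Σ_k is the diagonal matrix of the k largest singular values.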

15320174941580000

#lda is a #bayesian version of #plsa.

15320176231750000

At the #word level, we typically use something like #word2vec to obtain #vector representations.

15320182832800000

#lda2vec is an extension of #word2vec and #lda that jointly learns #word, #document, and #topic_vectors

15320183084260000
