#Text_mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality #information from #text.


InfraNodus Project Notes:



[text_mining]

@text_mining tasks include #text_categorization, #text_clustering, #concept_extraction, production of #granular_taxonomies, #sentiment_analysis, #document_summarization

[sentiment_analysis]

@sentiment_analysis #wordnet #conceptnet

#text_mining #nlp

#lsa is a technique in #nlp

#LSA assumes that #words that are close in meaning will occur in similar pieces of #text

#lsa can use a #term_document_matrix which describes the occurrences of #terms in #documents; it is a #sparse_matrix whose rows correspond to #terms and whose columns correspond to #documents.

A typical example of the weighting of the elements of the #matrix in #lsa is #tf_idf (term #frequency–inverse document #frequency): the weight of an element of the #matrix is proportional to the number of times the #terms appear in each document, where rare #terms are upweighted to reflect their relative importance.
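As an illustrative sketch (the toy corpus and the unsmoothed idf formula below are assumptions for the example, not part of these notes), tf-idf weighting of a term-document matrix can be computed in plain Python:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (assumed example data).
docs = [
    ["text", "mining", "derives", "information", "from", "text"],
    ["topic", "models", "discover", "topics", "in", "documents"],
    ["lsa", "uses", "a", "term", "document", "matrix"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Raw term frequency times inverse document frequency."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])  # rare terms get upweighted
    return tf * idf

# "text" occurs twice in doc 0 but only in that one document,
# so it receives a high weight there:
print(tf_idf("text", docs[0]))
```

Real implementations usually add smoothing to the idf term; this version keeps the bare formula for clarity.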

#lsa can be used to #analyze #word_association in #text_corpus

#lsa has been used to assist in performing #prior_art searches for #patents.

The use of #lsa has been prevalent in the study of human #memory, especially in areas of #free_recall and #memory_search.

#lsi is an #indexing and #retrieval #method

#lsi is #lsa

[esa]

Explicit semantic analysis (@esa) is a #vectoral_representation of #text (individual words or entire #documents) that uses a #document_corpus as a knowledge base. Specifically, in esa, a word is represented as a column vector in the #tf_idf matrix of the #text corpus and a #document (string of words) is represented as the centroid of the vectors representing its words.
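The centroid step can be sketched as follows; the 3-concept word vectors below are made-up example values, not taken from any real tf-idf matrix:

```python
# In ESA, a document vector is the component-wise mean (centroid) of the
# concept vectors of its words. These word vectors are assumed toy values.
word_vectors = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.7, 0.3, 0.0],
    "bank": [0.0, 0.2, 0.8],
}

def document_vector(words):
    """Centroid (component-wise mean) of the words' concept vectors."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(document_vector(["cat", "dog"]))  # midpoint of the two word vectors
```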

#document_corpus #document

#documents #document

[topic_model]

@topic_model is a type of #statistical_model for discovering the abstract "#topics" that occur in a collection of #documents.

An early #topic_model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.[2] Another one, called probabilistic latent semantic analysis (#plsa), was created by Thomas Hofmann in 1999.[3] Latent Dirichlet allocation (#lda), perhaps the most common #topic_model currently in use, is a generalization of #plsa.

#plsa #lsa

#lda introduces sparse #dirichlet prior distributions over #document_topic and #topic_word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words.

#topic_models are generally extensions on #lda, such as Pachinko allocation, which improves on #lda by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics

#tf_idf or #tfidf, short for term #frequency–inverse #document #frequency, is a numerical #statistic that is intended to reflect how important a #word is to a #document in a collection or #corpus.

83% of text-based #recommender systems in digital libraries use #tf_idf

latent Dirichlet allocation (#lda) is a generative statistical #model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

In #lda, each #document may be viewed as a mixture of various #topics where each #document is considered to have a set of #topics that are assigned to it via #lda. This is identical to probabilistic latent semantic analysis (#plsa), except that in #lda the #topic distribution is assumed to have a sparse #dirichlet prior.

#lda is a generalization of the #plsa model, which is equivalent to #lda under a uniform #dirichlet prior #distribution.

#lda and #lsa study commonalities between different #words (how often they are used together) to identify #topics that go together. This data can then be used to extract the right set of #documents for a #search #query or to see which #words are related to a #search #query

Pachinko allocation model (#pam) is a #topic model. Topic #models are a suite of algorithms to uncover the hidden thematic structure of a #collection of #documents. [1] The algorithm improves upon earlier #topic #models such as latent Dirichlet allocation (#lda) by modeling #correlations between #topics in addition to the #word #correlations which constitute #topics.

#dirichlet process is a #probability #distribution whose range is itself a set of #probability #distributions
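One common way to draw from a Dirichlet process is the stick-breaking construction; the sketch below, with an arbitrary concentration parameter, is an illustrative assumption rather than anything specified in the note:

```python
import random

def stick_breaking(alpha, n_atoms, seed=0):
    """First n_atoms stick-breaking weights of a DP with concentration alpha."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        # Break off a Beta(1, alpha) fraction of the remaining stick.
        b = rng.betavariate(1.0, alpha)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    return weights

w = stick_breaking(alpha=2.0, n_atoms=20)
print(sum(w))  # the truncated weights sum toward 1 as n_atoms grows
```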

#n_gram is a contiguous #sequence of n items from a given sample of #text or #speech.

An #n_gram #model is a type of probabilistic language #model for predicting the next item in such a #sequence in the form of a (n − 1)–order #markov #model
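For n = 2 this reduces to a first-order Markov model over words; a minimal sketch (the tiny training string is an assumed example):

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous n-item sequences in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_bigram_model(tokens):
    """Map each word to a Counter of its observed successors."""
    model = defaultdict(Counter)
    for prev, nxt in ngrams(tokens, 2):
        model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Most frequent successor of word: a (2 - 1)-order Markov prediction."""
    return model[word].most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```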

#rst is used in #text #summarization and other applications. #rst addresses #text organization by means of #relations that hold between parts of #text. It explains coherence by postulating a hierarchical, connected #structure of #texts

#rhetorical_structure_theory #rst

#semantic_folding theory describes a procedure for encoding the #semantics of natural #language #text in a semantically grounded #binary #representation.

#semantic_compression is a process of compacting a #lexicon used to build a textual #document

#tf_idf with #k_means #clustering

#k_means returns #cluster centroids as "#topics" and #lda assigns #words to the different #topics
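A minimal k-means sketch on toy 2-D "document vectors" (the points and the fixed initial centroids are made-up example values, chosen so the run is deterministic):

```python
def kmeans(points, centroids, iters=10):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

# Two obvious groups of "documents" in a 2-D tf-idf-like space (toy data).
points = [[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]]
centroids = kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(centroids)  # each centroid can be read as a "topic" center
```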

#topic_modeling through 4 of the most popular techniques today: #lsa, #plsa, #lda, and the newer, #deep_learning based #lda2vec. https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

for #lsa we need a #corpus of #documents to analyze which #words belong to which #documents and also to down-weight the #words that appear too frequently across all the #texts

This #dimensionality_reduction can be performed using truncated #svd, so that the #document #term #matrix is factored into a #document #topic #matrix and a #topic #term #matrix
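The truncated SVD step can be sketched with NumPy (an assumed dependency; the tiny count matrix is example data). Keeping only the top k singular values factors the document-term matrix into document-topic and topic-term parts:

```python
import numpy as np

# Toy document-term count matrix: 4 documents x 5 terms (example values).
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_topic = U[:, :k] * s[:k]   # documents in the k-dimensional topic space
topic_term = Vt[:k, :]         # topics as weighted term vectors
A_k = doc_topic @ topic_term   # rank-k approximation of A

print(A_k.shape)
```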

#lda is a #bayesian version of #plsa.

At the #word level, we typically use something like #word2vec to obtain #vector representations.

#lda2vec is an extension of #word2vec and #lda that jointly learns #word, #document, and #topic_vectors


Semantic Variability Score

This score modulates the diversity of the discourse network. It is calculated based on how modular the structure of the graph is (> 0.4 means the clusters are distinct and separate from one another, i.e. multiple perspectives). It also takes into account how the most influential nodes are dispersed among those clusters (higher % = lower concentration of power in a particular cluster).


We distinguish 4 states of variability in your discourse. We recommend that a well-formed discourse should go through every stage during its evolution (in several iterations).

1 - (bottom left quadrant) - biased - low variability, low diversity, one central idea (genesis and introduction stage).

2 - (top right) - focused - medium variability and diversity, several concepts form a cluster (coherent communication stage).

3 - (bottom right) - diversified - several distinct clusters of main ideas are present in the text; they interact on the global level but maintain specificity (optimization and reflection stage).

4 - (top left) - dispersed - very high variability; disjointed bits and pieces of unrelated ideas that can be used to construct new ideas (creative reformulation stage).


Main Topical Clusters:

The topical clusters are comprised of the nodes (words) that tend to co-occur together in the same context (next to each other).

We use a combination of clustering and a graph community detection algorithm (Louvain, Blondel et al.) to identify the groups of nodes that are more densely connected together than with the rest of the network. They are placed closer to each other on the graph using the Force Atlas algorithm (Jacomy et al.) and are given a distinct color.


Most Influential Elements:

We use the Jenks elbow cutoff algorithm to select the top prominent nodes that have significantly higher influence than the rest.

Click the Reveal Non-obvious button to remove the most influential words (or the ones you select) from the graph, to see what terms are hiding behind them.

The most influential nodes are either the ones with the highest betweenness centrality — appearing most often on the shortest path between any two randomly chosen nodes (i.e. linking the different distinct communities) — or the ones with the highest degree.


Network Structure:


The network structure indicates the level of its diversity. It is based on the modularity measure (>0.4 for medium, >0.65 for high modularity, measured with the Louvain (Blondel et al 2008) community detection algorithm) in combination with the measure of influence distribution (the entropy of the top nodes' distribution among the top clusters), as well as the percentage of nodes in the top community.



Structural Gap

(ask a research question that would link these two topics)

A structural gap shows the two distinct communities (clusters of words) in this graph that are important, but not yet connected. That's where the new potential and innovative ideas may reside.

This measure is based on a combination of the graph's connectivity and community structure, selecting the groups of nodes that would either make the graph more connected if it's too dispersed or that would help maintain diversity if it's too connected.


Latent Topical Brokers

(less visible terms that link important topics):

These are the latent brokers between the topics: the nodes that have an unusually high ratio of influence (betweenness centrality) to their frequency — meaning they may not appear as often as the most influential nodes, but they are important narrative shifting points.

These are usually brokers between different clusters / communities of nodes, playing a not easily noticed yet important role in this network, like "grey cardinals" of sorts.



Evolution of Topics

(number of occurrences per text segment)

The chart shows how the main topics and the most influential keywords evolved over time. X-axis: time period (split into 10% blocks). Y-axis: cumulative number of occurrences.

Drag the slider to see how the narrative evolved over time. Select the checkbox to recalculate the metrics at every step (slower, but more precise).


Main Topics

(according to Latent Dirichlet Allocation): LDA stands for Latent Dirichlet Allocation — it is a topic modelling algorithm based on calculating the maximum probability of the terms' co-occurrence in a particular text or a corpus.

We provide this data for you to be able to estimate the precision of the default InfraNodus topic modeling method based on text network analysis.


Most Influential Words

(main topics and words according to LDA)

Sentiment Analysis

We analyze the sentiment of each statement to see whether it's positive, negative, or neutral. You can filter the statements by sentiment (clicking above) and see what kind of topics correlate with every mood.

The approach is based on AFINN and Emoji Sentiment Ranking


Keyword Relations Analysis:

Top Relations / Bigrams

(both directions):

The most prominent relations between the nodes that exist in this graph are shown above. We treat the graph as undirected by default. Occurrences shows the number of times a relationship appears in a 4-gram window. Weight shows the weight of that relation.

As an option, you can also download the directed bigrams above, in case the direction of the relations is important (for any application other than language).


Text Statistics:

Word Count | Unique Lemmas | Characters | Lemmas Density

Text Network Statistics:

The higher the network's structural diversity and the higher the alpha in the influence propagation score, the higher its mind-viral immunity — that is, such a network will be more resilient and adaptive than a less diverse one.

In case of a discourse network, high mind-viral immunity means that the text proposes multiple points of view and propagates its influence using both highly influential concepts and smaller, secondary topics.


The higher the diversity, the more distinct communities (topics) there are in this network, and the more likely it is to be pluralist.



Narrative Influence Propagation:

The chart above shows how influence propagates through the network. X-axis: lemma to lemma step (narrative chronology). Y-axis: change of influence.

The more even and rhythmical this propagation is, the stronger is the central idea or agenda (see alpha exponent below ~ 0.5 or less).

The more variability can be seen in the propagation profile, the less is the reliance on the main concepts (agenda), the stronger is the role of secondary topical clusters in the narrative.


We plot the narrative as a time series of influence (using the words' betweenness score). We then apply detrended fluctuation analysis to identify fractality of this time series, plotting the log2 scales (x) to the log2 of accumulated fluctuations (y). If the resulting loglog relation can be approximated on a linear polyfit, there may be a power-law relation in how the influence propagates in this narrative over time (e.g. most of the time non-influential words, occasionally words with a high influence).

Using the alpha exponent of the fit (which is closely related to the Hurst exponent), we can better understand the nature of this relation: uniform (pulsating | alpha <= 0.65), variable (stationary, has long-term correlations | 0.65 < alpha <= 0.85), fractal (adaptive | 0.85 < alpha < 1.15), and complex (non-stationary | alpha >= 1.15).

For maximal diversity, adaptivity, and plurality, the narrative should be close to "fractal" (near-critical state). For fiction, essays, and some forms of poetry — "uniform". Informative texts will often have "variable + stationary" score. The "complex" state is an indicator that the text is always shifting its state.


Degree Distribution:

(based on the Kolmogorov-Smirnov test)

Using this information, you can identify whether the network has scale-free / small-world (long-tail power law distribution) or random (normal, bell-shaped distribution) network properties.

This may be important for understanding the level of resilience and the dynamics of propagation in this network. E.g. scale-free networks with long degree tails are more resilient against random attacks and will propagate information across the whole structure better.


If a power-law is identified, the nodes have preferential attachment (e.g. 20% of nodes tend to get 80% of connections), and the network may be scale-free, which may indicate that it's more resilient and adaptive. Absence of power law may indicate a more equalized distribution of influence.

The Kolmogorov-Smirnov test compares the distribution above to the "ideal" power-law ones (^1, ^1.5, ^2) and looks for the best fit. If the value d is below the critical value cr, it is a sign that both distributions are similar.
