#machine_learning #supervised #unsupervised

16010218210000000

#features are used in #machine_learning for #differentiation

16010218950000000

#features are used for #training in #machine_learning

16010219230000000

the #relationship is called a #model in #machine_learning

16010219690000000

#supervised #machine_learning finds #patterns between #data and #labels

16010220170000000

#patterns are used to make #predictions

16010220240000000

#supervised works with #labeled #data

16010220620000000

the goal of #unsupervised is to identify meaningful #patterns in #data

16010221780000000

sometimes a #model can find #patterns that represent #stereotypes or #bias

16010222420000000

#clustering is a type of #unsupervised #learning

16010222680000000

with #reinforcement_learning (RL) you set up a #model (called an #agent in RL) where it receives a #reward each time it performs well (#reward_function)

16010223910000000

a #shaped #reward increases in the states closer to the #goal #state

16010228960000000

a #sparse #reward is in the #goal #state only

16010229280000000

#positive_reinforcement is an important element of #reinforcement_learning

16010233020000000

if the #reward is providing the #features to the #model that could improve the performance

16010233700000000

#machine_learning #problems: #classification, #regression, #clustering, #association learning, #structured_output, #ranking

16010234640000000

#clustering is an #unsupervised #learning problem

16010235060000000

#regression requires labeled data — #supervised learning problem

16010235420000000

#classification requires a set of #labels - so it is #supervised

16010235640000000

a #neural_network works through #representations

16010236980000000

#machine_learning process as an #experiment where we run #test after test after test to converge on a workable #model

16010242270000000

A well-defined #problem has both #inputs and #outputs. #inputs are the #features. #outputs are the #labels to predict.

16011504310000000

#training means creating or #learning the #model

16011570040000000

#inference means applying the trained #model to #unlabeled #examples

16011570150000000

A #regression #model #predicts continuous #values.

16011570330000000

A #classification #model #predicts #discrete #values.

16011570470000000

#machine_learning needs to provide #decisions rather than just #predictions

16011573670000000

#labels are the #variables or #values for #predictions

16012012510000000

#features are #inputs #variables describing the #data

16012012690000000

#model_training is done on #data that has both #features and #labels, so that the model knows what #correlations to extract

16012013200000000

#model maps #examples to predicted #labels

16012013740000000

#loss_function is showing us the degree of #deviation of the #model #prediction from the real #values

16012017050000000

#loss_function can be the squared #difference between the #prediction and the #labels

16012017430000000

the raw error for one example is #observation - #prediction; the #loss_function is built from it (for example, by squaring it)

16012017860000000

#mean_square_error is the average squared #deviation of the #prediction from the #labels, taken over every example (the #loss_function averaged per element)

16012020980000000

#model_training is usually based on reducing the #loss in the #loss_function via #mean_square_error but not only

16012022160000000

the #gradient_descent approach is used to minimize the #loss_function

16012045160000000

the #learning_rate determines the size of each #gradient_descent step

16012047110000000

an #epoch is one full pass over the training data in #machine_learning; it is made up of N / batch size #batches, where N is the number of examples

16012051550000000

if the #training #loss stops decreasing (or changes only very slowly) we say that the model has #converged

16012058970000000

the #goldilocks #learning rate for a curve is the one where #gradient descent reaches the minimum point in the fewest number of steps

16025817060000000

In #supervised #learning, a machine #learning algorithm builds a model by examining many examples and attempting to find a model that minimizes #loss; this process is called empirical #risk_minimization.

16025818180000000

#loss is the penalty for a bad #prediction. That is, loss is a number indicating how bad the model's #prediction was on a single example.

16025818380000000

Mean square error (#mse) is the average squared #loss per example over the whole #dataset.

16025818610000000
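
A minimal sketch of the #mse definition above in plain Python. The function name and the sample numbers are illustrative assumptions, not taken from any course material.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_losses = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sum(squared_losses) / len(squared_losses)


# Hypothetical data: squared losses are 1, 0 and 4, so the mean is 5 / 3.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
mse = mean_squared_error(labels, predictions)
```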

#iterative #learning is used to decrease #loss over time by making small steps and receiving #feedback on the #outputs

16025819620000000

Usually, you #iterate until overall #loss stops changing or at least changes extremely slowly. When that happens, we say that the #model has #converged.

16025822530000000

#epoch represents a full training pass over the entire #dataset such that each example has been seen once. Thus, an #epoch represents N / #batch size training iterations, where N is the total number of examples.

16025843760000000
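
The N / #batch size relation can be checked with a few lines. This is a sketch; rounding up when N is not a multiple of the batch size is an assumption (some pipelines drop the last partial batch instead).

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Training iterations needed to see every example once."""
    # Assumption: a final, smaller batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)


full_batches = iterations_per_epoch(1000, 100)      # 1000 / 100
with_remainder = iterations_per_epoch(1050, 100)    # 10 full batches + 1 partial
```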

#learning_rate is a scalar used to #train a model via #gradient_descent. During each iteration, the #gradient_descent algorithm multiplies the #learning_rate by the #gradient. The resulting product is called the #gradient step. #learning_rate is a key #hyperparameter.

16025844630000000
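
A small sketch of the #gradient step described above, fitting a one-weight model y = w * x by minimizing mean squared error. The data, the learning rate of 0.05 and the step count are illustrative assumptions.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = w * x by repeatedly stepping against the MSE gradient."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # The gradient step is the learning rate multiplied by the gradient.
        w -= learning_rate * gradient
    return w


# Hypothetical data generated by y = 2 * x, so w should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = gradient_descent(xs, ys)
```

With a much larger learning rate the same loop would overshoot and diverge, which is why the learning rate is treated as a key hyperparameter.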

#batch is a number of #examples used in an #iteration

16025845230000000

#anomalies in #features may indicate a potential problem in a #dataset - one should be more careful using that sort of data.

16025888810000000

#learning_rate specifies the size of the step, #batch specifies how many examples we take into each step of the learning process, and the number of #epoch specifies how many full passes over the data we're going to have

16025892340000000

a #synthetic_feature is made out of several #features and may help #prediction

16025896150000000

a #correlation_matrix shows if there are any #correlations between the #features

16025898840000000

we take a #training_set and a #test_set from our #data, then we train the model on the #training_set and check how good its #prediction is on the #test_set

16025905230000000

An #overfit model gets a low #loss during training but does a poor job #predicting new data.

16026035820000000

#machine_learning 's goal is to predict well on new #data drawn from a (hidden) true #probability #distribution.

16026036480000000

The less complex a #machine_learning model, the more likely that a good #empirical result is not just due to the #peculiarities of the #sample.

16026036890000000

Partitioning a #data set into a #training_set and #test_set lets you judge whether a given #model will generalize well to new #data.

16026042880000000
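
One possible way to partition a data set, sketched with the standard library only. The 20% test fraction, the fixed seed and the function name are assumptions made for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not ordered
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(list(range(10)), test_fraction=0.2)
```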

#partitioning a #data set into a #training_set and a #test_set

16026044740000000

the #data can be split three ways, into a #training_set, a #validation_set and a #test_set, so that the model is tuned against the validation data and the test data stays untouched, which lowers the risk of #overfitting to the test set

16026049020000000

#features are very important for #machine_learning models

16026135580000000

#features engineering should remove #outliers (which might lead to the problem of a #model not #predicting outstanding events)

16026138050000000

#data_visualization is important for knowing #data and improving #machine_learning models

16026138800000000

#one_hot_encoding allows us to incorporate categorical #data into our #model

16026141480000000
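
A minimal sketch of #one_hot_encoding for a single categorical value. The three-colour vocabulary is a made-up example.

```python
def one_hot(value, vocabulary):
    """Return a vector with a 1 at the position of value and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1  # raises ValueError for unknown values
    return vector


vocabulary = ["red", "green", "blue"]  # hypothetical categories
encoded = one_hot("green", vocabulary)
```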

for categorical data with a very large number of possible #values, a #sparse #representation is used (only the nonzero entries are stored)

16026141940000000

#binning_values allows us to simplify the #data and bring it to the #feature_vector using #one_hot_encoding

16026145190000000

#binning_values by #quantile ensures the number of #examples in each bucket is (roughly) the same

16026145840000000

#feature_crossing can be a powerful way to improve #prediction by combining the #data #features of the #dataset

16026156950000000

#feature_crossing is often done for #one_hot_encoding where multiple #features are crossed to produce interesting #feature_vector

16026185150000000
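
A sketch of crossing two one-hot features: every element of the first vector is multiplied by every element of the second, so exactly one position of the crossed vector is 1. The function name and the example vectors are assumptions.

```python
def feature_cross(first, second):
    """Cross two one-hot vectors into one longer one-hot vector."""
    return [a * b for a in first for b in second]


# Hypothetical features: a 3-way category crossed with a 2-way category.
crossed = feature_cross([0, 1, 0], [1, 0])
```

The crossed vector has len(first) * len(second) dimensions, which is why crossing sparse features grows the feature space quickly.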

#feature_crossing is one learning strategy #neural_network is another

16026188030000000

minimizing #loss_complexity (loss plus complexity together), which is called #structural_risk_minimization, helps to avoid #overfitting the #model

16026258790000000

the training objective has two terms: the #loss_term, which measures how well the #model fits the #data, and the #regularization term, which measures model #complexity.

16026275040000000

model #complexity as a function of the #weights of all the #features in the #model.

16026276470000000

model #complexity as a function of the total number of #features with nonzero #weights. (A later module covers this approach.)

16026276470000000

#model developers tune the overall impact of the #regularization term by multiplying its value by a #scalar known as #lambda (also called the #regularization rate).

16026692460000000

#regularization term, which measures model #complexity.

16026692710000000

#regularization is a technique used in an attempt to solve the #overfitting problem in statistical models.

16026697380000000

#regularization penalizes the #loss_function in that it pushes the model to give lower #weights to each parameter in the model

16026698200000000

a model is #learning the #weights for each of the #features as it is #training itself to minimize #loss and #complexity

16026713960000000

#regularization penalizes the #loss_function for too much #complexity (a high number of #features with nonzero #weights )

16026716020000000

#sigmoid_function maps the output of the #linear_layer of a model trained with #logistic_regression to a value between zero and one

16026720630000000

#logistic_regression returns a #probability for a #classification

16026750260000000

In order to map a #logistic_regression value to a binary category, you must define a #classification_threshold (also called the #decision_threshold) #threshold

16026750440000000
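
A sketch of how a #logistic_regression output becomes a binary category: the sigmoid turns the linear output into a probability, and the threshold turns the probability into a class. The default threshold of 0.5 is an assumption; in practice it is tuned per problem.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 (positive class) if the probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0


probability = sigmoid(0.0)               # a linear output of 0 maps to 0.5
strict = classify(0.5, threshold=0.9)    # same score, stricter threshold
```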

A #true_positive is when a model correctly #predicts the #positive_class (it made a positive prediction and it was true). A #true_negative is when a model correctly #predicts a #negative_class

16026753330000000

A #false_positive is when a model incorrectly #predicts the #positive_class. A #false_negative is when a model incorrectly #predicts the #negative_class

16026753790000000

#accuracy of a model #prediction is the ratio of the #correct_predictions to the total number of #predictions

16026754770000000

#prediction #predicts

16026754870000000

#predicting the #prediction

16026755020000000

#accuracy then is a ratio of #true_positive plus #true_negative to the sum of all predictions (#true_negative + #true_positive + #false_negative + #false_positive)

16026760070000000

#accuracy alone doesn't tell the full story when you're working with a #class_imbalanced_data_set, like this one, where there is a significant #disparity between the number of #positive and #negative #labels.

16026761660000000

#precision is a ratio of #true_positive to #total_positives claimed (#true_positive + #false_positive)

16026763640000000

#recall is the ratio of #true_positive to the #total_positives that really happened (#true_positive + #false_negative)

16026764590000000

#classification_threshold should strike a balance between #precision and #recall: raising one usually lowers the other, so both cannot be at their maximum at once

16026766240000000

#precision is based on a #claim and #recall is based on #reality

16026767150000000

The #f1_score is the harmonic #mean of the #precision and #recall

16026767960000000
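
The ratios above can be computed directly from the four confusion counts. This is a sketch with made-up counts for a class-imbalanced case; it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    # Assumption: tp + fp, tp + fn and precision + recall are all nonzero.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of everything claimed positive, how much was right
    recall = tp / (tp + fn)      # of everything really positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data set: 9 real positives out of 100 examples.
metrics = classification_metrics(tp=1, fp=1, tn=90, fn=8)
```

Here accuracy looks high while recall is low, which is the situation the note on class-imbalanced data warns about.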

#roc_curve (receiver operating characteristic curve) is a graph showing the performance of a #classification model at all #classification thresholds: the #true_positive rate vs the #false_positive rate

16026890920000000

#auc provides an aggregate measure of performance across all possible #classification #threshold

16026891300000000
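
One way to compute #auc without drawing the curve uses its ranking interpretation: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below counts such pairs; the labels and scores are invented, and the quadratic loop is only meant for small inputs.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores: 3 of the 4 positive/negative pairs are ordered correctly.
area = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```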

#true_positive rate is basically the #recall because it's the relation of #true_positive to the sum of #true_positive and #false_negative

16026894150000000

#false_positive rate is the reverse of that: the ratio of the false #claim of a positive (#false_positive) to all the negatives in #reality (#false_positive + #true_negative)

16026894840000000

#logistic_regression #predictions should be #unbiased. That is: "average of #predictions" should ≈ "average of observations"

16026907140000000

#prediction_bias is a quantity that measures how far apart the #predictions are from the #observations

16026907520000000

A #z_score is the number of #standard_deviations from the #mean for a particular raw value

16027931560000000
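
A short sketch of the #z_score calculation using the standard library. Using the population standard deviation (pstdev) rather than the sample one is an assumption; the values are invented.

```python
import statistics


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


# Hypothetical raw values with mean 5 and standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```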

#sparse_vector often contains many #dimensions. Creating a #feature_cross results in even more #dimensions, which may lead to a higher use of #resources and #memory

16028434020000000

in a high-dimensional #sparse_vector it is good to encourage as many #weights as possible to be zero, so that we reduce the #complexity of the #model and the toll on #resources

16028435460000000

#l2_regularization encourages #weights to be small, but doesn't force them to exactly zero

16028436030000000

#l2_regularization is a sum of squared #weights and it encourages them to be smaller to reduce the #complexity of the model

16028437430000000

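#lambda is used together with #l2_regularization to set how strongly complexity is penalized: too high a value pushes the #weights towards zero and risks underfitting, too low a value leaves the model free to overfit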

16028437990000000

#regularization and #l2_regularization

16028438170000000

#l2_regularization penalizes the squared #weights, while #l1_regularization penalizes the absolute value of the #weights

16028439830000000
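
A sketch of the two penalty terms and of how #lambda scales them inside the training objective. The weights, the data loss of 1.0 and the lambda of 0.1 are illustrative assumptions.

```python
def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, penalty):
    """Data loss plus lambda times the chosen complexity penalty."""
    return data_loss + lam * penalty(weights)


weights = [0.5, -2.0, 0.0]  # hypothetical model weights
l1 = l1_penalty(weights)
l2 = l2_penalty(weights)
total = regularized_loss(1.0, weights, lam=0.1, penalty=l2_penalty)
```

Note how the large weight (-2.0) dominates the L2 penalty, while the L1 penalty treats every unit of weight the same.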

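if we take the #derivative of #l1_regularization we get a constant, so every step subtracts a fixed amount from a weight and can push it to exactly zero #values (which removes that feature from the model)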

16028440060000000

#neural_networks are a more sophisticated version of #feature_cross. In essence, neural networks do the appropriate #feature_crossing for you.

16028443800000000

"#nonlinear" means that you can't accurately predict a #label with a model of the form b + w1x1 + w2x2. In other words, the "#decision_surface" is not a #line

16028447790000000

#feature_cross is one possible approach to modeling #nonlinear problems.

16028448130000000

a #linear model (#linear_layer) can be represented as a #graph: the #inputs are the #features and the #output is the weighted sum of the #inputs (each input multiplied by its #weights)

16028449160000000

where a #linear model doesn't work we can use the #nonlinear

16028449330000000

a #hidden_layer is a weighted sum of the #Input #values

16028450810000000

a #hidden_layer is a combination of #inputs

16028451020000000

without a nonlinear function applied to it, a #hidden_layer is still part of a #linear model

16028451240000000

a #linear model cannot serve #nonlinear problems (e.g. it cannot fit the #predictions to a curve or identify certain areas or complex #patterns)

16028452470000000

that's why we want to introduce #nonlinear model - we do that by piping each #hidden_layer node through a #nonlinear function

16028452920000000

the #nonlinear function is called the #activation_function - this lets us model very complicated #relations between the #inputs and #outputs

16028453710000000

#sigmoid #nonlinear #activation_function converts the weighted sum (inputs times #weights) to a value between 0 and 1

16028454550000000

#rectified #linear unit #activation_function (or #relu, for short) often works a little better than a smooth function like the #sigmoid as #relu helps add #nonlinear dynamics into the layers of the #inputs

16028455550000000
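
A sketch of one hidden layer with a ReLU activation: each node takes a weighted sum of the inputs and pipes it through the nonlinear function. The layer sizes, weights and helper names are assumptions for illustration.

```python
def relu(x):
    """Rectified linear unit: negative values become 0."""
    return max(0.0, x)


def dense_layer(inputs, weights, biases, activation):
    """Weighted sum per node, followed by the activation function."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Hypothetical layer: 2 inputs feeding 2 hidden nodes.
hidden = dense_layer(
    inputs=[1.0, -2.0],
    weights=[[1.0, 1.0], [2.0, 0.5]],
    biases=[0.0, 0.0],
    activation=relu,
)
```

The first node's weighted sum is negative, so ReLU clamps it to 0; the second node's sum is positive and passes through unchanged.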

#redundancy can be important for a #neural_network as it increases the possibility of #feature_cross that is useful

16028461870000000

an extreme case of #overfitting is #memorizing in which case rather than learning the general #ground_truth the model starts to adapt to the peculiarities and specificities in the #training_set so it becomes less fit to detect the new #patterns in a new set of #data

16028462920000000

Another form of #regularization, called #dropout, is useful for neural networks. It works by randomly "dropping out" unit #activations in a network for a single gradient step.

16028687260000000

#multi_class #neural_networks help identify multiple #labels

16028691760000000

Given a #classification problem with N possible solutions, a #one_vs_all solution consists of N separate #binary #classifiers—one #binary classifier for each possible #outcome.

16028693940000000

#softmax extends the idea of #logistic_regression into a #multi_class world. That is, #softmax assigns decimal #probabilities to each class in a #multi_class problem.

16028695230000000
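
A minimal #softmax sketch. Subtracting the largest logit before exponentiating is a common numerical-stability trick and an addition here, not something the note above states; the logits are invented.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(logits)  # shift for numerical stability (does not change the result)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
```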

instead of the #binary answer in #one_vs_all layer, #softmax gives a #probability for each #outcome

16028696060000000

for example, a #number #classification problem is a #multi_class #classification problem with 10 output classes, one for each digit.

16028708920000000

#collaborative_filtering is the task of making #predictions about the #interests of a user based on #interests of many other #users.

16028734010000000

an #embedding_space maps data by its #features so that the items that are more #similar (or are more likely to be used together) are closer to each other in the #space.

16028752990000000

the #embedding_space can consist of many #dimensions and some of them might not have the exact semantic #meanings in which case they are called #latent_dimension representing a #feature that is not explicit in the #data but is rather inferred from it.

16028754420000000

ultimately it's the distances between the data elements that are important in #embedding_space not the actual #values.

16028754880000000

#categorical_data refers to input #features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person. #categorical_data is most efficiently represented via #sparse_tensors which are tensors with very few non-zero elements (also see #sparse_vector)

16028756220000000

in order to use #sparse_vector #representations within a machine learning system, we need a way to represent each #sparse_vector as a vector of #numbers so that semantically #similar items (movies or words) have #similar distances in the vector space. But how do you represent a word as a vector of #numbers?

16028757450000000

for example, in #one_hot_encoding you would map one of the 500 000 words in a vocabulary as a #sparse_vector where item 2019 would be 1 and the rest are zero (the vector represents the word)

16028758160000000

a "#bag_of_words" #representation contains chunks of words in a #sparse_vector so several values are 1 and most are zero

16028758770000000

#embeddings translate large #sparse_vector into a lower-dimensional #space that preserves #semantic relationships.

16028760620000000

An #embeddings is a #matrix in which each column is the #vector that corresponds to an item in your #vocabulary. To get the dense #vector for a single #vocabulary item, you retrieve the column corresponding to that item.

16028762170000000

#principal_component_analysis (PCA) has been used to create #word #embeddings. Given a set of instances like #bag_of_word #vectors, PCA tries to find highly correlated #dimensions that can be collapsed into a single #dimension.

16028763630000000

#word2vec is an example of representing the language with dense #embedding #vectors instead of a #sparse_vector: it maps semantically #similar words to geometrically close #embedding #vectors

16028765650000000

#word2vec exploits contextual information by training a neural net to distinguish actually co-occurring groups of #words from randomly grouped #words. The #input layer takes a #sparse_vector representation of a target word together with one or more context words.

16028766440000000

a #static_model is trained #offline, a #dynamic_model is trained #online

16028768920000000

#bias arises when we don't include what we consider to be #typical in a set of #features

16028777890000000

#reporting_bias occurs when the #frequency of events, properties, and/or outcomes captured in a #data set does not accurately reflect their real-world #frequency.

16028782390000000

#automation_bias is a tendency to favor results generated by #automated systems over those generated by #humans

16028782650000000

#selection_bias occurs if a #data set's examples are chosen in a way that is not reflective of their real-world #distribution

16028782910000000

#confirmation_bias, where model builders unconsciously process data in ways that affirm preexisting #beliefs and #hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called #experimenters_bias

16028784240000000

[bias]

#confirmation_bias #experimenters_bias #selection_bias #automation_bias #reporting_bias are all types of @bias

16028784540000000

#confusion_matrix summarizes how successful #predictions are: it is a #matrix of the true/false positive and negative counts, from which #precision and #recall are calculated

16028786790000000

in order to avoid #bias it is important to also test the model across the #categorical_data (e.g. only for men, only for women) with #recall, #precision or #accuracy in order to see if it's biased towards a certain category

16028787320000000

#detrended_fluctuation_analysis or #dfa is a method for determining the statistical #self_affinity of a #signal. It is useful for analysing #time_series that appear to be long-memory processes (diverging correlation time, e.g. #power_law decaying autocorrelation function) or #1f_noise.

16029688790000000

The obtained #exponent is similar to the #hurst_exponent, except that #dfa may also be applied to signals whose underlying statistics (such as #mean and #variance) or dynamics are #non_stationary (changing with time)

16029689080000000

In #dfa the scaling exponent #alpha is calculated as the #slope of a straight line fit to the log-#log graph of F(n) using least #squares. An exponent of 0.5 would correspond to #uncorrelated #white_noise, an exponent of 1 is #pink_noise

16029690580000000

Another way to detect #pink_noise is to build a graph where the x axis are the #events while the y axis records a #time_series estimation relative to the #standard_deviation from the #average (#mean) time interval.

16029693990000000

At its essence #pink_noise is based on #self_affinity and #self_similarity, so that no matter what scale you look at, the pattern is #similar (#scale_free)

16029694710000000

#power_spectral_analysis describes distribution of #power across #frequency components composing the #signal - for #pink_noise we have a 1/f relationship — few powerful signals with low frequency, a long tail of less powerful ones (of which there are many) (hence #1f_noise)

16029700860000000

#envelope is a smooth #curve outlining the extremes of a #signal. It can be extracted with the #hilbert_transform, and the resulting amplitude envelope is then analysed with #dfa or #detrended_fluctuation_analysis

16048776010000000

#detrended_fluctuation_analysis (#dfa) has proven particularly useful, revealing that genetic #variation, normal development, or #disease can lead to differences in the #scale_free #amplitude #modulation of oscillations https://www.frontiersin.org/articles/10.3389/fphys.2012.00450/full

16049192880000000

The reason why #chaotic #variation (#pink_noise) is indicative of a #healthy state is because it reflects #winnerless_competition behind the process. If there's a deviation in this dynamics (eg some #patterns), it could mean that one process is #dominating the rest.

16049193600000000

#self_affinity is a property of #fractal #time_series where the small parts of the whole are #similar to the whole

16049202650000000

#self_affinity processes and #self_similar structures have in common that the statistical #distribution of the measured quantity follows a #power_law function, which is the only mathematical function without a characteristic scale. Self-affine and #self_similar phenomena are therefore called "#scale_free.”

16049204010000000

In a #power_law #distribution the #mean would not necessarily be the same as the #median (the two are closer to each other in a #normal #distribution)

16049204950000000

A #power_law #distribution means that there is a big number of #small #variation and a small number of #big #variation (hence the line with a negative #slope when expressed as a #log)

16049207830000000

In a #1f #signal the lower #frequency objects have larger #amplitude than the higher #frequency objects (#1f_noise) https://www.frontiersin.org/files/Articles/23105/fphys-03-00450-HTML/image_m/fphys-03-00450-g001.jpg

16049214350000000

the #frequency of a certain #size of flower being inversely #proportional to its #size.

16049215210000000

a #time_series in which every #frequency is represented with the same #amplitude will lack the rich variability of the #scale_free #time_series and is referred to as "#white_noise"

16049216050000000

To estimate the #scale_free property we calculate the #standard_deviation (#signal in relation to #mean) over the differently sized #time_windows. If as the #time_windows size increases the #standard_deviation also increases, we're dealing with a #scale_free process. If the #scaling_effect is not there, then it's not a scale free process.

16049220900000000

a stationary #random #fluctuating process has a #signal profile, which is #self_affine with a #scaling_exponent α = 0.5

16049249650000000

when we add #memory in the sense that the #probability of an action depends on the previous actions that the walker has made — we will get a process that will exhibit #self_affinity across scales (#scale_free)

16049250440000000

Different classes of processes with #memory exist: those with #positive_correlation and those with #anti_correlation. Anti-correlations can be seen as a #stabilizing mechanism: a future action is more likely to be the opposite of the ones made before. In this case on longer windows (time scales) we will have lower #fluctuating, so the coefficient will be lower. α from 0 to 0.5: has #memory, #anti_correlation. α = 0.5: #random. α from 0.5 to 1: has #memory and #positive_correlation (previous actions increase the likelihood of that action being taken again) https://www.frontiersin.org/files/Articles/23105/fphys-03-00450-HTML/image_m/fphys-03-00450-g003.jpg

16049257140000000

for #dfa the signal is transformed into the #cumulative_signal, then it is split into #windows, using several window sizes spaced equally on the #log scale. For each #window the data is #detrended and the #standard_deviation is calculated. The #fluctuating function for a given window size is the mean #standard_deviation over all the #windows of that size. Then we plot it against the window size on #log scales. The #dfa exponent α is the #slope of the straight-line fit. A slope of 1 (a 45° line) means that the mean fluctuation grows in #proportional (#linear) relation to the #window size, while a slope of 0.5 corresponds to an uncorrelated process. It is the straight line on the log-log plot that shows the process is in fact #scale_free

16049268440000000
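
The #dfa steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: non-overlapping windows (no 50% overlap), linear detrending only, and hand-picked window sizes; the helper names are hypothetical and it is not a reference implementation.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def detrended_std(window):
    """Standard deviation of a window after removing its linear trend."""
    xs = list(range(len(window)))
    slope, intercept = linear_fit(xs, window)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, window)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def dfa_exponent(signal, window_sizes):
    """Slope of log(fluctuation) against log(window size)."""
    # Step 1: cumulative sum of the mean-subtracted signal.
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for value in signal:
        total += value - mean
        profile.append(total)
    log_sizes, log_fluctuations = [], []
    for size in window_sizes:
        # Step 2: split into non-overlapping windows of this size.
        windows = [profile[i:i + size]
                   for i in range(0, len(profile) - size + 1, size)]
        # Step 3: mean detrended standard deviation over the windows.
        fluctuation = sum(detrended_std(w) for w in windows) / len(windows)
        log_sizes.append(math.log(size))
        log_fluctuations.append(math.log(fluctuation))
    # Step 4: the exponent is the slope on the log-log plot.
    slope, _ = linear_fit(log_sizes, log_fluctuations)
    return slope


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic white noise
sizes = [16, 32, 64, 128, 256]                      # equally spaced on a log scale
alpha_white = dfa_exponent(white, sizes)
```

For synthetic white noise like this the estimate should land somewhere near 0.5, with the exact value depending on the seed and the chosen window sizes.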

The lower end of the fitting range is at least four samples, because #linear #detrending will perform poorly with less points (Peng et al., 1994). For the high end of the fitting range, #dfa estimates for window sizes >10% of the #signal length are more noisy due to a low number of windows available for averaging (i.e., less than 10 windows). Finally, the 50% overlap between windows is commonly used to increase the number of windows, which can provide a more accurate estimate of the fluctuation function especially for the long-time-scale windows.

16049270000000000

A #brown_noise process can be obtained by successively summing data points in the #white_noise process. https://www.researchgate.net/publication/232236967_A_tutorial_introduction_to_adaptive_fractal_analysis/figures?lo=1

16190243450000000
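
The successive summing can be sketched with itertools.accumulate. Gaussian samples and the fixed seed are assumptions made to get a reproducible white-noise input.

```python
import random
from itertools import accumulate


def white_noise(length, seed=0):
    """Independent Gaussian samples (assumed stand-in for white noise)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def brown_noise(white):
    """Running sum of a white-noise series (a random walk)."""
    return list(accumulate(white))


white = white_noise(1000)
brown = brown_noise(white)
```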

Using the classical #dfa method, the #cumulative_sum of data are divided into segments, and the #variance of these sums is studied as a function of segment length after linearly detrending them in each segment. https://www.nature.com/articles/s41598-019-42732-7

16190250690000000

In #dfa, data are divided into segments of length L and are #linearly detrended. The #square_root of the #variance (called #fluctuation) of the detrended data is studied as a function of L. It can be shown that a #linear relationship between the #logarithm of the #fluctuation and the #logarithm of L is indicative of a #power_law behavior of the spectrum. https://www.nature.com/articles/s41598-019-42732-7

16190255840000000

If a #linear relationship exists between the logarithm of the length of a #segment (or #time_windows) and the logarithm of the strength of the #fluctuation (the #square_root of the #variance of the detrended #cumulative_signal), the slope of the corresponding line is also referred to as the #hurst_exponent.

16190283930000000

For #white_noise the #hurst_exponent is 0.5: the #variance of the #cumulative_sum grows in a #linear way with the #time_windows, so when we double the #time_windows the #variance also doubles and the #fluctuation (square root of #variance) grows about 1.41 times.

16190320820000000

For #pink_noise (#1f_noise) the #hurst_exponent will equal #1, which means that for #time_windows twice as long the #fluctuation will about double (and its variance will increase about 4 times). In other words, the longer the #time_windows, the more #fluctuation occurs (#positive_correlation).

16190325510000000

#hurst_exponent in this context is #alpha_exponent, because we use #alpha_exponent for #non_stationary processes

16190326350000000

if #alpha_exponent is more than 1, it means that for every increase of scale (#time_windows) the cumulative_sum of #fluctuation increases a lot. That means, the longer we look at the process, the more likely it is to have big #fluctuation — there is a tendency in the #short_term to be #small and in the #long_term there's a tendency to be #big.

16190328080000000

the #cumulative_sum of the difference from the #average of a #time_series will be #brown_noise (#random_walk) for the #white_noise

16190339070000000

[accessed_apr_21_2021]

In contrast, #0.5 < #hurst_exponent < #1 indicates a #correlated process for #f_gn or what is termed a #persistent process for #f_bm. In this case, #increases in the signal (for #f_gn) or increments of the signal (for #f_bm) are likely to be followed by further #increase, and #decrease are likely to be followed by #decreases (i.e., a #positive #long_term #correlation). Anti-#persistent and #persistent processes contain #structure that distinguishes them from truly #random sequences of data. Source: A tutorial introduction to adaptive fractal analysis. Available from: https://www.researchgate.net/publication/232236967_A_tutorial_introduction_to_adaptive_fractal_analysis [accessed Apr 21 2021].

16190379890000000

The difference between the #exponent or #exponential_decay and the #power_law #decay is that #power_law #decay is slower: there are more values with a low #amplitude in the case of the #power_law https://math.stackexchange.com/questions/164436/difference-between-power-law-distribution-and-exponential-decay

16195411410000000
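
A small numeric comparison may make the difference in decay speed concrete. The rate of 1 and the exponent of 2 are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def exponential_decay(x, rate=1.0):
    """exp(-rate * x): shrinks by a constant factor per unit step."""
    return math.exp(-rate * x)

def power_law_decay(x, exponent=2.0):
    """x ** -exponent: shrinks by a constant factor per doubling of x."""
    return x ** -exponent

for x in (1, 10, 100):
    print(x, exponential_decay(x), power_law_decay(x))
```

For large `x` the exponential value is many orders of magnitude smaller than the power-law value, which is why the power-law tail is described as long.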

#downsampling (in this context) means #training on a disproportionately low subset of the #majority_class examples.

16209001190000000

#upweighting means adding an example #weight to the downsampled class equal to the factor by which you performed #downsampling.

16209001560000000
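
The two statements above can be sketched together. This is a minimal illustration, assuming examples are `(features, label)` pairs. The function name and the data are hypothetical.

```python
import random

def downsample_and_upweight(examples, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give the kept examples weight=factor."""
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]
    kept = rng.sample(majority, len(majority) // factor)
    weighted = [(features, label, float(factor)) for features, label in kept]
    weighted += [(features, label, 1.0) for features, label in minority]
    return weighted

# 200 majority examples (label 0) and 10 minority examples (label 1)
data = [([i], 0) for i in range(200)] + [([i], 1) for i in range(10)]
result = downsample_and_upweight(data, majority_label=0, factor=20)
```

With a factor of 20, the 10 kept majority examples each carry weight 20, so their total weight still matches the original 200 examples.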

#normalizing - transforming #numeric data to the same #scale as other #numeric data.

16209003410000000

#bucketing - transforming #numeric (usually #continuous) #data to #categorical_data.

16209003590000000

#scaling means converting #floating_point #feature #values from their #natural #range (for example, 100 to 900) into a #standard #range, usually 0 to 1

16209005210000000
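
A minimal sketch of this kind of scaling, assuming the common min-max formula. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from their natural range to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

scaled = min_max_scale([100, 300, 500, 900])
```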

If your data set contains extreme #outliers, you might try #feature_clipping, which caps all feature #values above (or below) a certain value to a fixed value. https://developers.google.com/machine-learning/data-prep/transform/normalization

16209005990000000
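
Clipping can be sketched in a couple of lines. The bounds used here are arbitrary assumptions, and `clip` is a hypothetical helper name.

```python
def clip(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

clipped = clip([-3, 5, 40, 1000], lower=0, upper=40)
```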

#log_scaling computes the #log of your values to compress a wide #range to a narrow #range. #log_scaling is helpful when a handful of your values have many points, while most other values have few points. This data #distribution is known as the #power_law #distribution. Movie ratings are a good example. In a chart of ratings per movie, most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head). #log_scaling changes the #distribution, helping to improve linear model performance.

16209006980000000
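
A minimal sketch of log scaling, assuming strictly positive values and the natural logarithm. The rating counts are made up for illustration.

```python
import math

def log_scale(values):
    """Replace each positive value with its natural logarithm."""
    return [math.log(v) for v in values]

ratings_per_movie = [1, 10, 100, 10000]  # hypothetical counts
compressed = log_scale(ratings_per_movie)
```

The original values span four orders of magnitude, while the scaled values span less than 10 units.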

#z_score is a #variation of #scaling that represents the number of #standard_deviations away from the #mean. You would use z-score to ensure your #feature distributions have #mean = 0 and std = 1. It’s useful when there are a few #outliers, but not so extreme that you need #clipping.

16209007620000000
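
The z-score can be sketched with the standard library. This assumes the population standard deviation, and `z_scores` is a hypothetical name.

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant feature: every value equals the mean
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

After the transformation the values have mean 0 and standard deviation 1.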

#transformation of #numeric #features into #categorical #features, using a set of #thresholds, is called #bucketing (or #binning) - creating #buckets

16209010880000000

creating #buckets that each have the same number of points. This technique is called #quantile_bucketing.

16209023430000000
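
Quantile bucketing can be sketched by ranking the values and splitting the ranks evenly. This is one simple way to do it, not the only one, and `quantile_bucket` is a hypothetical name. Tied values may land in different buckets in this version.

```python
def quantile_bucket(values, num_buckets):
    """Assign bucket indices so each bucket holds (nearly) the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * num_buckets // len(values)
    return buckets

buckets = quantile_bucket([1, 2, 3, 4, 100, 200, 1000, 5000], num_buckets=4)
```

Each of the 4 buckets receives 2 points, even though the values are spread very unevenly.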

the mapping that represents each #categorical #value with a #number (an index) is called a #vocabulary

16209035000000000

#one_hot_encoding represents #categorical #values (through their #numeric indices) as #vectors with a single 1 and 0 elsewhere, which can then be further compressed with a #sparse_vector

16209035990000000
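
A minimal sketch of a vocabulary, a one-hot vector, and a sparse representation. All function names are hypothetical, and the sparse form is assumed to be a dict of non-zero positions.

```python
def build_vocabulary(categories):
    """Map each distinct category to an integer index."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(category, vocabulary):
    """Vector with a 1 at the category's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[category]] = 1
    return vector

def to_sparse(vector):
    """Keep only the non-zero positions as {index: value}."""
    return {i: v for i, v in enumerate(vector) if v != 0}

vocabulary = build_vocabulary(["red", "green", "blue", "green"])
dense = one_hot("green", vocabulary)
sparse = to_sparse(dense)
```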

#grouping #unlabeled #examples is called #clustering. As the examples are #unlabeled, #clustering relies on #unsupervised #machine_learning. If the examples are #labeled, then #clustering becomes #classification.

16209055020000000

#hierarchical_clustering creates a tree of #clusters. #hierarchical #clustering, not surprisingly, is well suited to #hierarchical #data, such as #taxonomies.

16209067630000000

#distribution_based_clustering: this #clustering approach assumes #data is composed of #distributions, such as #gaussian #distributions. Then it #clusters them accordingly.

16209068320000000

#density_based_clustering connects areas of high example #density into #clusters. This #clustering allows for arbitrary-shaped #distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying #densities and high #dimensions and also with #outliers.

16209068870000000

#centroid_based_clustering organizes the data into #non_hierarchical_clusters, in contrast to #hierarchical #clustering. #k_means is the most widely-used centroid-based #clustering #algorithm.

16209069360000000

In order to perform #clustering we need to quantify the #similarity between examples by creating the #similarity_metrics for our #dataset

16209070400000000

for #data #processing we need to create #quantiles (use #quantile_bucketing) when the #distribution is neither #gaussian nor #power_law, for example #poisson.

16209071750000000

when the #distribution is #gaussian we can apply #normalizing to our #data

16209072190000000

when the #distribution is #power_law we might want to use #log_scaling #normalizing for our data

16209072510000000

we can do either #manual #similarity or #supervised #similarity. you switch to a #supervised_similarity_measure when you have trouble creating a #manual_similarity_measure.

16209074210000000

#mean_square_error shows the #average squared #loss for an #example

16209075920000000

we can calculate #similarity between two examples from the root #mean_square_error (or the #sums) of the differences between their #features (e.g. size and price). The lower the value, the higher the similarity.

16209117140000000

For #categorical_data we can calculate #similarity using #jaccard_similarity, which shows the ratio of the intersection to the union of the #sets

16209118550000000
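
Jaccard similarity can be sketched directly with Python sets. The genre sets are hypothetical data, and treating two empty sets as identical is an assumption of this sketch.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:  # assumption: two empty sets count as identical
        return 1.0
    return len(a & b) / len(a | b)

similarity = jaccard_similarity({"action", "comedy"}, {"comedy", "drama"})
```

Here one genre is shared out of three distinct genres, so the similarity is 1/3.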

#k_means groups points into #clusters by minimizing the #distances between points and their #cluster’s #centroid. The #centroid of a #cluster is the #mean of all the points in the #cluster.

16209172940000000
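
The idea can be sketched with the classic alternating procedure (often called Lloyd's algorithm), which is one common way to run k-means. The initial centroids are passed in by hand here as an assumption, where real implementations usually pick them randomly. The function name and the data are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster's points; keep the old one if a cluster is empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

On this toy data the two groups of three points separate after the first iteration and the centroids stop moving.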

Instead of comparing manually-combined #feature #data, you can reduce the #feature #data to #representations called #embeddings, and then compare the #embeddings

16209181460000000

#embeddings are generated by training a #supervised deep neural network (#dnn) on the #feature data itself. The #embeddings map the #feature data to a #vector in an #embedding_space. Typically, the #embedding_space has fewer dimensions than the #feature data in a way that captures some #latent #structure of the #feature data set.

16209184210000000

A #dnn that learns #embeddings of #input data by predicting the #input data itself is called an #autoencoder. An #autoencoder is the simplest choice to generate #embeddings. However, an #autoencoder isn't the optimal choice when certain features could be more important than others in determining #similarity.

16209185270000000

A #dnn that predicts one specific input #feature, instead of predicting all input #features, is called a predictor #dnn

16209185550000000

To train the #dnn, you need to create a #loss_function by following these steps: 1) calculate the #loss for every #output of #dnn. For #numeric outputs use #mean_square_error, for #categorical use #log_loss, for #multivalent #categorical use #softmax_cross_entropy (#entropy) loss.

16209187000000000
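
Two of the losses named above can be sketched in plain Python. The function names are hypothetical. The log loss here is the binary form, and the small `eps` clamp is an assumption added to avoid taking the logarithm of zero.

```python
import math

def mean_square_error(targets, predictions):
    """Average squared difference, for numeric outputs."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def log_loss(labels, probabilities, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1}."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

mse = mean_square_error([1, 2, 3], [1, 2, 5])
ll = log_loss([1, 0], [0.5, 0.5])
```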

in #poisson distribution the #decay happens much faster than in the #power_law #distribution: if in #power_law you have a significant number of nodes in the #tail, then in #poisson you only have a few.

16217118970000000

A #similarity measure takes the #embeddings generated by our neural network (#dense_features) and returns a number measuring their #similarity.

16219366980000000

To calculate #similarity we have 3 measures to choose from: #euclidian_distance (the length of the difference between the vectors), #cosine_distance (cosine of the angle between the vectors) and the #dot_product (cosine multiplied by the lengths of both vectors)

16219369490000000
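
The three measures can be compared on a pair of vectors that point in the same direction but differ in length. This is a minimal sketch with hypothetical function names.

```python
import math

def euclidean_distance(a, b):
    """Length of the difference between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between the vectors (independent of their lengths)."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a, b = (3, 4), (6, 8)  # same direction, b is twice as long
print(euclidean_distance(a, b), cosine(a, b), dot_product(a, b))
```

The cosine is 1 because the direction is identical, while the dot product (50) grows with the lengths of the vectors.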

In contrast to the #cosine_distance, the #dot_product is proportional to the #vector #length. This is important because examples that appear very frequently in the training set (for example, popular YouTube videos) tend to have embedding #vectors with large #lengths. If you want to capture #popularity, then choose #dot_product.

16219370590000000

#cluster #cardinality is the number of examples per #cluster. We are looking for #outliers and if we do find them, this may indicate some interesting #patterns

16219382860000000

Cluster #magnitude is the sum of #distances from all examples to the #centroid of the #cluster. Similar to #cardinality, check how the #magnitude varies across the #clusters, and investigate #anomalies and #outliers.

16219383310000000

Notice that a higher #cluster #cardinality tends to result in a higher #cluster #magnitude, which intuitively makes sense. Clusters are #anomalous when #cardinality doesn't correlate with #magnitude relative to the other #clusters.

16219388730000000
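
Cardinality and magnitude can be computed per cluster with a short helper. This is a sketch assuming Euclidean distance. The name `cluster_stats` and the points are hypothetical.

```python
import math

def cluster_stats(cluster, centroid):
    """Return (cardinality, magnitude) for one cluster."""
    cardinality = len(cluster)
    magnitude = sum(math.dist(p, centroid) for p in cluster)
    return cardinality, magnitude

cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
cardinality, magnitude = cluster_stats(cluster, centroid=(1, 1))
```

Comparing the ratio of magnitude to cardinality across clusters is one way to spot the clusters that do not follow the general trend.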

#content_based_filtering uses #similarity between items to #recommend items similar to what the user likes.

16219394200000000

#collaborative_filtering uses #similarity between #queries and #items simultaneously to provide #recommendations.

16219394430000000

Both #content_based_filtering and #collaborative_filtering map each #item and each #query (or #context) to an #embedding_vector

16219395220000000

#recommendations - We again place our #users in the same #embedding_space to best explain the #feedback_matrix: for each (#user, #item) pair, we would like the #dot_product of the #user #embedding and the #item #embedding to be close to 1 when the #user watched the movie, and to 0 otherwise.

16219406780000000

The #dot_product of the #user_matrix and #item_matrix yields a #recommendation #matrix that contains not only the original user ratings but also #predictions for the movies that each user hasn't seen

16219414650000000
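
A minimal sketch of how the recommendation matrix follows from the two embedding matrices. The embeddings are made-up two-dimensional values rather than learned ones, and `predict_ratings` is a hypothetical name.

```python
def predict_ratings(user_embeddings, item_embeddings):
    """Entry [u][i] is the dot product of user u's and item i's embedding."""
    return [
        [sum(x * y for x, y in zip(user, item)) for item in item_embeddings]
        for user in user_embeddings
    ]

users = [[1.0, 0.0], [0.0, 1.0]]              # 2 users
items = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # 3 items
ratings = predict_ratings(users, items)
```

Every (user, item) pair gets a score, including pairs for which no feedback was observed.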

#matrix_factorization: in math, a mechanism for finding the matrices whose #dot_product approximates a #target_matrix.

16219414960000000

#generative_adversarial_networks (#gans) are an exciting recent innovation in #machine_learning. #gans are #generative models: they create new #data_instances that resemble your #training_data. For example, #gans can create images that look like photographs of human faces, even though the faces don't belong to any real person.

16219422540000000

#gans achieve this level of realism by pairing a #generator, which learns to produce the target output, with a #discriminator, which learns to distinguish true data from the output of the #generator. The #generator tries to fool the #discriminator, and the #discriminator tries to keep from being fooled.

16219425150000000

"#generative" describes a class of statistical models that contrasts with #discriminative models. #generative models can generate new #data #instances. #discriminative models #discriminate between different kinds of #data #instances.

16219446830000000

More formally, given a set of #data #instances X and a set of #labels Y: #generative models capture the joint #probability p(X, Y), or just p(X) if there are no #labels. #discriminative models capture the #conditional #probability p(Y | X).

16219447850000000

The #generator learns to generate #plausible #data. The generated instances become negative #training #examples for the #discriminator.

16219452940000000

The #discriminator learns to distinguish the #generator's #fake #data from #real #data. The #discriminator #penalizes the #generator for producing implausible results.

16219453280000000

Through #backpropagation, the #discriminator's #classification provides a signal that the #generator uses to update its #weights.

16219453920000000

The #discriminator in a #gan is simply a #classifier. It tries to distinguish real #data from the #fake #data created by the #generator.

16220231030000000

The #discriminator connects to two #loss functions. During #discriminator training, the #discriminator ignores the #generator #loss and just uses the #discriminator #loss.

16220237690000000

The #generator part of a #gan learns to create #fake data by incorporating #feedback from the #discriminator. It learns to make the #discriminator #classify its output as real.

16220238440000000

The #generator feeds into the #discriminator net, and the #discriminator produces the output we're trying to affect. The #generator #loss penalizes the #generator for producing a sample that the #discriminator network classifies as #fake.

16220239740000000

#research has suggested that if your #discriminator is too good, then #generator training can fail due to #vanishing_gradients. In effect, an optimal #discriminator doesn't provide enough #information for the #generator to make #progress.

16220244120000000

#wasserstein_loss: The #wasserstein_loss is designed to prevent #vanishing_gradients even when you train the #discriminator to #optimality.

16220244420000000

#convolutional_neural_network (#cnn) could be used to progressively extract higher- and higher-level #representations of the image #content.

16220254470000000

#backpropagation is the process of calculating the #gradient of the #loss_function with respect to the #weights of a #neural_network. The #gradient shows in which direction the learning should take place to bring the #loss_function toward its minimum.

16223812730000000
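
The role of the gradient can be shown in the simplest possible case: a model with a single weight and a mean squared error loss. This is only a sketch of gradient descent on one parameter, not backpropagation through a multi-layer network. The learning rate and the data are arbitrary assumptions.

```python
def train(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    """Fit y = w * x by repeatedly stepping against the gradient of the loss."""
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x), via the chain rule
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient  # move in the direction that lowers the loss
    return w

learned_w = train([1, 2, 3], [2, 4, 6])
```

The data follow y = 2x, so the weight converges toward 2.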

in the process of #learning each #layer of a #neural_network will have #neurons with #weights ascribed to them, which enable #differentiation of various #features

16223813890000000

The #weights of the neurons combine through each #layer using the #activation_function (e.g. #sigmoid or #relu), which then leads to only a certain neuron at the last layer getting activated. It is tempting to say that the first #layer detects some general #features, the next one more specific ones, and so on, but this is not strictly the case.

16223816610000000
