#machine_learning #supervised #unsupervised
16010218210000000
#features are used in #machine_learning for #differentiation
16010218950000000
#features are used for #training in #machine_learning
16010219230000000
the #relationship is called a #model in #machine_learning
16010219690000000
#supervised #machine_learning finds #patterns between #data and #labels
16010220170000000
#patterns are used to make #predictions
16010220240000000
#supervised works with #labeled #data
16010220620000000
the goal of #unsupervised is to identify meaningful #patterns in #data
16010221780000000
sometimes a #model can find #patterns that represent #stereotypes or #bias
16010222420000000
#clustering is a type of #unsupervised #learning
16010222680000000
with #reinforcement_learning (RL) you set up a #model (called an #agent in RL) where it receives a #reward each time it performs well (#reward_function)
16010223910000000
a #shaped #reward increases in the states closer to the #goal #state
16010228960000000
a #sparse #reward is in the #goal #state only
16010229280000000
#positive_reinforcement is an important element of #reinforcement_learning
16010233020000000
if the #reward provides the #features to the #model, that could improve the performance
16010233700000000
#machine_learning #problems: #classification, #regression, #clustering, #association learning, #structured_output, #ranking
16010234640000000
#clustering is an #unsupervised #learning problem
16010235060000000
#regression requires labeled data — #supervised learning problem
16010235420000000
#classification requires a set of #labels - so it is #supervised
16010235640000000
a #neural_network works through #representations
16010236980000000
the #machine_learning process can be seen as an #experiment where we run #test after #test to converge on a workable #model
16010242270000000
A well-defined #problem has both #inputs and #outputs. #inputs are the #features. #outputs are the #labels to predict.
16011504310000000
#training means creating or #learning the #model
16011570040000000
#inference means applying the trained #model to #unlabeled #examples
16011570150000000
A #regression #model #predicts continuous #values.
16011570330000000
A #classification #model #predicts #discrete #values.
16011570470000000
#machine_learning needs to provide #decisions rather than just #predictions
16011573670000000
#labels are the #variables or #values for #predictions
16012012510000000
#features are #input #variables describing the #data
16012012690000000
#model_training is done on the #data, which has #features and #labels, so that the model knows what #correlations to extract
16012013200000000
#model maps #examples to predicted #labels
16012013740000000
the #loss_function shows us the degree of #deviation of the #model #prediction from the real #values
16012017050000000
#loss_function can be the squared #difference between the #prediction and the #labels
16012017430000000
a simple #loss_function is the raw error: #observation - #prediction
16012017860000000
#mean_square_error estimates the average #deviation of the #prediction from the #labels across all elements
16012020980000000
#model_training is usually based on reducing the #loss in the #loss_function, e.g. via #mean_square_error (but not only)
16012022160000000
the #gradient_descent approach is used to minimize the #loss_function
16012045160000000
the #learning_rate determines the step size of #gradient_descent
16012047110000000
an #epoch is a full pass over the #training data (N / #batch size iterations) in #machine_learning
16012051550000000
when the #training #loss stops decreasing we say that the model has #converged
16012058970000000
the #goldilocks #learning rate for a curve is the one where #gradient descent reaches the minimum point in the fewest number of steps
16025817060000000
In #supervised #learning, a machine #learning algorithm builds a model by examining many examples and attempting to find a model that minimizes #loss; this process is called empirical #risk_minimization.
16025818180000000
#loss is the penalty for a bad #prediction. That is, loss is a number indicating how bad the model's #prediction was on a single example.
16025818380000000
Mean square error (#mse) is the average squared #loss per example over the whole #dataset.
16025818610000000
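a minimal numpy sketch of #mean_square_error (the array values are made up):

```python
import numpy as np

def mse(predictions, labels):
    """Mean squared error: the average squared loss per example."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float(np.mean((labels - predictions) ** 2))

# Hypothetical predictions vs labels: (0.5^2 + 0^2 + 1^2) / 3 = 0.4166...
print(mse([1.5, 2.0, 2.0], [1.0, 2.0, 3.0]))
```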
#iterative #learning is used to decrease #loss over time by making small steps and receiving #feedback on the #outputs
16025819620000000
Usually, you #iterate until overall #loss stops changing or at least changes extremely slowly. When that happens, we say that the #model has #converged.
16025822530000000
#epoch represents a full training pass over the entire #dataset such that each example has been seen once. Thus, an #epoch represents N / #batch size training iterations, where N is the total number of examples.
16025843760000000
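a worked example of the N / #batch size arithmetic (the numbers are made up):

```python
N = 1_000        # total number of training examples (hypothetical)
batch_size = 50  # examples per iteration
iterations_per_epoch = N // batch_size
print(iterations_per_epoch)  # 20 iterations make up one full epoch
```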
#learning_rate is a scalar used to #train a model via #gradient_descent. During each iteration, the #gradient_descent algorithm multiplies the #learning_rate by the #gradient. The resulting product is called the #gradient step. #learning_rate is a key #hyperparameter.
16025844630000000
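a minimal sketch of one #gradient_descent step, assuming a toy function f(w) = w^2 whose gradient is 2w:

```python
def gradient_step(weight, gradient, learning_rate=0.1):
    """One gradient-descent update: the step is learning_rate * gradient."""
    return weight - learning_rate * gradient

# Each step multiplies w by (1 - 0.1 * 2) = 0.8, moving toward the minimum at 0.
w = 5.0
for _ in range(3):
    w = gradient_step(w, gradient=2 * w)
print(w)  # 5 * 0.8**3 = 2.56
```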
#batch is a number of #examples used in an #iteration
16025845230000000
#anomalies in #features may indicate a potential problem in a #dataset - one should be more careful using that sort of data.
16025888810000000
#learning_rate specifies the size of the step, #batch specifies how many elements we take into the learning process, #epoch specifies how many full passes over the #data we're going to have
16025892340000000
a #synthetic_feature is made out of several #features and may help #prediction
16025896150000000
a #correlation_matrix shows if there are any #correlations between the #features
16025898840000000
we take a #training_set and a #test_set from our #data, then we train the model on the #training_set to see how well it #predicts on the #test_set
16025905230000000
An #overfit model gets a low #loss during training but does a poor job #predicting new data.
16026035820000000
#machine_learning 's goal is to predict well on new #data drawn from a (hidden) true #probability #distribution.
16026036480000000
The less complex a #machine_learning model, the more likely that a good #empirical result is not just due to the #peculiarities of the #sample.
16026036890000000
Partitioning a #data set into a #training_set and #test_set lets you judge whether a given #model will generalize well to new #data.
16026042880000000
#partitioning a #data set into a #training_set and a #test_set
16026044740000000
a #training_set can be split into a smaller #training_set and a #validation_set, alongside the #test_set, so that the model can be trained better and no #overfitting occurs
16026049020000000
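a minimal sketch of such a three-way split, assuming scikit-learn and made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50)                  # toy labels

# Hold out a test set first, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# Result: 60% training, 20% validation, 20% test.
```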
#features are very important for #machine_learning models
16026135580000000
#feature engineering should remove #outliers (which might lead to the problem that a #model cannot #predict outstanding events)
16026138050000000
#data_visualization is important for knowing #data and improving #machine_learning models
16026138800000000
#one_hot_encoding allows us to incorporate categorical #data into our #model
16026141480000000
for very large #values a #sparse #representation is used
16026141940000000
#binning_values allows us to simplify the #data and bring it into a #feature_vector using #one_hot_encoding
16026145190000000
#binning_values by #quantile ensures the number of #examples in each bucket is the same
16026145840000000
#feature_crossing can be a powerful way to improve #prediction by combining the #data #features of the #dataset
16026156950000000
#feature_crossing is often done for #one_hot_encoding where multiple #features are crossed to produce an interesting #feature_vector
16026185150000000
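a minimal sketch of how crossing two #one_hot_encoding (binned) features yields one index in a larger crossed vocabulary (bucket counts and indices are made up):

```python
lat_bin, lon_bin = 3, 7   # hypothetical bucket indices for latitude and longitude
n_lon_bins = 10

# The cross [lat x lon] is itself one-hot: a single index in the crossed space.
crossed_index = lat_bin * n_lon_bins + lon_bin
print(crossed_index)  # 37
```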
#feature_crossing is one learning strategy; a #neural_network is another
16026188030000000
minimizing #loss_complexity (loss plus complexity), which is called #structural_risk_minimization, allows us to avoid #overfitting the #model
16026258790000000
the #loss_term measures how well the #model fits the #data; the #regularization term measures model #complexity.
16026275040000000
model #complexity can be defined as a function of the #weights of all the #features in the #model.
16026276470000000
model #complexity can also be defined as a function of the total number of #features with nonzero #weights.
16026276470000000
#model developers tune the overall impact of the #regularization term by multiplying its value by a #scalar known as #lambda (also called the #regularization rate).
16026692460000000
#regularization term, which measures model #complexity.
16026692710000000
#regularization is a technique used in an attempt to solve the #overfitting problem in statistical models.
16026697380000000
#regularization penalizes the #loss_function in that it pushes the model to give lower #weights to each parameter in the model
16026698200000000
a model is #learning the #weights for each of the #features as it is #training itself to minimize #loss and #complexity
16026713960000000
#regularization penalizes the #loss_function for too much #complexity (a high number of #features with nonzero #weights )
16026716020000000
the #sigmoid_function maps the output of the #linear_layer of a model trained with #logistic_regression to a value between zero and one
16026720630000000
#logistic_regression returns a #probability for a #classification
16026750260000000
In order to map a #logistic_regression value to a binary category, you must define a #classification_threshold (also called the #decision_threshold) #threshold
16026750440000000
A #true_positive is when a model correctly #predicts the #positive_class (it made a prediction and the prediction was true). A #true_negative is when a model correctly #predicts the #negative_class
16026753330000000
A #false_positive is when a model incorrectly #predicts the #positive_class. A #false_negative is when a model incorrectly #predicts the #negative_class
16026753790000000
#accuracy of a model #prediction is the ratio of the #correct_predictions to the total number of #predictions
16026754770000000
#prediction #predicts
16026754870000000
#predicting the #prediction
16026755020000000
#accuracy then is the ratio of #true_positive plus #true_negative to the sum of all predictions (#true_negative + #true_positive + #false_negative + #false_positive)
16026760070000000
#accuracy alone doesn't tell the full story when you're working with a #class_imbalanced_data_set, like this one, where there is a significant #disparity between the number of #positive and #negative #labels.
16026761660000000
#precision is a ratio of #true_positive to #total_positives claimed (#true_positive + #false_positive)
16026763640000000
#recall is the ratio of #true_positive to the #total_positives that really happened (#true_positive + #false_negative)
16026764590000000
the #classification_threshold should strike a balance between #precision and #recall, keeping both values as high as possible (raising one typically lowers the other)
16026766240000000
#precision is based on a #claim and #recall is based on #reality
16026767150000000
The #f1_score is the harmonic #mean of the #precision and #recall
16026767960000000
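a minimal sketch computing #accuracy, #precision, #recall and the #f1_score from confusion-matrix counts (the counts are made up):

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the positives we claimed, how many were real
    recall = tp / (tp + fn)      # of the real positives, how many we found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

print(classification_metrics(tp=8, tn=85, fp=2, fn=5))
# (0.93, 0.8, 0.615..., 0.695...)
```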
#roc_curve (receiver operating characteristic curve) is a graph showing the performance of a #classification model at all #classification thresholds: the #true_positive rate vs the #false_positive rate
16026890920000000
#auc provides an aggregate measure of performance across all possible #classification #threshold values
16026891300000000
#true_positive rate is basically the #recall because it's the relation of #true_positive to the sum of #true_positive and #false_negative
16026894150000000
the #false_positive rate is the reverse of that: the ratio of the #claim of the positive that is not true to #reality, i.e. #false_positive to the sum of #false_positive and #true_negative
16026894840000000
#logistic_regression #predictions should be #unbiased. That is: "average of #predictions" should ≈ "average of observations"
16026907140000000
#prediction_bias is a quantity that measures how far apart the #predictions are from the #observations
16026907520000000
A #z_score is the number of #standard_deviations from the #mean for a particular raw value
16027931560000000
#sparse_vector often contain many #dimensions. Creating a #feature_cross results in even more #dimensions. Which may lead to a higher use of #resources and #memory
16028434020000000
in a high-dimensional #sparse_vector it is good to encourage as many #weights as possible to be zero, so that we reduce the #complexity of the #model and the toll on #resources
16028435460000000
#l2_regularization encourages #weights to be small, but doesn't force them to exactly zero
16028436030000000
#l2_regularization is a sum of the squared #weights and it encourages them to be smaller to reduce the #complexity of the model
16028437430000000
#lambda is used together with #l2_regularization to tune how strongly #complexity is reduced without penalizing the #weights too much
16028437990000000
#regularization and #l2_regularization
16028438170000000
#l2_regularization penalizes the #weights squared, while #l1_regularization penalizes the absolute value of the #weights
16028439830000000
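a minimal numpy sketch of the two penalty terms (the weights, #lambda and data loss are made up):

```python
import numpy as np

weights = np.array([0.8, -0.5, 0.0, 2.0])

l1_penalty = np.sum(np.abs(weights))  # L1: sum of absolute weights -> 3.3
l2_penalty = np.sum(weights ** 2)     # L2: sum of squared weights  -> 4.89

data_loss = 1.25  # hypothetical loss term measuring fit to the data
lam = 0.1         # lambda, the regularization rate
total_loss = data_loss + lam * l2_penalty
```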
if we take the #derivative of #l1_regularization we get a constant, so it can push #weights to exactly zero and remove those #values from the model
16028440060000000
#neural_networks are a more sophisticated version of #feature_cross. In essence, neural networks do the appropriate #feature_crossing for you.
16028443800000000
"#nonlinear" means that you can't accurately predict a #label with a model of the form In other words, the "#decision_surface" is not a #line
16028447790000000
#feature_cross is one possible approach to modeling #nonlinear problems.
16028448130000000
a #linear model (#linear_layer) can be represented as a #graph: the #inputs are the #features and the #output is the weighted sum of the #inputs (using the #weights)
16028449160000000
where a #linear model doesn't work we can use a #nonlinear one
16028449330000000
a #hidden_layer is a weighted sum of the #input #values
16028450810000000
a #hidden_layer is a combination of #inputs
16028451020000000
a #hidden_layer is still part of a #linear model
16028451240000000
a #linear model cannot serve #nonlinear problems (e.g. it cannot fit the #predictions to a curve or identify certain areas or complex #patterns)
16028452470000000
that's why we want to introduce a #nonlinear model - we do that by piping each #hidden_layer node through a #nonlinear function
16028452920000000
the #nonlinear function is called the #activation_function - this lets us model very complicated #relations between the #inputs and #outputs
16028453710000000
the #sigmoid #nonlinear #activation_function converts the weighted sum to a value between 0 and 1
16028454550000000
the #rectified #linear unit #activation_function (or #relu, for short) often works a little better than a smooth function like the #sigmoid, as #relu helps add #nonlinear dynamics into the layers of the #inputs
16028455550000000
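a minimal numpy sketch of both #activation_function variants (the input values are made up):

```python
import numpy as np

def sigmoid(x):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

z = np.array([-2.0, 0.0, 3.0])  # hypothetical weighted sums from a hidden layer
print(sigmoid(z))  # [0.119 0.5   0.953]
print(relu(z))     # [0.    0.    3.   ]
```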
#redundancy can be important for a #neural_network as it increases the possibility of #feature_cross that is useful
16028461870000000
an extreme case of #overfitting is #memorizing in which case rather than learning the general #ground_truth the model starts to adapt to the peculiarities and specificities in the #training_set so it becomes less fit to detect the new #patterns in a new set of #data
16028462920000000
Another form of #regularization, called #dropout, is useful for neural networks. It works by randomly "dropping out" unit #activations in a network for a single gradient step.
16028687260000000
#multi_class #neural_networks help identify multiple #labels
16028691760000000
Given a #classification problem with N possible solutions, a #one_vs_all solution consists of N separate #binary #classifiers—one #binary classifier for each possible #outcome.
16028693940000000
#softmax extends the idea of #logistic_regression into a #multi_class world. That is, #softmax assigns decimal #probabilities to each class in a #multi_class problem.
16028695230000000
instead of the #binary answer in #one_vs_all layer, #softmax gives a #probability for each #outcome
16028696060000000
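a minimal numpy sketch of #softmax turning raw scores into #probabilities (the logits are made up):

```python
import numpy as np

def softmax(logits):
    """Decimal probabilities over all classes, summing to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]
```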
for example, a #number #classification problem is a #multi_class #classification problem with 10 output classes, one for each digit.
16028708920000000
#collaborative_filtering is the task of making #predictions about the #interests of a user based on #interests of many other #users.
16028734010000000
an #embedding_space maps data by its #features so that the items that are more #similar (or are more likely to be used together) are closer to each other in the #space.
16028752990000000
the #embedding_space can consist of many #dimensions and some of them might not have the exact semantic #meanings in which case they are called #latent_dimension representing a #feature that is not explicit in the #data but is rather inferred from it.
16028754420000000
ultimately it's the distances between the data elements that are important in #embedding_space not the actual #values.
16028754880000000
#categorical_data refers to input #features that represent one or more discrete items from a finite set of choices. For example, it can be the set of movies a user has watched, the set of words in a document, or the occupation of a person. #categorical_data is most efficiently represented via #sparse_tensors which are tensors with very few non-zero elements (also see #sparse_vector)
16028756220000000
in order to use #sparse_vector #representations within a machine learning system, we need a way to represent each #sparse_vector as a vector of #numbers so that semantically #similar items (movies or words) have #similar distances in the vector space. But how do you represent a word as a vector of #numbers?
16028757450000000
for example, in #one_hot_encoding you would map one of the 500 000 words in a vocabulary as a #sparse_vector where item 2019 would be 1 and the rest are zero (the vector represents the word)
16028758160000000
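a minimal sketch of that #one_hot_encoding and its #sparse_vector form (the vocabulary size and index follow the note above):

```python
import numpy as np

VOCAB_SIZE = 500_000
word_index = 2019

one_hot = np.zeros(VOCAB_SIZE)  # dense: all zeros except a single 1
one_hot[word_index] = 1.0

sparse = {word_index: 1.0}      # sparse: store only the non-zero entry
```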
a "#bag_of_words" #representation contains chunks of words in a #sparse_vector so several values are 1 and most are zero
16028758770000000
#embeddings translate large #sparse_vector into a lower-dimensional #space that preserves #semantic relationships.
16028760620000000
An #embedding is a #matrix in which each column is the #vector that corresponds to an item in your #vocabulary. To get the dense #vector for a single #vocabulary item, you retrieve the column corresponding to that item.
16028762170000000
#principal_component_analysis (PCA) has been used to create #word #embeddings. Given a set of instances like #bag_of_words #vectors, PCA tries to find highly correlated #dimensions that can be collapsed into a single #dimension.
16028763630000000
#word2vec is an example of learning #word #embeddings from a #sparse_vector input: mapping semantically #similar words to geometrically close #embedding #vectors
16028765650000000
#word2vec exploits contextual information like this by training a neural net to distinguish actually co-occurring groups of #words from randomly grouped #words. The #input layer takes a #sparse_vector representation of a target word together with one or more context words.
16028766440000000
#static_model is trained #offline a #dynamic_model is trained #online
16028768920000000
#bias arises when we don't include what we consider to be #typical in a set of #features
16028777890000000
#reporting_bias occurs when the #frequency of events, properties, and/or outcomes captured in a #data set does not accurately reflect their real-world #frequency.
16028782390000000
#automation_bias is a tendency to favor results generated by #automated systems over those generated by #humans
16028782650000000
#selection_bias occurs if a #data set's examples are chosen in a way that is not reflective of their real-world #distribution
16028782910000000
#confirmation_bias, where model builders unconsciously process data in ways that affirm preexisting #beliefs and #hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called #experimenters_bias
16028784240000000
[bias]
#confirmation_bias #experimenters_bias #selection_bias #automation_bias #reporting_bias are all types of #bias
16028784540000000
#confusion_matrix summarizes how successful #predictions are (#precision and #recall can be derived from it)
16028786790000000
in order to avoid #bias it is important to also test the model across #categorical_data slices (e.g. only for men, only for women) with #recall, #precision or #accuracy in order to see if it's biased towards a certain category
16028787320000000
#detrended_fluctuation_analysis or #dfa is a method for determining the statistical #self_affinity of a #signal. It is useful for analysing #time_series that appear to be long-memory processes (diverging correlation time, e.g. #power_law decaying autocorrelation function) or #1f_noise.
16029688790000000
The obtained #exponent is similar to the #hurst_exponent, except that #dfa may also be applied to signals whose underlying statistics (such as #mean and #variance) or dynamics are #non_stationary (changing with time)
16029689080000000
In #dfa the scaling exponent #alpha is calculated as the #slope of a straight line fit to the log-log graph of F(n) using least #squares. An exponent of 0.5 corresponds to #uncorrelated #white_noise; an exponent of 1 is #pink_noise
16029690580000000
Another way to detect #pink_noise is to build a graph where the x axis are the #events while the y axis records a #time_series estimation relative to the #standard_deviation from the #average (#mean) time interval.
16029693990000000
At its essence #pink_noise is based on #self_affinity and #self_similarity, so that no matter what scale you look at, the pattern is #similar (#scale_free)
16029694710000000
#power_spectral_analysis describes distribution of #power across #frequency components composing the #signal - for #pink_noise we have a 1/f relationship — few powerful signals with low frequency, a long tail of less powerful ones (of which there are many) (hence #1f_noise)
16029700860000000
#envelope is a smooth #curve outlining the extremes of a #signal and it is also calculated in #hilbert_transform, which, in turn is used in calculating #dfa or #detrended_fluctuation_analysis
16048776010000000
#detrended_fluctuation_analysis (#dfa) has proven particularly useful, revealing that genetic #variation, normal development, or #disease can lead to differences in the #scale_free #amplitude #modulation of oscillations https://www.frontiersin.org/articles/10.3389/fphys.2012.00450/full
16049192880000000
The reason why #chaotic #variation (#pink_noise) is indicative of a #healthy state is because it reflects #winnerless_competition behind the process. If there's a deviation in this dynamics (eg some #patterns), it could mean that one process is #dominating the rest.
16049193600000000
#self_affinity is a property of #fractal #time_series where the small parts of the whole are #similar to the whole
16049202650000000
#self_affinity processes and #self_similar structures have in common that the statistical #distribution of the measured quantity follows a #power_law function, which is the only mathematical function without a characteristic scale. Self-affine and #self_similar phenomena are therefore called "#scale_free.”
16049204010000000
In a #power_law #distribution the #mean is not necessarily the same as the #median (they are closer to each other in a #normal #distribution)
16049204950000000
A #power_law #distribution means that there is a big number of #small #variations and a small number of #big #variations (hence the line with a negative #slope when expressed on #log scales)
16049207830000000
In a #1f #signal the lower #frequency objects have larger #amplitude than the higher #frequency objects (#1f_noise) https://www.frontiersin.org/files/Articles/23105/fphys-03-00450-HTML/image_m/fphys-03-00450-g001.jpg
16049214350000000
e.g. the #frequency of a certain #size of flower is inversely #proportional to its #size.
16049215210000000
a #time_series in which all #frequencies are represented with the same #amplitude lacks the rich variability of a #scale_free #time_series and is referred to as "#white_noise"
16049216050000000
To estimate the #scale_free property we calculate the #standard_deviation (#signal in relation to #mean) over the differently sized #time_windows. If as the #time_windows size increases the #standard_deviation also increases, we're dealing with a #scale_free process. If the #scaling_effect is not there, then it's not a scale free process.
16049220900000000
a stationary #random #fluctuating process has a #signal profile, which is #self_affine with a #scaling_exponent α = 0.5
16049249650000000
when we add #memory in the sense that the #probability of an action depends on the previous actions that the walker has made — we will get a process that will exhibit #self_affinity across scales (#scale_free)
16049250440000000
Different classes of processes with #memory exist: those with #positive_correlation and those with #anti_correlation. Anti-correlations can be seen as a #stabilizing mechanism: a future action is more likely to be opposite to the ones made before. In this case, on longer windows (time scales) we will have lower #fluctuation, so the coefficient will be lower. α from 0 to 0.5: has #memory, #anti_correlation; α = 0.5: #random; α from 0.5 to 1: has #memory and #positive_correlation (previous actions increase the likelihood of that action being taken again) https://www.frontiersin.org/files/Articles/23105/fphys-03-00450-HTML/image_m/fphys-03-00450-g003.jpg
16049257140000000
for #dfa the signal is transformed into the #cumulative_signal, which is then split into several #windows equal in size on the #log scale. The data in each #window is #detrended and its #standard_deviation is calculated. The #fluctuation function is then calculated as the mean #standard_deviation over all the #windows of a given size. We plot that as a graph on #log scales, and the #dfa exponent α is the #slope of the trend: a slope of 0.5 means the #fluctuation grows no faster than for a #random process, while a steeper slope means the process has #memory and is in fact #scale_free (#non_linear)
16049268440000000
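a minimal numpy sketch of the #dfa recipe above (the window sizes and the white-noise test signal are made up):

```python
import numpy as np

def dfa_alpha(signal, window_sizes=(4, 8, 16, 32, 64)):
    """DFA exponent: slope of log(fluctuation) vs log(window size)."""
    profile = np.cumsum(signal - np.mean(signal))  # cumulative signal
    fluctuations = []
    for n in window_sizes:
        rms = []
        for i in range(len(profile) // n):
            segment = profile[i * n:(i + 1) * n]
            x = np.arange(n)
            trend = np.polyval(np.polyfit(x, segment, 1), x)  # linear detrending
            rms.append(np.sqrt(np.mean((segment - trend) ** 2)))
        fluctuations.append(np.mean(rms))
    slope, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), 1)
    return slope

# White noise should give alpha near 0.5.
print(dfa_alpha(np.random.default_rng(0).standard_normal(4096)))
```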
The lower end of the fitting range is at least four samples, because #linear #detrending will perform poorly with fewer points (Peng et al., 1994). For the high end of the fitting range, #dfa estimates for window sizes >10% of the #signal length are more noisy due to a low number of windows available for averaging (i.e., less than 10 windows). Finally, the 50% overlap between windows is commonly used to increase the number of windows, which can provide a more accurate estimate of the fluctuation function especially for the long-time-scale windows.
16049270000000000
A #brown_noise process can be obtained by successively summing data points in the #white_noise process. https://www.researchgate.net/publication/232236967_A_tutorial_introduction_to_adaptive_fractal_analysis/figures?lo=1
16190243450000000
Using the classical #dfa method, the #cumulative_sum of data are divided into segments, and the #variance of these sums is studied as a function of segment length after linearly detrending them in each segment. https://www.nature.com/articles/s41598-019-42732-7
16190250690000000
In #dfa, data are divided into segments of length L and are #linearly detrended. The #square_root of the #variance (called #fluctuation) of the detrended data is studied as a function of L. It can be shown that a #linear relationship between the #logarithm of the #fluctuation and the #logarithm of L is indicative of a #power_law behavior of the spectrum. https://www.nature.com/articles/s41598-019-42732-7
16190255840000000
If a #linear relationship between the length of a #segment or #time_windows and the strength of the #fluctuation (or the #square_root of the #variance of the #cumulative_signal) exists, the slope of the corresponding line is also referred to as #hurst_exponent.
16190283930000000
For #white_noise the #hurst_exponent, i.e. the relation between the #time_windows and the #fluctuation (square root of #variance), will be 0.5: when we double the #time_windows the #variance of the #cumulative_sum doubles, so the #fluctuation grows by a factor of √2.
16190320820000000
For #pink_noise (#1f_noise) the #hurst_exponent will equal #1, meaning that for #time_windows twice as long the #fluctuation doubles (and the #variance of the #cumulative_sum increases about 4 times). In other words, the longer the #time_windows, the more #fluctuation occurs (#positive_correlation).
16190325510000000
the #hurst_exponent in this context is the #alpha_exponent, because we use the #alpha_exponent for #non_stationary processes
16190326350000000
if the #alpha_exponent is more than 1, it means that with every increase of scale (#time_windows) the #cumulative_sum of the #fluctuation increases a lot. That means the longer we look at the process, the more likely it is to have big #fluctuations: there is a tendency in the #short_term to be #small and in the #long_term to be #big.
16190328080000000
the #cumulative_sum of the difference from the #average of a #time_series will be #brown_noise (#random_walk) for the #white_noise
16190339070000000
[accessed_apr_21_2021]
In contrast, #0.5 < #hurst_exponent < #1 indicates a #correlated process for #f_gn or what is termed a #persistent process for #f_bm. In this case, #increases in the signal (for #f_gn) or increments of the signal (for #f_bm) are likely to be followed by further #increases, and #decreases are likely to be followed by #decreases (i.e., a #positive #long_term #correlation). Anti-#persistent and #persistent processes contain #structure that distinguishes them from truly #random sequences of data. (2) (PDF) A tutorial introduction to adaptive fractal analysis. Available from: https://www.researchgate.net/publication/232236967_A_tutorial_introduction_to_adaptive_fractal_analysis [accessed Apr 21 2021].
16190379890000000
The difference between #exponential_decay and #power_law #decay is that #power_law #decay is slower: there are more values with a low #amplitude in the case of the #power_law https://math.stackexchange.com/questions/164436/difference-between-power-law-distribution-and-exponential-decay
16195411410000000
#downsampling (in this context) means #training on a disproportionately low subset of the #majority_class examples.
16209001190000000
#upweighting means adding an example #weight to the downsampled class equal to the factor by which you performed #downsampling.
16209001560000000
#normalizing - transforming #numeric data to the same #scale as other #numeric data.
16209003410000000
#bucketing - transforming #numeric (usually #continuous) #data to #categorical_data.
16209003590000000
#scaling means converting #floating_point #feature #values from their #natural #range (for example, 100 to 900) into a #standard #range - usually 0 to 1
16209005210000000
If your data set contains extreme #outliers, you might try #feature_clipping, which caps all feature #values above (or below) a certain value to a fixed value. https://developers.google.com/machine-learning/data-prep/transform/normalization
16209005990000000
#log #scaling computes the #log of your values to compress a wide #range to a narrow #range. #log_scaling is helpful when a handful of your values have many points, while most other values have few points. This data #distribution is known as the #power_law #distribution. Movie ratings are a good example. In the chart below, most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head). #log_scaling changes the #distribution, helping to improve linear model performance.
16209006980000000
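a minimal numpy sketch of #log_scaling compressing a #power_law-ish spread (the counts are made up):

```python
import numpy as np

rating_counts = np.array([1, 3, 10, 100, 10_000])
print(np.log(rating_counts))  # [0.  1.1 2.3 4.6 9.2] - the wide range is compressed
```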
#z_score is a #variation of #scaling that represents the number of #standard_deviations away from the #mean. You would use z-score to ensure your #feature distributions have #mean = 0 and std = 1. It’s useful when there are a few #outliers, but not so extreme that you need #clipping.
16209007620000000
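a minimal numpy sketch of #z_score #scaling (the values are made up):

```python
import numpy as np

values = np.array([20.0, 30.0, 40.0, 50.0, 200.0])  # 200 is a mild outlier
z_scores = (values - values.mean()) / values.std()
print(z_scores)         # the feature now has mean 0 and std 1
print(z_scores.mean())  # ~0
```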
#transformation of #numeric #features into #categorical #features, using a set of #thresholds, is called #bucketing (or #binning) - creating #buckets
16209010880000000
creating #buckets that each have the same number of points. This technique is called #quantile_bucketing.
16209023430000000
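a minimal numpy sketch of #quantile_bucketing (the values are made up):

```python
import numpy as np

values = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])
edges = np.quantile(values, [0.25, 0.5, 0.75])  # quartile boundaries
bucket_ids = np.digitize(values, edges)         # ~equal number of points per bucket
print(edges, bucket_ids)
```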
when we represent a #categorical #value with a #number it's called a #vocabulary
16209035000000000
#one_hot_encoding represents #numeric #values as #vectors - which can then be further compressed with #sparse_vector
16209035990000000
#grouping #unlabeled #examples is called #clustering. As the examples are #unlabeled, #clustering relies on #unsupervised #machine_learning. If the examples are #labeled, then #clustering becomes #classification.
16209055020000000
#hierarchical_clustering creates a tree of #clusters. #hierarchical #clustering, not surprisingly, is well suited to #hierarchical #data, such as #taxonomies.
16209067630000000
#distribution_based_clustering assumes #data is composed of #distributions, such as #gaussian #distributions, and #clusters them accordingly.
16209068320000000
#density_based_clustering connects areas of high example #density into #clusters. This #clustering allows for arbitrary-shaped #distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying #densities and high #dimensions and also with #outliers.
16209068870000000
#centroid_based_clustering organizes the data into #non_hierarchical clusters, in contrast to the #hierarchical #clustering defined above. #k_means is the most widely-used centroid-based #clustering #algorithm.
16209069360000000
In order to perform #clustering we need to quantify the #similarity between examples by creating the #similarity_metrics for our #dataset
16209070400000000
for #data #processing we need to create #quantiles or use #quantile_bucketing when the #distribution is #poisson, i.e. neither #gaussian nor #power_law.
16209071750000000
when the #distribution is #gaussian we can use #normalizing on our #data
16209072190000000
when the #distribution is #power_law we might want to use #log_scaling #normalizing for our data
16209072510000000
we can do either #manual #similarity or #supervised #similarity. you switch to a #supervised_similarity_measure when you have trouble creating a #manual_similarity_measure.
16209074210000000
#mean_square_error shows the #average squared #loss for an #example
16209075920000000
we can calculate #similarity by calculating a root #mean_square_error over the #features (e.g. size and price): the lower the value, the higher the similarity.
16209117140000000
For #categorical_data we can calculate #similarity using #jaccard_similarity which shows the proportion of intersection between the #sets
16209118550000000
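a minimal sketch of #jaccard_similarity on two made-up item #sets:

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

print(jaccard_similarity({"red", "round", "sweet"}, {"red", "round", "sour"}))  # 0.5
```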
#k_means groups points into #clusters by minimizing the #distances between points and their #cluster’s #centroid (as seen in Figure 1 below). The #centroid of a #cluster is the #mean of all the points in the #cluster.
16209172940000000
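a minimal #k_means sketch, assuming scikit-learn and made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id per point
print(km.cluster_centers_)  # each centroid is the mean of its cluster's points
```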
Instead of comparing manually-combined #feature #data, you can reduce the #feature #data to #representations called #embeddings, and then compare the #embeddings
16209181460000000
#embeddings are generated by training a #supervised deep neural network (#dnn) on the #feature data itself. The #embeddings map the #feature data to a #vector in an #embedding_space. Typically, the #embedding_space has fewer dimensions than the #feature data in a way that captures some #latent #structure of the #feature data set.
16209184210000000
A #dnn that learns #embeddings of #input data by predicting the #input data itself is called an #autoencoder. An #autoencoder is the simplest choice to generate #embeddings. However, an #autoencoder isn't the optimal choice when certain features could be more important than others in determining #similarity.
16209185270000000
Since this #dnn predicts a specific input #feature instead of predicting all input #features, it is called a predictor #dnn
16209185550000000
To train the #dnn, you need to create a #loss_function by following these steps: 1) calculate the #loss for every #output of the #dnn: for #numeric outputs use #mean_square_error, for #categorical use #log_loss, for #multivalent #categorical use #softmax_cross_entropy (#entropy) loss.
16209187000000000
in a #poisson distribution the #decay happens much faster than in the #power_law #distribution: if in #power_law you have a significant number of nodes in the #tail, then in #poisson you only have a few.
16217118970000000
A #similarity measure takes the #embeddings generated by our neural network (#dense_features) and returns a number measuring their #similarity.
16219366980000000
To calculate #similarity we have 3 measures to choose from: #euclidian_distance (subtraction of vectors), #cosine_distance (cosine of the angle between the vectors) and the #dot_product (cosine multiplied by the lengths of both vectors)
16219369490000000
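a minimal numpy sketch of the three measures (the vectors are made up):

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    # cosine * |a| * |b|: grows with vector length (can capture popularity)
    return np.dot(a, b)

a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(euclidean_distance(a, b), cosine_similarity(a, b), dot_product(a, b))
# parallel vectors: cosine similarity is 1.0, the other two depend on length
```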
In contrast to the #cosine_distance, the #dot_product is proportional to the #vector #length. This is important because examples that appear very frequently in the training set (for example, popular YouTube videos) tend to have embedding #vectors with large #lengths. If you want to capture #popularity, then choose #dot_product.
16219370590000000
#cluster #cardinality is the number of examples per #cluster. We are looking for #outliers and if we do find them, this may indicate some interesting #patterns
16219382860000000
Cluster #magnitude is the sum of #distances from all examples to the #centroid of the #cluster. Similar to #cardinality, check how the #magnitude varies across the #clusters, and investigate #anomalies and #outliers.
16219383310000000
Notice that a higher #cluster #cardinality tends to result in a higher #cluster #magnitude, which intuitively makes sense. Clusters are #anomalous when #cardinality doesn't correlate with #magnitude relative to the other #clusters.
16219388730000000
#content_based_filtering Uses #similarity between items to #recommend items similar to what the user likes.
16219394200000000
#collaborative_filtering uses #similarity between #queries and #items simultaneously to provide #recommendations.
16219394430000000
Both #content_based_filtering and #collaborative_filtering map each #item and each #query (or #context) to an #embedding_vector
16219395220000000
#recommendations - We again place our #users in the same #embedding_space to best explain the #feedback_matrix: for each (#user, #item) pair, we would like the #dot_product of the #user #embedding and the #item #embedding to be close to 1 when the #user watched the movie, and to 0 otherwise.
16219406780000000
The #dot_product of the #user_matrix and #item_matrix yields a #recommendation #matrix that contains not only the original user ratings but also #predictions for the movies that each user hasn't seen
16219414650000000
#matrix_factorization: in math, a mechanism for finding the matrices whose #dot_product approximates a #target_matrix.
16219414960000000
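a minimal numpy sketch of #matrix_factorization by plain gradient descent (the feedback matrix, embedding size, learning rate and step count are all made up):

```python
import numpy as np

target = np.array([[1, 0, 1, 0],    # hypothetical user x item feedback matrix
                   [0, 1, 0, 1],    # (1 = watched)
                   [1, 1, 0, 0]], dtype=float)

rng = np.random.default_rng(0)
users = rng.normal(scale=0.1, size=(3, 2))  # user embeddings, 2 latent dimensions
items = rng.normal(scale=0.1, size=(4, 2))  # item embeddings

for _ in range(2000):                # minimize squared error of users @ items.T
    error = users @ items.T - target
    users -= 0.1 * error @ items     # gradient step for the user matrix
    items -= 0.1 * error.T @ users   # gradient step for the item matrix

print(np.round(users @ items.T, 2))  # roughly approximates the target matrix
```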
#generative_adversarial_networks (#gans) are an exciting recent innovation in #machine_learning. #gans are #generative models: they create new #data_instances that resemble your #training_data. For example, #gans can create images that look like photographs of human faces, even though the faces don't belong to any real person.
16219422540000000
#gans achieve this level of realism by pairing a #generator, which learns to produce the target output, with a #discriminator, which learns to distinguish true data from the output of the #generator. The #generator tries to fool the #discriminator, and the #discriminator tries to keep from being fooled.
16219425150000000
#generative" describes a class of statistical models that contrasts with #discriminative models. #generative models can generate new #data #instances. #discriminative models #discriminate between different kinds of #data #instances.
16219446830000000
More formally, given a set of #data #instances X and a set of #labels Y: #generative models capture the joint #probability p(X, Y), or just p(X) if there are no #labels. #discriminative models capture the #conditional #probability p(Y | X).
16219447850000000
The #generator learns to generate #plausible #data. The generated instances become negative #training #examples for the #discriminator.
16219452940000000
The #discriminator learns to distinguish the #generator's #fake #data from #real #data. The #discriminator #penalizes the #generator for producing implausible results.
16219453280000000
Through #backpropagation, the #discriminator's #classification provides a signal that the #generator uses to update its #weights.
16219453920000000
The #discriminator in a #gan is simply a #classifier. It tries to distinguish real #data from the #fake #data created by the #generator.
16220231030000000
The #discriminator connects to two #loss functions. During #discriminator training, the #discriminator ignores the #generator #loss and just uses the #discriminator #loss.
16220237690000000
The #generator part of a #gan learns to create #fake data by incorporating #feedback from the #discriminator. It learns to make the #discriminator #classify its output as real.
16220238440000000
The #generator feeds into the #discriminator net, and the #discriminator produces the output we're trying to affect. The #generator #loss penalizes the #generator for producing a sample that the #discriminator network classifies as #fake.
16220239740000000
#research has suggested that if your #discriminator is too good, then #generator training can fail due to #vanishing_gradients. In effect, an optimal #discriminator doesn't provide enough #information for the #generator to make #progress.
16220244120000000
#wasserstein_loss: The #wasserstein_loss is designed to prevent #vanishing_gradients even when you train the #discriminator to #optimality.
16220244420000000
#convolutional_neural_network (#cnn) could be used to progressively extract higher- and higher-level #representations of the image #content.
16220254470000000
#backpropagation is a process of calculating the #gradient for the #neural_network. it is used to see how to bring the #loss_function to the minimum, in which direction the learning should take place.
16223812730000000
in the process of #learning each #layer of a #neural_network will have #neurons with #weights ascribed to them, which enable #differentiation of various #features
16223813890000000
The #weights of the neurons combine through the #layers using the #activation_function (e.g. #sigmoid or #relu), which then leads to only a certain neuron at the last layer being activated. We might assume the first #layer detects some general #features, the next one more specific ones, and so on, but in practice this is not necessarily the case.
16223816610000000