Method where you train the program by feeding the learning algorithm a mapping of inputs to correct outputs.
Regression is curve fitting: learn a continuous mapping from inputs to outputs.
Outputs are discrete variables (category labels). Learn a decision boundary that separates one class from the other. Generally, a confidence is also desired, i.e., how sure we are that the input belongs to the chosen category.
The training set is a set of pairs $(\vec{X_i}, y_i)$, where $\vec{X_i}$ is an input vector and $y_i$ the correct output.
The loss function for a model $f$ measures how far its outputs are from the correct ones over the training set, e.g. the squared error $L(f) = ∑\limits_i {(f(\vec{X_i}) - y_i)}^2$.
Perceptron is the trivial neural network. The model for a parameter vector $\vec{W}$ is:

$f(\vec{X}) = \text{sign}(\vec{W} ⋅ \vec{X})$

where $\vec{X}$ is the input extended with a constant bias component. If $\vec{W} ⋅ \vec{X} > 0$, the output is positive; otherwise, it is negative.
The learning algorithm of the Perceptron is quite simple. The learning rate $r$ scales each update:
- Start with random weights, then show each sample in sequence, repeatedly.
- If the output is correct, do nothing.
- If the produced output is negative, and the correct output is positive, increase the weights.
- If the produced output is positive, and the correct output is negative, decrease the weights.
- The amount to increase/decrease is given by the current sample scaled by the learning rate.
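A minimal sketch of this procedure in Python (NumPy), assuming inputs already extended with a leading bias component and labels in $\{-1, +1\}$; names and defaults are illustrative:

#+begin_src python
import numpy as np

def train_perceptron(X, y, r=0.1, epochs=100):
    """Perceptron learning; X rows carry a leading bias of 1, y in {-1, +1}."""
    w = np.random.randn(X.shape[1]) * 0.01     # start with random weights
    for _ in range(epochs):                    # show each sample repetitively
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:          # only act on incorrect outputs
                w += r * yi * xi               # current sample scaled by the rate
    return w

# Usage: the AND function is linearly separable, so the weights converge.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
print(np.sign(X @ w))   # matches y once converged
#+end_src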
The in-sample error $E_\text{in}(f)$ for a model $f$ is its average error over the training set. This function is known and calculable.
The out-of-sample error $E_\text{out}(f)$ for a model $f$ is its expected error over the whole input distribution. This function is not known, and can only be approximated.
A good approximation of $E_\text{out}(f)$ is the error on a test (or validation) sample.
Given a model $f$, the in-sample error $E_\text{in}(f)$ serves as a proxy for $E_\text{out}(f)$. Notably, $E_\text{in}(f)$ and $E_\text{out}(f)$ deviate as the model's capacity grows relative to the number of training samples.
During the learning algorithm, always conserve the weights that produce the lowest error.
This has a disadvantage: It memorizes the training set.
Each layer in the tree consists of an attribute that splits the data into subsets that
are ideally disjoint.
The entropy of the subsets produced is a measure of how disjoint they are: lower entropy means purer subsets.
For a set containing $p$ positive and $n$ negative samples, the entropy is:

$H(S) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$
A given attribute $A$ splits the set into one subset $S_v$ per value $v$ of $A$.
The expected entropy remaining after applying $A$ is the weighted average of the subsets' entropies:

$H_A(S) = ∑\limits_v \frac{|S_v|}{|S|} ⋅ H(S_v)$
The information gain, i.e. the reduction in entropy for $A$, is $\text{Gain}(A) = H(S) - H_A(S)$; the attribute with the highest gain is chosen at each layer.
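A small Python sketch of these three quantities, assuming discrete attribute values and class labels:

#+begin_src python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum of p_c * log2(p_c) over the classes present in S."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    """Gain(A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Usage: an attribute that splits the classes perfectly gains the full H(S).
labels = ['yes', 'yes', 'no', 'no']
print(information_gain(labels, ['a', 'a', 'b', 'b']))  # -> 1.0
#+end_src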
The capacity is a measure of when the training error is a good approximation for the test error.
Bias is the error due to the fact that the set of functions does not contain the
target function.
Variance is the error due to the fact that if we had been using another training set
drawn from the same distribution, we would have obtained another function.
Regularization is a method for minimizing the training error, as long as it is still a good approximation for the test error, trading off accuracy for simplicity.
Using the sigmoid $σ$ as the activation function, and the squared-error loss function:

$L(\vec{W}) = ∑\limits_i^m {(σ(\vec{W} ⋅ \vec{X_i}) - y_i)}^2$

To find the direction in which the weights minimize $L(\vec{W})$, compute the gradient $∇ L(\vec{W}) = ∑\limits_i^m Δ Ψ$, where the delta rule gives $Δ = (σ(\vec{W} ⋅ \vec{X_i}) - y_i) ⋅ σ'(\vec{W} ⋅ \vec{X_i})$ with $Ψ = \vec{X_i}$, and the slope of the logistic is $σ'(z) = σ(z) ⋅ (1 - σ(z))$.
@@latex:\newpage@@
The learning rate $r$ determines the size of each step:
1. Start with random weights.
2. Compute $∇ L(\vec{W})$.
3. Update the weights: $\vec{W} ← \vec{W} - r ⋅ ∇ L(\vec{W}) = \vec{W} - r ⋅ ∑\limits_i^m Δ Ψ$.
4. Repeat steps 2 and 3 until $\vec{W}$ doesn't change anymore (variation below $10^{-5}$).
After each iteration, check $L(\vec{W})$:
- If $L(\vec{W})$ is converging, the learning rate is correct.
- If $L(\vec{W})$ is diverging, the learning rate is too large.
- If $L(\vec{W})$ is converging slowly, the learning rate is too small.
Also, the algorithm needs feature scaling, so that all dimensions contribute comparably to the gradient.
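A compact sketch of batch gradient descent for a single sigmoid unit with the squared-error loss above, assuming targets in $[0, 1]$ and pre-scaled inputs; the defaults and stopping threshold are illustrative:

#+begin_src python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, r=0.5, tol=1e-5, max_iters=10000):
    """Batch GD for one sigmoid unit with squared-error loss; y in [0, 1]."""
    w = np.random.randn(X.shape[1]) * 0.01
    for _ in range(max_iters):
        out = sigmoid(X @ w)
        delta = (out - y) * out * (1 - out)   # delta rule with the logistic slope
        grad = X.T @ delta                    # sum of delta * X_i over all samples
        w_new = w - r * grad                  # W <- W - r * grad L(W)
        if np.abs(w_new - w).max() < tol:     # stop once W barely changes
            return w_new
        w = w_new
    return w
#+end_src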
Instead of inspecting the whole dataset to find the direction that minimizes $L(\vec{W})$, stochastic gradient descent updates the weights after each individual sample:
1. Randomly shuffle the training set.
2. Start with random weights.
3. For each sample $(\vec{X_i}, y_i)$: $\vec{W} ← \vec{W} - r ⋅ Δ Ψ$.
4. Repeat step 3 until $\vec{W}$ doesn't change anymore (variation below $10^{-5}$).
Convergence is not so obvious. After each bulk of iterations, e.g. 1000, check $L(\vec{W})$:
- If $L(\vec{W})$ is converging, the learning rate is correct.
- If $L(\vec{W})$ is diverging, the learning rate is too large.
- If $L(\vec{W})$ is converging slowly, the learning rate is too small.
While GD uses all samples in each iteration, SGD uses only one. A possible middle ground is to use a mini-batch of samples in each iteration, where the batch size trades off gradient accuracy against update frequency.
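A sketch of the mini-batch variant for the same sigmoid unit (with `batch_size=1` it reduces to plain SGD); the epoch count and defaults are illustrative:

#+begin_src python
import numpy as np

def minibatch_sgd(X, y, r=0.5, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent; batch_size=1 reduces to plain SGD."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))        # randomly shuffle the training set
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            out = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
            delta = (out - y[idx]) * out * (1 - out)
            w -= r * (X[idx].T @ delta)        # update from this batch only
    return w
#+end_src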
To prevent large weights, the norm of the weights is added to the loss function: $L(\vec{W}) + λ ⋅ {\|\vec{W}\|}^2$, where $λ$ controls the strength of the regularization.
Another way to improve is to prevent overfitting:
- Separate the data into training and validation sets.
- Minimize $L(\vec{W})$ on the training set, stopping when $L(\vec{W})$ on the validation set stops improving.
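A sketch of that early-stopping loop; `train_step` (one minimization step on the training set) and `loss_on` (the validation loss) are hypothetical callbacks, and the patience counter is an illustrative choice:

#+begin_src python
import numpy as np

def early_stopping(train_step, loss_on, w, patience=10):
    """Keep training until the validation loss stops improving."""
    best_w, best_val = w.copy(), float('inf')
    bad_rounds = 0
    while bad_rounds < patience:
        w = train_step(w)              # hypothetical: one step on the training set
        val = loss_on(w)               # hypothetical: L(W) on the validation set
        if val < best_val:
            best_w, best_val, bad_rounds = w.copy(), val, 0
        else:
            bad_rounds += 1            # validation stopped improving
    return best_w
#+end_src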
This approach introduces one or more hidden layers in the network, each with one or
more neurons.
The model for a hidden layer's neuron $i$ is the same as the Perceptron's: $z_i = σ(\vec{W_i} ⋅ \vec{X})$. The aggregation of the outputs of the layer defines the input for the neurons in the next layer.
In practice, the layer's weights are aggregated in a matrix, performing the calculation in a single pass.
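For example, with one weight matrix per layer (one row per neuron), a full layer is one matrix product:

#+begin_src python
import numpy as np

def layer_forward(W, x):
    """One layer in a single pass: each row of W holds one neuron's weights."""
    return 1.0 / (1.0 + np.exp(-(W @ x)))

# Usage: 3 inputs -> 4 hidden neurons -> 2 outputs.
x = np.array([1.0, 0.5, -0.2])
W1 = np.random.randn(4, 3)   # hidden layer
W2 = np.random.randn(2, 4)   # output layer
output = layer_forward(W2, layer_forward(W1, x))
#+end_src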
One implication is that the number of neurons in the hidden layers is directly
proportional to the model’s complexity.
- Start with random weights.
- For each sample, calculate the model, and if the result is incorrect:
  a. Calculate local gradients for each neuron.
     For the neuron $l$ in the last layer $k$: $δ_{k,l} = (z_{k,l} - y_l) ⋅ z_{k,l} ⋅ (1 - z_{k,l})$.
     For the hidden neurons, let $i^+$ range over the attached neurons in the next layer: $δ_{h,i} = z_{h,i} ⋅ (1 - z_{h,i}) ⋅ ∑\limits_{i^+} δ_{h+1,i^+} ⋅ w_{h+1,i^+,i}$.
  b. Update the weights with the delta rule.
     Let $w_{h,i,j}^+$ be the updated weight, $w_{h,i,j}$ the current weight, and $w_{h,i,j}^-$ the previous weight: $w_{h,i,j}^+ = w_{h,i,j} - r ⋅ δ_{h,i} ⋅ z_{h-1,j} + γ ⋅ (w_{h,i,j} - w_{h,i,j}^-)$, where $γ$ is the momentum, a constant defined to prevent local optima.
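A minimal sketch of the momentum update from step (b), keeping the previous weights around; `grad` stands in for the local gradient times the neuron's input:

#+begin_src python
def momentum_update(w, w_prev, grad, r=0.1, gamma=0.9):
    """Delta-rule step plus momentum: gamma scales the previous change."""
    w_new = w - r * grad + gamma * (w - w_prev)
    return w_new, w   # the current weights become the next call's previous ones
#+end_src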
The VC dimension of a model is the highest number of samples for which it can solve any learning problem. Therefore, the VC dimension is an estimate of the capacity of a model. The VC dimension for a model $f$ is denoted $\text{VC}(f)$; for example, a linear model in $d$ dimensions has $\text{VC}(f) = d + 1$.
To reduce the test error:
- Keep the training error low.
- Minimize $\text{VC}(f)$.
By limiting the data to a sphere, we can place a bound on the VC dimension.
Let $ρ$ be the margin, i.e. the distance between the separating hyperplane and the closest samples; the bound on the VC dimension shrinks as $ρ$ grows. Therefore, by maximizing $ρ$, we minimize $\text{VC}(f)$, and consequently the test error.
A kernel allows one to map the entries to a higher dimensional feature space, possibly
allowing simpler ways to delimit such entries.
One example is the polynomial kernel: $K(\vec{X}, \vec{Y}) = {(\vec{X} ⋅ \vec{Y} + 1)}^d$.
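A sketch of this kernel; it evaluates a dot product in the higher-dimensional feature space without ever constructing that space:

#+begin_src python
import numpy as np

def polynomial_kernel(x, y, degree=2):
    """K(X, Y) = (X . Y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** degree

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y))   # (0.5 - 2 + 1)^2 = 0.25
#+end_src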
- Linear SVMs are similar to a Perceptron, but with an optimal cost function.
- If a Kernel is used, then SVMs are comparable to 2-layer neural networks.
- A 3-layer neural network might correspond to an ensemble of multiple Kernel SVMs.
Assuming conditional independence between the input dimensions, the probability of the target can be approximated using Bayes' theorem:

$P(y \mid \vec{X}) ∝ P(y) ⋅ ∏\limits_j P(x_j \mid y)$
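A sketch of the resulting decision rule in log space (to avoid underflow); the `priors` and `likelihoods` structures are hypothetical stand-ins for tables estimated by counting on the training set:

#+begin_src python
import numpy as np

def naive_bayes_predict(x, priors, likelihoods):
    """Pick the class maximizing P(y) * prod_j P(x_j | y), in log space.

    `priors` maps class -> P(y); `likelihoods` maps class -> one dict per
    feature with P(x_j = v | y). Both are hypothetical tables.
    """
    best_class, best_score = None, -np.inf
    for c, prior in priors.items():
        score = np.log(prior) + sum(
            np.log(likelihoods[c][j].get(v, 1e-9))  # tiny floor for unseen values
            for j, v in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
#+end_src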
@@latex:\newpage@@
Ensemble learning consists in combining several simple models to form a more complex model.
- Bagging:
  - Each model is trained with a different dataset.
- Boosting:
  - Same dataset, but instrumented for each model to mitigate the weaknesses of the others.
Boosting is the technique of combining simple models iteratively to create a complex
model.
Each model is intentionally biased to avoid the errors of the previous model.
One simple method of boosting is additive boosting. Considering binary classifiers $f_i(\vec{X}) ∈ \{-1, +1\}$, the model is defined as a weighted vote:

$F(\vec{X}) = \text{sign}\left(∑\limits_i α_i ⋅ f_i(\vec{X})\right)$
The adaptive boosting (AdaBoost) algorithm is an additive algorithm with associated importances: each classifier's importance is $α_i = \frac{1}{2} \ln \frac{1 - ε_i}{ε_i}$, where $ε_i$ is its weighted training error, and the sample weights are updated so that the next classifier focuses on the samples the current one misclassifies.
The AdaBoost algorithm is always based on very simple models, usually decision stumps. As a consequence, it is resistant to overfitting.
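A compact AdaBoost sketch over threshold stumps, assuming labels in $\{-1, +1\}$; the exhaustive stump search is the simplest (not the fastest) choice:

#+begin_src python
import numpy as np

def adaboost(X, y, rounds=10):
    """Adaptive boosting over threshold stumps; y in {-1, +1}."""
    d = np.full(len(y), 1.0 / len(y))            # sample weights
    ensemble = []                                # (importance, feature, threshold, sign)
    for _ in range(rounds):
        best = None                              # exhaustively pick the best stump
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] >= t, 1, -1)
                    err = d[pred != y].sum()     # weighted error
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = np.clip(err, 1e-9, 1 - 1e-9)
        alpha = 0.5 * np.log((1 - err) / err)    # the stump's importance
        pred = s * np.where(X[:, j] >= t, 1, -1)
        d *= np.exp(-alpha * y * pred)           # re-focus on the mistakes
        d /= d.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] >= t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)
#+end_src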
Bootstrap aggregation is the technique of combining models trained on subsets of the training dataset. The subsets are constructed by uniformly sampling the dataset, and may intersect. Small subsets prevent the base models from overfitting, and therefore bagging circumvents variance in the data.
The models may be combined using many techniques:
- Majority voting.
- Averaging probabilities.
- Averaging estimates.
- Etc.
In practice, the base models are usually decision trees.
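A sketch of the two bagging ingredients, bootstrap sampling and majority voting, assuming class labels are non-negative integers; the model count and subset size are illustrative:

#+begin_src python
import numpy as np

def bootstrap_subsets(X, y, n_models=10, subset_size=None, seed=0):
    """One uniformly sampled subset per base model; subsets may intersect."""
    rng = np.random.default_rng(seed)
    size = subset_size or len(X)
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=size)  # sampling with replacement
        yield X[idx], y[idx]

def majority_vote(predictions):
    """Combine base-model outputs (non-negative integer labels) by voting."""
    predictions = np.asarray(predictions)         # one row per model
    return np.array([np.bincount(col).argmax() for col in predictions.T])
#+end_src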
Random forests exploit randomness in both instances and features. Each decision tree is trained with a random subset of features and instances. As a consequence, random forests circumvent the overfitting typical of individual decision trees.
Semi-supervised learning is the method of combining supervised and unsupervised learning, usually when there are small amounts of labeled data, and large amounts of unlabeled data.
Active learning is a technique to create optimal training sets, by filtering samples
with redundant information.
The technique is commonly used in two situations:
- Semi-supervised:
- use active learning to obtain a small optimal subset of samples to label manually.
- Supervised:
- use active learning to balance classes, e.g. for classifying anomalies, where there are few positive samples.
@@latex:\vspace{5px}@@ There are many methods for selecting samples:
- Uncertainty sampling:
- Select the samples for which the model is least certain about the correct output.
- Committee:
- Train a committee of models on the labeled data. Select the unlabeled samples for which the committee disagrees most.
- Expected model change:
  - Select the samples with the greatest impact on the model.
- Expected error reduction:
  - Select the samples with the greatest impact on the model's generalization error.
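As an illustration of the first strategy above (uncertainty sampling) for a binary classifier, where `probabilities` holds the model's predicted probability of the positive class for each unlabeled sample:

#+begin_src python
import numpy as np

def uncertainty_sampling(probabilities, n_queries=10):
    """Indices of the samples whose positive-class probability is closest
    to 0.5, i.e. where a binary model is least certain."""
    uncertainty = -np.abs(np.asarray(probabilities) - 0.5)
    return np.argsort(uncertainty)[-n_queries:]   # the least certain samples
#+end_src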
Unsupervised learning consists of finding a pattern given only inputs as training data:
- Clusters
- Manifolds
- Embeddings
- Etc.
Some common distance functions are:
- Nearest neighbor:
- $min({|x - y|}^2)$
- Furthest neighbor:
- $max({|x - y|}^2)$
- Centroid:
- ${|μ_i - μ_j|}^2$
The hierarchical agglomerative clustering technique constructs a dendrogram based on a distance function. Starting with individual clusters, it iteratively merges the closest ones until the dendrogram is complete. Finally, a cut across the dendrogram corresponds to a similarity threshold.
The K-Means algorithm constructs clusters by placing centroids and agglomerating by the
closest centroid.
Considering $k$ randomly placed centroids, two steps are repeated until convergence:
- Each cluster is defined by the points that are closest to the corresponding centroid.
- Each centroid is updated with the mean of the points in its cluster.
This technique can be interpreted as optimizing a loss function: the sum of squared distances from each point to its centroid, $L = ∑\limits_i {\|\vec{X_i} - μ_{c(i)}\|}^2$, where $c(i)$ is the cluster assigned to $\vec{X_i}$.
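A sketch of the two alternating steps (Lloyd's algorithm), with plain random initialization; the iteration cap is illustrative:

#+begin_src python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(iters):
        # Assignment: each point joins the cluster of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of the points in its cluster.
        new_centroids = np.array([
            X[clusters == c].mean(axis=0) if np.any(clusters == c) else centroids[c]
            for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, clusters
#+end_src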
Different initial centroids may lead to different final losses, motivating different initialization methods:
- Random:
- May choose nearby points
- Distance based:
- Limits the search space, not enough randomness.
- Random and distance based:
- Choose far points randomly.
The choice of $k$ is a trade-off:
- Small $k$:
- loose clustering.
- Large $k$:
  - every point becomes a group by itself.
To prevent an overly large $k$, a penalty on the number of clusters can be added to the loss: $L + λ ⋅ k$, where $λ$ controls the strength of the penalty.
The mixture model constructs clusters by assigning probability distributions and agglomerating by the most probable distribution. Therefore, it differs from K-Means by allowing overlapping clusters.
As in the K-Means algorithm, the distributions are randomly initialized, and updated
using an expectation-maximization algorithm.
Method where you train the program by rewarding the learning algorithm positively or negatively according to the produced results. This method is similar to how we teach animals. @@latex:\pagebreak@@
Deep learning is suitable for problems where the features are not well defined:
- Images
- Text
- Audio
Convolutional neural networks are composed of two sets of layers:
- Alternate layers of:
- Convolution:
- A mathematical method to identify correlation, extracting features from the input.
- Pooling:
- A method to reduce dimensionality without losing information. @@latex:\vspace{3pt}@@
- A multilayer perceptron, i.e. a neural network.
The convolution step is parameterized by:
- Depth:
- the number of filters
- Stride:
- the slide step
- Zero-padding:
  - whether to pad the borders with zeros, so that patterns at the borders can be detected.
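Together, these parameters determine the size of the resulting feature maps; for a square input and filter:

#+begin_src python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size of the feature map for an n×n input and an f×f filter."""
    return (n + 2 * padding - f) // stride + 1

# A 32×32 input with 5×5 filters, stride 1, no zero-padding -> 28×28 maps;
# the output depth equals the number of filters.
print(conv_output_size(32, 5))   # -> 28
#+end_src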
As in convolutional networks, the key technique is how to extract features from the
input.
To learn how to represent data, a network composed of two sets of layers is used to
predict the next word in the text:
- The input layer.
- A multilayer perceptron, i.e. a neural network.
For training, each word in the vocabulary is initially modeled as a random vector.
By allowing training of the input layer itself, the network learns a meaningful representation of the inputs for the given dataset. Therefore, after training, the network has produced not only the prediction model, but also a feature model in the first layer. This feature model can be used as the input layer for any other network, including networks with other targets.
The autoencoder is a neural network with one hidden layer that implements a lossy compression machine. By compressing and decompressing the input, it removes irrelevant data due to the loss. Such a network can be trained as if it were supervised, since the expected output is the input itself.
Because the hidden layer has a smaller dimension than the input, it can be used with the input layer as a feature model that reduces the dimensionality.
Generative adversarial networks aim to train a generative network using a discriminative network. Both networks compete in a zero-sum game, which converges when neither player changes its output given the opponent's output.
The error function for the generator is whether the discriminator missed.
GANs are hard to train, as each network can underpower the other. If the discriminator is too good, it will be so confident that the generator will struggle to read the gradient. If the generator is too good, it will persistently exploit weaknesses in the discriminator that lead to false negatives. This may be mitigated by the nets’ respective learning rates. Also, each network must be trained against a static adversary, so it’ll have enough rounds to actually learn something.
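A sketch of the alternating schedule this implies; `generator`, `discriminator`, `data`, and their `train_step`/`sample` methods are hypothetical placeholders, not a real library API:

#+begin_src python
def train_gan(generator, discriminator, data, rounds=1000, d_steps=5, g_steps=5):
    """Alternating schedule: each net trains against a temporarily static adversary."""
    for _ in range(rounds):
        for _ in range(d_steps):
            # Discriminator learns to separate real samples from generated ones.
            discriminator.train_step(real=data.sample(), fake=generator.sample())
        for _ in range(g_steps):
            # Generator's error is whether the discriminator caught its fakes.
            generator.train_step(critic=discriminator)
    return generator
#+end_src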