Method where you train the program by feeding the learning algorithm a mapping of inputs to correct outputs.
Regression is curve fitting: learn a continuous mapping from inputs to outputs.
Outputs are discrete variables (category labels). Learn a decision boundary that separates one class from the other. Generally, a confidence is also desired, i.e., how sure we are that the input belongs to the chosen category.
The training set is a set of pairs $(\vec{X_i}, y_i)$, where $\vec{X_i}$ is an input vector and $y_i$ the correct output.
The loss function for a model $f$ measures how far its outputs are from the correct ones over the training set, e.g. the squared error $L(f) = ∑\limits_i {(f(\vec{X_i}) - y_i)}^2$.
Perceptron is the trivial neural network. The model for a parameter vector $\vec{W}$ is:

$f(\vec{X}) = \text{sign}(\vec{W} ⋅ \vec{X})$

where $\vec{X}$ is the input extended with a constant bias component. If $\vec{W} ⋅ \vec{X} > 0$, the output is positive; otherwise, it is negative.
The learning algorithm of the Perceptron is quite simple. The learning rate $r$ scales each update:
- Start with random weights, then show each sample in sequence, repeatedly.
- If the output is correct, do nothing.
- If the produced output is negative, and the correct output is positive, increase the weights.
- If the produced output is positive, and the correct output is negative, decrease the weights.
- The amount to increase/decrease is given by the current sample scaled by the learning rate.
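A minimal sketch of this procedure in Python (NumPy), assuming inputs already extended with a leading bias component and labels in $\{-1, +1\}$; names and defaults are illustrative:

#+begin_src python
import numpy as np

def train_perceptron(X, y, r=0.1, epochs=100):
    """Perceptron learning; X rows carry a leading bias of 1, y in {-1, +1}."""
    w = np.random.randn(X.shape[1]) * 0.01     # start with random weights
    for _ in range(epochs):                    # show each sample repetitively
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:          # only act on incorrect outputs
                w += r * yi * xi               # current sample scaled by the rate
    return w

# Usage: the AND function is linearly separable, so the weights converge.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
print(np.sign(X @ w))   # matches y once converged
#+end_src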
The in-sample error $E_\text{in}(f)$ for a model $f$ is its average error over the training set. This function is known and calculable.
The out-of-sample error $E_\text{out}(f)$ for a model $f$ is its expected error over the whole input distribution. This function is not known, and can only be approximated.
A good approximation of $E_\text{out}(f)$ is the error on a test (or validation) sample.
Given a model $f$, the in-sample error $E_\text{in}(f)$ serves as a proxy for $E_\text{out}(f)$. Notably, $E_\text{in}(f)$ and $E_\text{out}(f)$ deviate as the model's capacity grows relative to the number of training samples.
During the learning algorithm, always conserve the weights that produce the lowest error.
This has a disadvantage: It memorizes the training set.
Each layer in the tree consists of an attribute that splits the data into subsets that
are ideally disjoint.
The entropy of the subsets produced is a measure of how disjoint they are: lower entropy means purer subsets.
For a set containing $p$ positive and $n$ negative samples, the entropy is:

$H(S) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$
A given attribute $A$ splits the set into one subset $S_v$ per value $v$ of $A$.
The expected entropy remaining after applying $A$ is the weighted average of the subsets' entropies:

$H_A(S) = ∑\limits_v \frac{|S_v|}{|S|} ⋅ H(S_v)$
The information gain, i.e. the reduction in entropy for $A$, is $\text{Gain}(A) = H(S) - H_A(S)$; the attribute with the highest gain is chosen at each layer.
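A small Python sketch of these three quantities, assuming discrete attribute values and class labels:

#+begin_src python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum of p_c * log2(p_c) over the classes present in S."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    """Gain(A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Usage: an attribute that splits the classes perfectly gains the full H(S).
labels = ['yes', 'yes', 'no', 'no']
print(information_gain(labels, ['a', 'a', 'b', 'b']))  # -> 1.0
#+end_src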
The capacity is a measure of when the training error is a good approximation for the test error.
Bias is the error due to the fact that the set of functions does not contain the
target function.
Variance is the error due to the fact that if we had been using another training set
drawn from the same distribution, we would have obtained another function.
Regularization is a method for minimizing the training error, as long as it is still a good approximation for the test error, trading off accuracy for simplicity.
Using the sigmoid $σ$ as the activation function, and the squared-error loss function:

$L(\vec{W}) = ∑\limits_i^m {(σ(\vec{W} ⋅ \vec{X_i}) - y_i)}^2$

To find the direction in which the weights minimize $L(\vec{W})$, compute the gradient $∇ L(\vec{W}) = ∑\limits_i^m Δ Ψ$, where the delta rule gives $Δ = (σ(\vec{W} ⋅ \vec{X_i}) - y_i) ⋅ σ'(\vec{W} ⋅ \vec{X_i})$ with $Ψ = \vec{X_i}$, and the slope of the logistic is $σ'(z) = σ(z) ⋅ (1 - σ(z))$.
@@latex:\newpage@@
The learning rate $r$ determines the size of each step:
1. Start with random weights.
2. Compute $∇ L(\vec{W})$.
3. Update the weights: $\vec{W} ← \vec{W} - r ⋅ ∇ L(\vec{W}) = \vec{W} - r ⋅ ∑\limits_i^m Δ Ψ$.
4. Repeat steps 2 and 3 until $\vec{W}$ doesn't change anymore (variation below $10^{-5}$).
After each iteration, check $L(\vec{W})$:
- If $L(\vec{W})$ is converging, the learning rate is correct.
- If $L(\vec{W})$ is diverging, the learning rate is too large.
- If $L(\vec{W})$ is converging slowly, the learning rate is too small.
Also, the algorithm needs feature scaling, so that all dimensions contribute comparably to the gradient.
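A compact sketch of batch gradient descent for a single sigmoid unit with the squared-error loss above, assuming targets in $[0, 1]$ and pre-scaled inputs; the defaults and stopping threshold are illustrative:

#+begin_src python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, r=0.5, tol=1e-5, max_iters=10000):
    """Batch GD for one sigmoid unit with squared-error loss; y in [0, 1]."""
    w = np.random.randn(X.shape[1]) * 0.01
    for _ in range(max_iters):
        out = sigmoid(X @ w)
        delta = (out - y) * out * (1 - out)   # delta rule with the logistic slope
        grad = X.T @ delta                    # sum of delta * X_i over all samples
        w_new = w - r * grad                  # W <- W - r * grad L(W)
        if np.abs(w_new - w).max() < tol:     # stop once W barely changes
            return w_new
        w = w_new
    return w
#+end_src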
Instead of inspecting the whole dataset to find the direction that minimizes $L(\vec{W})$, stochastic gradient descent updates the weights after each individual sample:
1. Randomly shuffle the training set.
2. Start with random weights.
3. For each sample $(\vec{X_i}, y_i)$: $\vec{W} ← \vec{W} - r ⋅ Δ Ψ$.
4. Repeat step 3 until $\vec{W}$ doesn't change anymore (variation below $10^{-5}$).
Convergence is not so obvious. After each bulk of iterations, e.g. 1000, check $L(\vec{W})$:
- If $L(\vec{W})$ is converging, the learning rate is correct.
- If $L(\vec{W})$ is diverging, the learning rate is too large.
- If $L(\vec{W})$ is converging slowly, the learning rate is too small.
While GD uses all samples in each iteration, SGD uses only one. A possible middle ground is to use a mini-batch of samples in each iteration, where the batch size trades off gradient accuracy against update frequency.
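A sketch of the mini-batch variant for the same sigmoid unit (with `batch_size=1` it reduces to plain SGD); the epoch count and defaults are illustrative:

#+begin_src python
import numpy as np

def minibatch_sgd(X, y, r=0.5, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent; batch_size=1 reduces to plain SGD."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))        # randomly shuffle the training set
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            out = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
            delta = (out - y[idx]) * out * (1 - out)
            w -= r * (X[idx].T @ delta)        # update from this batch only
    return w
#+end_src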
To prevent large weights, the norm of the weights is added to the loss function: $L(\vec{W}) + λ ⋅ {\|\vec{W}\|}^2$, where $λ$ controls the strength of the regularization.
Another way to improve is to prevent overfitting:
- Separate the data into training and validation sets.
- Minimize $L(\vec{W})$ on the training set, stopping when $L(\vec{W})$ on the validation set stops improving.
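A sketch of that early-stopping loop; `train_step` (one minimization step on the training set) and `loss_on` (the validation loss) are hypothetical callbacks, and the patience counter is an illustrative choice:

#+begin_src python
import numpy as np

def early_stopping(train_step, loss_on, w, patience=10):
    """Keep training until the validation loss stops improving."""
    best_w, best_val = w.copy(), float('inf')
    bad_rounds = 0
    while bad_rounds < patience:
        w = train_step(w)              # hypothetical: one step on the training set
        val = loss_on(w)               # hypothetical: L(W) on the validation set
        if val < best_val:
            best_w, best_val, bad_rounds = w.copy(), val, 0
        else:
            bad_rounds += 1            # validation stopped improving
    return best_w
#+end_src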
This approach introduces one or more hidden layers in the network, each with one or
more neurons.
The model for a hidden layer's neuron $i$ is the same as the Perceptron's: $z_i = σ(\vec{W_i} ⋅ \vec{X})$. The aggregation of the outputs of the layer defines the input for the neurons in the next layer.
In practice, the layer's weights are aggregated in a matrix, performing the calculation in a single pass.
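For example, with one weight matrix per layer (one row per neuron), a full layer is one matrix product:

#+begin_src python
import numpy as np

def layer_forward(W, x):
    """One layer in a single pass: each row of W holds one neuron's weights."""
    return 1.0 / (1.0 + np.exp(-(W @ x)))

# Usage: 3 inputs -> 4 hidden neurons -> 2 outputs.
x = np.array([1.0, 0.5, -0.2])
W1 = np.random.randn(4, 3)   # hidden layer
W2 = np.random.randn(2, 4)   # output layer
output = layer_forward(W2, layer_forward(W1, x))
#+end_src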
One implication is that the number of neurons in the hidden layers is directly
proportional to the model’s complexity.
- Start with random weights.
- For each sample, calculate the model, and if the result is incorrect:
  a. Calculate local gradients for each neuron.
     For the neuron $l$ in the last layer $k$: $δ_{k,l} = (z_{k,l} - y_l) ⋅ z_{k,l} ⋅ (1 - z_{k,l})$.
     For the hidden neurons, let $i^+$ range over the attached neurons in the next layer: $δ_{h,i} = z_{h,i} ⋅ (1 - z_{h,i}) ⋅ ∑\limits_{i^+} δ_{h+1,i^+} ⋅ w_{h+1,i^+,i}$.
  b. Update the weights with the delta rule.
     Let $w_{h,i,j}^+$ be the updated weight, $w_{h,i,j}$ the current weight, and $w_{h,i,j}^-$ the previous weight: $w_{h,i,j}^+ = w_{h,i,j} - r ⋅ δ_{h,i} ⋅ z_{h-1,j} + γ ⋅ (w_{h,i,j} - w_{h,i,j}^-)$, where $γ$ is the momentum, a constant defined to prevent local optima.
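A minimal sketch of the momentum update from step (b), keeping the previous weights around; `grad` stands in for the local gradient times the neuron's input:

#+begin_src python
def momentum_update(w, w_prev, grad, r=0.1, gamma=0.9):
    """Delta-rule step plus momentum: gamma scales the previous change."""
    w_new = w - r * grad + gamma * (w - w_prev)
    return w_new, w   # the current weights become the next call's previous ones
#+end_src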
The VC dimension of a model is the highest number of samples for which it can solve any learning problem. Therefore, the VC dimension is an estimate of the capacity of a model. The VC dimension for a model $f$ is denoted $\text{VC}(f)$; for example, a linear model in $d$ dimensions has $\text{VC}(f) = d + 1$.
To reduce the test error:
- Keep the training error low.
- Minimize $\text{VC}(f)$.
By limiting the data to a sphere, we can place a bound on the VC dimension.
Let $ρ$ be the margin, i.e. the distance between the separating hyperplane and the closest samples; the bound on the VC dimension shrinks as $ρ$ grows. Therefore, by maximizing $ρ$, we minimize $\text{VC}(f)$, and consequently the test error.
A kernel allows one to map the entries to a higher dimensional feature space, possibly
allowing simpler ways to delimit such entries.
One example is the polynomial kernel: $K(\vec{X}, \vec{Y}) = {(\vec{X} ⋅ \vec{Y} + 1)}^d$.
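A sketch of this kernel; it evaluates a dot product in the higher-dimensional feature space without ever constructing that space:

#+begin_src python
import numpy as np

def polynomial_kernel(x, y, degree=2):
    """K(X, Y) = (X . Y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** degree

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y))   # (0.5 - 2 + 1)^2 = 0.25
#+end_src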
- Linear SVMs are similar to a Perceptron, but with an optimal cost function.
- If a Kernel is used, then SVMs are comparable to 2-layer neural networks.
- A 3-layer neural network might correspond to an ensemble of multiple Kernel SVMs.
Assuming conditional independence between the input dimensions, the probability of the target can be approximated using Bayes' theorem:

$P(y \mid \vec{X}) ∝ P(y) ⋅ ∏\limits_j P(x_j \mid y)$
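A sketch of the resulting decision rule in log space (to avoid underflow); the `priors` and `likelihoods` structures are hypothetical stand-ins for tables estimated by counting on the training set:

#+begin_src python
import numpy as np

def naive_bayes_predict(x, priors, likelihoods):
    """Pick the class maximizing P(y) * prod_j P(x_j | y), in log space.

    `priors` maps class -> P(y); `likelihoods` maps class -> one dict per
    feature with P(x_j = v | y). Both are hypothetical tables.
    """
    best_class, best_score = None, -np.inf
    for c, prior in priors.items():
        score = np.log(prior) + sum(
            np.log(likelihoods[c][j].get(v, 1e-9))  # tiny floor for unseen values
            for j, v in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
#+end_src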
@@latex:\newpage@@
Ensemble learning consists in combining several simple models to form a more complex model.
- Bagging:
  - Each model is trained with a different dataset.
- Boosting:
  - Same dataset, but instrumented for each model to mitigate the weaknesses of the others.
Boosting is the technique of combining simple models iteratively to create a complex
model.
Each model is intentionally biased to avoid the errors of the previous model.
One simple method of boosting is additive boosting. Considering binary classifiers $f_i(\vec{X}) ∈ \{-1, +1\}$, the model is defined as a weighted vote:

$F(\vec{X}) = \text{sign}\left(∑\limits_i α_i ⋅ f_i(\vec{X})\right)$
The adaptive boosting (AdaBoost) algorithm is an additive algorithm with associated importances: each classifier's importance is $α_i = \frac{1}{2} \ln \frac{1 - ε_i}{ε_i}$, where $ε_i$ is its weighted training error, and the sample weights are updated so that the next classifier focuses on the samples the current one misclassifies.
The AdaBoost algorithm is always based on very simple models, usually decision stumps. As a consequence, it is resistant to overfitting.
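A compact AdaBoost sketch over threshold stumps, assuming labels in $\{-1, +1\}$; the exhaustive stump search is the simplest (not the fastest) choice:

#+begin_src python
import numpy as np

def adaboost(X, y, rounds=10):
    """Adaptive boosting over threshold stumps; y in {-1, +1}."""
    d = np.full(len(y), 1.0 / len(y))            # sample weights
    ensemble = []                                # (importance, feature, threshold, sign)
    for _ in range(rounds):
        best = None                              # exhaustively pick the best stump
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] >= t, 1, -1)
                    err = d[pred != y].sum()     # weighted error
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = np.clip(err, 1e-9, 1 - 1e-9)
        alpha = 0.5 * np.log((1 - err) / err)    # the stump's importance
        pred = s * np.where(X[:, j] >= t, 1, -1)
        d *= np.exp(-alpha * y * pred)           # re-focus on the mistakes
        d /= d.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] >= t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)
#+end_src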
Bootstrap aggregation is the technique of combining models trained on subsets of the training dataset. The subsets are constructed by uniformly sampling the dataset, and may intersect. Small subsets prevent the base models from overfitting, and therefore bagging circumvents variance in the data.
The models may be combined using many techniques:
- Majority voting.
- Averaging probabilities.
- Averaging estimates.
- Etc.
In practice, the base models are usually decision trees.
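A sketch of the two bagging ingredients, bootstrap sampling and majority voting, assuming class labels are non-negative integers; the model count and subset size are illustrative:

#+begin_src python
import numpy as np

def bootstrap_subsets(X, y, n_models=10, subset_size=None, seed=0):
    """One uniformly sampled subset per base model; subsets may intersect."""
    rng = np.random.default_rng(seed)
    size = subset_size or len(X)
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=size)  # sampling with replacement
        yield X[idx], y[idx]

def majority_vote(predictions):
    """Combine base-model outputs (non-negative integer labels) by voting."""
    predictions = np.asarray(predictions)         # one row per model
    return np.array([np.bincount(col).argmax() for col in predictions.T])
#+end_src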
Random forests exploit randomness in both instances and features. Each decision tree is trained with a random subset of features and instances. As a consequence, random forests circumvent the overfitting typical of individual decision trees.
Semi-supervised learning is the method of combining supervised and unsupervised learning, usually when there are small amounts of labeled data, and large amounts of unlabeled data.
Active learning is a technique to create optimal training sets, by filtering samples
with redundant information.
The technique is commonly used in two situations:
- Semi-supervised:
- use active learning to obtain a small optimal subset of samples to label manually.
- Supervised:
- use active learning to balance classes, e.g. for classifying anomalies, where there are few positive samples.
@@latex:\vspace{5px}@@ There are many methods for selecting samples:
- Uncertainty sampling:
- Select the samples for which the model is least certain about the correct output.
- Committee:
- Train a committee of models on the labeled data. Select the unlabeled samples for which the committee disagrees most.
- Expected model change:
  - Select the samples with the greatest impact on the model.
- Expected error reduction:
  - Select the samples with the greatest impact on the model's generalization error.
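As an illustration of the first strategy above (uncertainty sampling) for a binary classifier, where `probabilities` holds the model's predicted probability of the positive class for each unlabeled sample:

#+begin_src python
import numpy as np

def uncertainty_sampling(probabilities, n_queries=10):
    """Indices of the samples whose positive-class probability is closest
    to 0.5, i.e. where a binary model is least certain."""
    uncertainty = -np.abs(np.asarray(probabilities) - 0.5)
    return np.argsort(uncertainty)[-n_queries:]   # the least certain samples
#+end_src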
Unsupervised learning consists of finding a pattern given only inputs as training data:
- Clusters
- Manifolds
- Embeddings
- Etc.
Some common distance functions are:
- Nearest neighbor:
- $min({|x - y|}^2)$
- Furthest neighbor:
- $max({|x - y|}^2)$
- Centroid:
- ${|μ_i - μ_j|}^2$
The hierarchical agglomerative clustering technique constructs a dendrogram based on a distance function. Starting with individual clusters, it iteratively merges the closest ones until the dendrogram is complete. Finally, a cut across the dendrogram corresponds to a similarity threshold.
The K-Means algorithm constructs clusters by placing centroids and agglomerating by the
closest centroid.
Considering $k$ randomly placed centroids, two steps are repeated until convergence:
- Each cluster is defined by the points that are closest to the corresponding centroid.
- Each centroid is updated with the mean of the points in its cluster.
This technique can be interpreted as optimizing a loss function: the sum of squared distances from each point to its centroid, $L = ∑\limits_i {\|\vec{X_i} - μ_{c(i)}\|}^2$, where $c(i)$ is the cluster assigned to $\vec{X_i}$.
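A sketch of the two alternating steps (Lloyd's algorithm), with plain random initialization; the iteration cap is illustrative:

#+begin_src python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(iters):
        # Assignment: each point joins the cluster of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of the points in its cluster.
        new_centroids = np.array([
            X[clusters == c].mean(axis=0) if np.any(clusters == c) else centroids[c]
            for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, clusters
#+end_src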
Different initial centroids may lead to different final losses, motivating different initialization methods:
- Random:
- May choose nearby points
- Distance based:
- Limits the search space, not enough randomness.
- Random and distance based:
- Choose far points randomly.
The choice of $k$ is a trade-off:
- Small $k$:
- loose clustering.
- Large $k$:
  - every point becomes a group by itself.
To prevent an overly large $k$, a penalty on the number of clusters can be added to the loss: $L + λ ⋅ k$, where $λ$ controls the strength of the penalty.
The mixture model constructs clusters by assigning probability distributions and agglomerating by the most probable distribution. Therefore, it differs from K-Means by allowing overlapping clusters.
As in the K-Means algorithm, the distributions are randomly initialized, and updated
using an expectation-maximization algorithm.
Method where you train the program by rewarding the learning algorithm positively or negatively according to the produced results. This method is similar to how we teach animals. @@latex:\pagebreak@@
Deep learning is suitable for problems where the features are not well defined:
- Images
- Text
- Audio
Convolutional neural networks are composed of two sets of layers:
- Alternate layers of:
- Convolution:
- A mathematical method to identify correlation, extracting features from the input.
- Pooling:
- A method to reduce dimensionality without losing information. @@latex:\vspace{3pt}@@
- A multilayer perceptron, i.e. a neural network.
The convolution step is parameterized by:
- Depth:
- the number of filters
- Stride:
- the slide step
- Zero-padding:
  - whether to pad the borders with zeros, so that patterns at the borders can be detected.
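Together, these parameters determine the size of the resulting feature maps; for a square input and filter:

#+begin_src python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size of the feature map for an n×n input and an f×f filter."""
    return (n + 2 * padding - f) // stride + 1

# A 32×32 input with 5×5 filters, stride 1, no zero-padding -> 28×28 maps;
# the output depth equals the number of filters.
print(conv_output_size(32, 5))   # -> 28
#+end_src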
As in convolutional networks, the key technique is how to extract features from the
input.
To learn how to represent data, a network composed of two sets of layers is used to
predict the next word in the text:
- The input layer.
- A multilayer perceptron, i.e. a neural network.
For training, each word in the vocabulary is initially modeled as a random vector.
By allowing training of the input layer itself, the network learns a meaningful representation of the inputs for the given dataset. Therefore, after training, the network has produced not only the prediction model, but also a feature model in the first layer. This feature model can be used as the input layer for any other network, including networks with other targets.
The autoencoder is a neural network with one hidden layer that implements a lossy compression machine. By compressing and decompressing the input, it removes irrelevant data due to the loss. Such a network can be trained as if it were supervised, since the expected output is the input itself.
Because the hidden layer has a smaller dimension than the input, it can be used with the input layer as a feature model that reduces the dimensionality.
Generative adversarial networks aim to train a generative network using a discriminative network. Both networks compete in a zero-sum game, which converges when neither player changes its output given the opponent's output.
The error function for the generator is whether the discriminator missed.
GANs are hard to train, as each network can underpower the other. If the discriminator is too good, it will be so confident that the generator will struggle to read the gradient. If the generator is too good, it will persistently exploit weaknesses in the discriminator that lead to false negatives. This may be mitigated by the nets’ respective learning rates. Also, each network must be trained against a static adversary, so it’ll have enough rounds to actually learn something.
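A sketch of the alternating schedule this implies; `generator`, `discriminator`, `data`, and their `train_step`/`sample` methods are hypothetical placeholders, not a real library API:

#+begin_src python
def train_gan(generator, discriminator, data, rounds=1000, d_steps=5, g_steps=5):
    """Alternating schedule: each net trains against a temporarily static adversary."""
    for _ in range(rounds):
        for _ in range(d_steps):
            # Discriminator learns to separate real samples from generated ones.
            discriminator.train_step(real=data.sample(), fake=generator.sample())
        for _ in range(g_steps):
            # Generator's error is whether the discriminator caught its fakes.
            generator.train_step(critic=discriminator)
    return generator
#+end_src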