
Commit

fix: image refs in agnelogalav notes
Co-authored-by: "Angelo Galavotti" <[email protected]>
foxyseta and AngeloGalav committed Oct 25, 2023
1 parent 60573b6 commit dad0f0c
Showing 19 changed files with 140 additions and 140 deletions.
22 changes: 11 additions & 11 deletions appunti/2022-angelo-galavotti-notes/2023-02-21-first-lesson.md
@@ -30,7 +30,7 @@ This form of iteration over data can be understood as a way of _progressive lear

## Gradient descent (again)

![](gradient-descent-example.png)
![](images/gradient-descent-example.png)

- backpropagation is applied

@@ -41,19 +41,19 @@ This form of iteration over data can be understood as a way of _progressive lear
- _supervised learning_: inputs + outputs (labels)
- classification
- regression
![](supervised.png)
![](images/supervised.png)
- _unsupervised learning_: just inputs
- clustering
- component analysis
- anomaly detection
- autoencoding
![](unsupervised.png)
![](images/unsupervised.png)
- _reinforcement learning_: actions and rewards; learning long-term gains; planning. Sometimes we only have local rewards: the purpose is not to optimize the local reward, but the future cumulative reward, since we are not interested only in the current situation, but in all the future evolutions of the agent.
![](reinforcement.png)
![](images/reinforcement.png)

### Classification vs. Regression

![](screenshot-20230221-100242.png)
![](images/screenshot-20230221-100242.png)

### Many different techniques

@@ -62,7 +62,7 @@ This form of iteration over data can be understood as a way of _progressive lear
## Neural Networks

It's a network of artificial neurons:
![](neural-networks.png)
![](images/neural-networks.png)
Each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).

We have an input layer, an output layer, and a set of hidden layers.
@@ -71,7 +71,7 @@ If a network has only one hidden layer, it is called a shallow network, but if

### Artificial neuron

![](artificial-neuron.png)
![](images/artificial-neuron.png)

We have a linear combination of the inputs, which is in turn given to an **activation function** (e.g. a sigmoid).
**Each neuron** implements a _logistic regressor_: $$\sigma(wx + b)$$ Machine learning tells us that logistic regression is a valid technique.
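
As a quick side sketch (not in the original notes), a single artificial neuron of this kind fits in a few lines of NumPy; the inputs, weights, and bias below are made up for illustration:

```
import numpy as np

def sigmoid(z):
    # logistic activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # linear combination of the inputs, followed by the activation function
    return sigmoid(np.dot(w, x) + b)

# toy usage: 3 inputs, arbitrary weights and bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, w, b))  # a single scalar output in (0, 1)
```
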
@@ -83,22 +83,22 @@ We use a linear combination since we want to keep each node simple: computing th
#### Different activation functions

Each _activation function_ is responsible for _threshold triggering_.
![](activation-functions.png)
![](images/activation-functions.png)

Activation functions introduce non-linearity: without them, a stack of layers would collapse into a single linear map.
The sigmoid function is a smooth approximation of a binary threshold function.

### The real, cortical neuron

![](real-neuron.png)
![](images/real-neuron.png)
The neuron has a dendritic tree, which ends in the synapses. Each dendritic tree is a set of weighted inputs, which are combined. When a triggering threshold is exceeded, the axon hillock generates an impulse that gets transmitted through the axon to other neurons.

- Otherwise, if the sum is below the threshold, no impulse is generated.

### Comparisons

- ANN vs real neuron
![](neuron-comparison.png)
![](images/neuron-comparison.png)

- The human brain has a huge number of neurons (on the order of $10^{11}$).
- The human brain is not so fast, since transmission is not purely electrical but relies on chemical reactions: the switching time is about 0.001 s.
@@ -147,6 +147,6 @@ The kernel is shifted (moved to the next position, slided to the next position o

- This operation is called **convolution**.

![](convolutional-layer.png)
![](images/convolutional-layer.png)

## Parameters and hyper-parameters
2 changes: 1 addition & 1 deletion appunti/2022-angelo-galavotti-notes/2023-02-22-intro-2.md
@@ -37,7 +37,7 @@ In case of deep learning, we have the input, the features, then more complex fea


Here is a picture of the relation between the areas:
![](relation-between-areas.png)
![](images/relation-between-areas.png)

# Diving into DL
\[..\]
@@ -8,6 +8,6 @@

## Generative modeling
Train a _generator_ able to sample original data similar to those in the training set, implicitly learning the _distribution of data_.
![](generator.png)
![](images/generator.png)
- the randomness of the generator is provided by a _random seed_ (noise) received as input.
-
@@ -6,49 +6,49 @@ TODO:
This lesson is focused on what we can compute with a NN.

Suppose we have a single layer NN.
![](perceptron.png)
![](images/perceptron.png)

For the moment, let's suppose that we have a binary threshold function as an activation function:
![](perceptron-formula.png)
![](images/perceptron-formula.png)
The bias allows us to _fix_ the _threshold_ that we're _interested_ in.

## Hyperplane
The set of points:
![](simple-equation.png)
![](images/simple-equation.png)
defines a hyperplane in the space of the variables $x_i$.
![](line-example.png)
![](images/line-example.png)

The _hyperplane_ divides the space in _two parts_:
- to one of them (above the line) the perceptron gives value 1,
- to the other (below the line) value 0.

### NN logical connections
![](nand.png)
![](images/nand.png)
and the answer is...
![](linear-perceptron-nand.png)
![](images/linear-perceptron-nand.png)
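
As a side note (not necessarily the weights shown in the figure), one classical choice for NAND is $w_1 = w_2 = -2$ with bias $b = 3$; a minimal sketch:

```
import numpy as np

def perceptron(x, w, b):
    # binary threshold activation: 1 if the linear combination is positive, else 0
    return int(np.dot(w, x) + b > 0)

# NAND with a single perceptron (one classical choice of weights)
w, b = np.array([-2, -2]), 3
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))  # prints 1, 1, 1, 0
```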

But we _can't_ represent _every_ boolean function with a single linear perceptron (e.g. XOR).

Can we recognize these patterns with a perceptron (aka binary threshold)?
![](pixels-lp.png)
![](images/pixels-lp.png)
__No__: for a perceptron, each pixel should individually contribute to the classification, and that is not the case here (more in the next slides).
A task that requires considering more than one pixel at a time is not a linear task.

Let us e.g. consider the first pixel, and suppose it is black (the white case is symmetric).
![](pixels-lp-2.png)
![](images/pixels-lp-2.png)
Does this improve our knowledge for the purposes of classification?
No: we still have the same probability of seeing a good or a bad example.

##### MNIST Example
Can we address digit recognition with linear tools (perceptrons, logistic regression, ...)?
When we want to use a linear technique for learning, we have to ask ourselves: is each one of the features informative by itself, or should we consider them in a particular context? ![](digits.png)
When we want to use a linear technique for learning, we have to ask ourselves: is each one of the features informative by itself, or should we consider them in a particular context? ![](images/digits.png)

## Multi-layer perceptrons
- we know we can _compute nand_ with a perceptron
- we know that nand is logically complete (i.e. we can compute any connective with nands)
- so: why are perceptrons not complete?
- answer: because we need _to compose them_ and consider _multi-layer perceptrons_ (see the sketch after the figure below).
![](xor-perceptrons.png)
![](images/xor-perceptrons.png)
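
Here is a rough sketch of that composition, building XOR out of NAND perceptrons (this particular decomposition is just one of several possible ones):

```
import numpy as np

def nand(a, b):
    # single perceptron computing NAND (same weights as in the sketch above)
    return int(np.dot([-2, -2], [a, b]) + 3 > 0)

def xor(a, b):
    # XOR obtained by composing NAND perceptrons: a multi-layer network
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), xor(a, b))  # prints 0, 1, 1, 0
```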

So... since shallow networks are already complete, why go for _deep networks_?
With deep nets, the same function may be computed with _fewer neural units_ (Cohen, et al.)
@@ -16,7 +16,7 @@ This is akin to reinforcement learning, but is _very inefficient_ and has a high


Instead of making a random adjustment of the parameters, we try to predict in which direction to move them.
![](reducing-loss.png)
![](images/reducing-loss.png)
$w$ is the current value of the parameter, and we should move it left (in this example) to decrease the loss.
We also have to compute how far we have to move to decrease the loss.

@@ -27,7 +27,7 @@ Obviously, the mathematical tool we need to find the movement direction are _der

##### Why binary threshold is no good for learning
The derivative is 0 everywhere (and infinite at the jump).
![](threshold-function.png)
![](images/threshold-function.png)


## The gradient
@@ -36,7 +36,7 @@ If we have many parameters, we have a different derivative for each of them (the

With multiple parameters, the magnitude of the partial derivatives becomes relevant, since it governs the orientation of the gradient.

![](nabla-stuff.png)
![](images/nabla-stuff.png)

The gradient points in the direction of _steepest __ascent___.

@@ -69,7 +69,7 @@ We want to fit a line through a set of points $\langle x_i, y_i \rangle$.
- Loss: $L = \frac{1}{2} \sum_i (y_i - (ax_i + b))^2$
- $\dfrac{\partial L}{\partial a} = - \sum_i (y_i - (ax_i + b))x_i$
- $\dfrac{\partial L}{\partial b} = - \sum_i (y_i - (ax_i + b))$
![](line-fit-demo.png)
![](images/line-fit-demo.png)
The previous problem is a linear optimization problem that can be easily solved analytically. Why take a different approach?
- the analytic solution only works in the linear case, and for fixed error functions
- usually, it is not compatible with regularizers
- the gradient-descent/backpropagation method can be generalized to multi-layer non-linear networks
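
For illustration (not from the notes), a minimal gradient-descent sketch for this line-fitting problem; the synthetic data, learning rate, and number of epochs are made up:

```
import numpy as np

# toy data: noisy points around y = 2x + 1
rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 50)
ys = 2 * xs + 1 + 0.1 * rng.standard_normal(xs.shape)

a, b = 0.0, 0.0   # initial parameters
lr = 0.1          # learning rate

for epoch in range(1000):
    err = ys - (a * xs + b)        # residuals
    grad_a = -np.sum(err * xs)     # dL/da
    grad_b = -np.sum(err)          # dL/db
    a -= lr * grad_a / len(xs)     # averaged gradient step
    b -= lr * grad_b / len(xs)

print(a, b)  # should approach 2 and 1
```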

#### Some notes on gradient descent
@@ -95,7 +95,7 @@ Each answer to these questions returns a new _optimizer_ definition.


### Online vs Batch learning
![](online-vs-batch-learning.png)
![](images/online-vs-batch-learning.png)
- Full batch (all training samples): the gradient points in the direction of steepest descent on the error surface (perpendicular to the contour lines of the error surface).
- Online (one sample at a time): the gradient zig-zags around the direction of steepest descent.
- Minibatch (random subset of training samples): a good compromise.
@@ -108,7 +108,7 @@ We make less precise updates more frequently.

## Momentum
If, during consecutive training steps, the gradient keeps following a stable direction, we can increase its magnitude, simulating the fact that it is acquiring momentum along that direction, _similarly to a ball rolling down a surface_.
![](momentum.png)
![](images/momentum.png)
The hope is to reduce the risk of getting stuck in a local minimum or on a plateau.
There's no theoretical justification.
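
A rough sketch of the momentum update rule (the learning rate and momentum coefficient are arbitrary choices, not taken from the notes):

```
def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # classic momentum: accumulate a decaying sum of past gradients
    v = mu * v - lr * grad
    w = w + v
    return w, v

# toy usage on the 1D quadratic loss L(w) = w**2, whose gradient is 2*w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
print(w)  # approaches the minimum at w = 0
```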

@@ -18,7 +18,7 @@ $$
h'(x) = f'(g(x)) * g'(x) = f'(y) * g'(x) = \dfrac{df}{dy} * \dfrac{dg}{dx}
$$
Given a multivariable function $f(x, y)$ and two single-variable functions $x(t)$ and $y(t)$:
![](composition-of-derivatives.png)
![](images/composition-of-derivatives.png)
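
For reference, the standard multivariable chain rule that the figure illustrates is:

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$
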
In vector notation: let ...

$a^l = \sigma(b^l + w^l \cdot x^l)$
@@ -18,7 +18,7 @@ Overfitting essentially tells us to keep in mind that the training data we have
> Deep models are good at fitting, but the real goal is generalization
> - With NN models, we see mainly overfitting problems.
![](overfitting.png)
![](images/overfitting.png)

## Ways to reduce overfitting
- __Collect more data:__
@@ -111,17 +111,17 @@ The output computed, usually, is a probability distribution.
As we know, the __loss function__ measures the _difference_ between the _actual output_ of the model and the _ground truth_. The problem is: what loss function should we use for _comparing probability distributions_?
- We could treat them as “normal functions”, and use e.g. _quadratic distance_ between true and predicted probabilities.
- Can we do better? For instance, in logistic regression we do not use mean squared error, but use negative log-likelihood. Why?
![](comparing-lossess.png)
![](images/comparing-lossess.png)
Probability distributions can be compared according to many different metrics. There are two main techniques:
- you consider their _difference_ $P - Q$ (e.g. the Wasserstein distance, which measures the amount of “work” needed to reshape one distribution into the other)
- you consider their _ratio_ $P/Q$ (e.g. __Kullback Leibler divergence__)

### Kullback-Leibler divergence
The __Kullback-Leibler divergence__ $D_{KL}(P||Q)$ between two distributions $P$ and $Q$ is a _measure of the information loss due to approximating $P$ with $Q$_.
![](dkl.png)
![](images/dkl.png)

We call __Cross-Entropy__ between $P$ and $Q$ the quantity:
![](x-entropy.png)
![](images/x-entropy.png)

Since, given the training data, their entropy $H(P)$ is constant, minimizing $D_{KL}(P||Q)$ is equivalent to minimizing the cross-entropy $H(P, Q)$ between $P$ and $Q$.
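
A quick numeric check of this relation, with two made-up discrete distributions (assuming strictly positive probabilities, so all logarithms are defined):

```
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q): information lost when approximating P with Q
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    # H(P, Q) = H(P) + D_KL(P || Q)
    return -np.sum(p * np.log(q))

def entropy(p):
    return -np.sum(p * np.log(p))

# two toy discrete distributions over 3 classes
p = np.array([0.7, 0.2, 0.1])   # "ground truth"
q = np.array([0.5, 0.3, 0.2])   # model prediction

print(kl_divergence(p, q))               # >= 0, zero only if p == q
print(cross_entropy(p, q) - entropy(p))  # equals the KL divergence
```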

@@ -133,7 +133,7 @@ A learning objective can be the _minimization_ of the Kullback-Leibler divergence

### Cross-Entropy and Log-likelihood
To better understand, let us consider the case of binary classification.
![](cross-entropy-log-likelihood.png)
![](images/cross-entropy-log-likelihood.png)

If $x = 1$, then the probability $P(y = 1 | x)$ is 1.

@@ -7,7 +7,7 @@ To determine the neighbourhood of each pixel, we usually just select a kernel of

## Filters and convolutions
We have a grid of weights (a kernel, or filter), which we then slide over the input.
![](convolution.png)
![](images/convolution.png)

As we've said:
- the activation of a neuron is not influenced by all the neurons of the previous layer, but only by a _small subset of adjacent neurons_: its __receptive field__.
@@ -33,7 +33,7 @@ $$
We are essentially computing $\dfrac{y_1 - y_0}{x_1 - x_0}$.

Usually, $h = 1$ pixel (since we can't take 0), and neglecting the constant 1/2 we compute with the following filter: $$[-1 \ 0 \ 1]$$ This kernel is quite interesting in image processing: it approximates the derivative of the image intensity (in a given direction) at a specific position. It can be applied both _horizontally_ and _vertically_.
![](finite-central-example.png)
![](images/finite-central-example.png)
From the input image we extract the visible contours, using different orientations of the kernel.

In general, the kernel is a _pattern_ of the image _that we are interested in_: we look for this pattern all over the input. We can have many, possibly complex, patterns.
@@ -58,7 +58,7 @@ array([[-1., 0., 1.],
```

And the result is:
![](result-cnn-1.png)
![](images/result-cnn-1.png)
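
For illustration, a hedged sketch of applying this horizontal derivative kernel to a toy image; SciPy is assumed here, although the notes do not prescribe a specific library:

```
import numpy as np
from scipy.signal import convolve2d

# horizontal derivative kernel (finite central difference, constant 1/2 dropped)
kernel = np.array([[-1.0, 0.0, 1.0]])

# toy grayscale "image": a dark region next to a bright region
image = np.zeros((5, 8))
image[:, 4:] = 1.0

# convolve the kernel over the image; 'valid' keeps only fully-overlapping positions
edges = convolve2d(image, kernel, mode="valid")
print(edges)  # non-zero responses only where the intensity changes (the vertical edge)
```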

A kernel that would shift the image looks something like this:
```
@@ -24,14 +24,14 @@ So, each CNN layer usually has 4 dimensions:
What do we do when our layer has more than 1 channel? In this case, ==the kernel operates on all the channels together==.
For this reason, a 1x1 kernel operates just across the channels.

![](layer-configuration-params.png)
![](images/layer-configuration-params.png)

##### Reducing channels
If in a layer we reduce the number of channels, e.g. from 64 to 32, our CNN behaves somewhat like PCA: it compresses the 64 channels into 32, preserving the information that is important to us.

Unless stated differently (e.g. in separable convolutions), ==_a filter operates on all input channels in parallel_==. So, if the input layer has depth $D$, and the kernel size is $N \times M$, the actual dimension of the filter will be $N \times M \times D$.
The convolution kernel is tasked with _simultaneously mapping cross-channel correlations and spatial correlations_.
![](3d-cnn.png)
![](images/3d-cnn.png)

### Dimension of the output
The spatial dimension of each output feature map depends on the spatial dimension of the input, the kernel size, the padding, and the stride. Along _each axis_, the dimension of the output is given by the formula:
@@ -48,7 +48,7 @@ where:
- The width of the input (gray) is $W$=7.
- The kernel has dimension $K$=3 with fixed weights \[1, 0, −1\]
- Padding is zero
![](example-cnn.png)
![](images/example-cnn.png)
- In the first case, the stride is $S=1$. We get $(W − K)/S + 1 = 5$ output values.
- In the second case, the stride is $S=2$. We get $(W − K)/S + 1 = 3$ output values.
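
The formula (collapsed in this diff) is presumably the standard $(W - K + 2P)/S + 1$, with input size $W$, kernel size $K$, padding $P$, and stride $S$; a tiny helper (not from the notes) reproduces the example above:

```
def conv_output_size(w, k, s=1, p=0):
    # spatial output size along one axis: (W - K + 2P) / S + 1
    return (w - k + 2 * p) // s + 1

# the example above: W = 7, K = 3, no padding
print(conv_output_size(7, 3, s=1))  # 5
print(conv_output_size(7, 3, s=2))  # 3
```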

@@ -68,7 +68,7 @@ This is also true in Keras.
- The __receptive field__ of a (deep, hidden) neuron is the _dimension of the input_ region _influencing_ it.
- It is equal to the dimension of an input image producing (without padding) an output with dimension 1.
- ==A neuron cannot see anything outside its receptive field!==
![](receptive-field.png)
![](images/receptive-field.png)

This notion is only meaningful in CNNs, since in normal, dense NNs each neuron is connected to all the neurons of the previous layer.
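
As a side sketch (not from the notes), the receptive field of the final layer can be computed from kernel sizes and strides with the usual recurrence $r \leftarrow r + (k - 1)j$, $j \leftarrow j \cdot s$; the layer stack below is made up:

```
def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, from input to output
    r, j = 1, 1  # receptive field and cumulative stride ("jump") at the input
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# e.g. three 3x3 convolutions with stride 1, then a 2x2 pooling with stride 2
print(receptive_field([(3, 1), (3, 1), (3, 1), (2, 2)]))  # 8
```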

@@ -86,7 +86,7 @@ Another way is applying a __pooling operation__.
In deep convolutional networks, it is common practice to alternate convolutional layers with _pooling layers_, where each neuron simply takes the _mean_ or _maximal value_ in its _receptive field_. This has a double advantage:
- it reduces the dimension of the output
- it gives some tolerance to translations:
![](pooling.png)
![](images/pooling.png)
While the mean-pooling operation can be applied through a convolution (same kernel as the blurring kernel), the _max-pooling_ can't be expressed through a convolution.

Usually, when downsampling, we also _double_ the channel dimension, otherwise the reduction would be too drastic.
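
A minimal NumPy sketch of 2x2 max pooling on a single-channel input (for illustration only):

```
import numpy as np

def max_pool_2x2(x):
    # non-overlapping 2x2 max pooling; odd trailing rows/columns are dropped
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]
```
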
@@ -101,23 +101,23 @@ So, which is the more expressive method? We cannot say, the two methods are simp
## Famous CNNs

#### AlexNet
![](alexnet.png)
![](images/alexnet.png)
This one is interesting since it shows that we don't need a long stack of dense layers: at the end of this network there are just 2 dense layers, and they are enough to extract complex features.

This net also uses a very big kernel (11x11), while nowadays the standard is basically a 3x3 kernel.

#### VGG
![](vgg.png)
![](images/vgg.png)

## Inception modules
Inception modules re-combine intermediate results and try to re-synthesize new features.
![](inception-module.png)
![](images/inception-module.png)
As we can see from this image, we basically apply several different operations in parallel, whose outputs are then stacked together along the channel dimension (through concatenation).
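
A hedged Keras-style sketch of such a module (the branch widths and the TensorFlow/Keras usage are illustrative; the notes do not prescribe a specific configuration):

```
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, pool_proj=32):
    # 1x1 convolution branch
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 reduction followed by 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # 1x1 reduction followed by 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # 3x3 max pooling followed by 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    # stack the branches along the channel dimension
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(32, 32, 192))
outputs = inception_module(inputs)
print(outputs.shape)  # (None, 32, 32, 256): 64 + 128 + 32 + 32 channels
```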

### Inception hypothesis and Depthwise Separable Convolution
Remember that normal convolutional kernels are 3D, _simultaneously_ mapping _cross-channel correlations_ and _spatial correlations_. It can be better to decouple them, looking independently for cross-channel correlations (via 1 × 1 convolutions) and for spatial correlations (via 2D convolutions).
This operation is illustrated in the following picture:
![](inception-hypotesis.png)
![](images/inception-hypotesis.png)
In Depthwise Separable Convolutions, we have one kernel for each channel, which is applied separately. Each kernel produces a single output map, so the number of feature maps in input is the same as the number of feature maps in output.
We then reduce the number of feature maps (to $C_{out}$) through a single pointwise (1x1) convolution.
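
To get a feeling for why this decoupling pays off, here is a rough parameter count comparison (biases ignored, sizes made up):

```
def conv_params(k, c_in, c_out):
    # standard convolution: one K x K x C_in kernel per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise: one K x K kernel per input channel, then a 1x1 pointwise convolution
    return k * k * c_in + c_in * c_out

# e.g. a 3x3 convolution mapping 64 channels to 128
print(conv_params(3, 64, 128))            # 73728
print(separable_conv_params(3, 64, 128))  # 8768
```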

