Back Prop question

How is back prop used to determine the optimal values for weight and bias parameters?

What is the algorithm?

Hi @ai_is_cool,
Backpropagation computes the gradients of a loss function with respect to each parameter (weights and biases), which are then used to update those parameters in a way that minimizes the loss.
Using the chain rule, the partial derivatives of the loss with respect to each parameter are computed. These derivatives (gradients) tell us how to change each parameter to reduce the loss. Each parameter is updated by subtracting a small fraction (learning rate, \alpha) of its gradient.
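For concreteness, here is a minimal sketch of that update step in NumPy, assuming the gradients dw and db have already been produced by backprop (all values here are made up purely for illustration):

```python
import numpy as np

# Minimal gradient-descent update, assuming the gradients dw, db have
# already been computed by backpropagation (values are hypothetical).
alpha = 0.01                   # learning rate
w = np.array([0.5, -1.2])      # current weights (made up)
b = 0.3                        # current bias (made up)
dw = np.array([0.1, -0.05])    # dJ/dw from backprop (made up)
db = 0.02                      # dJ/db from backprop (made up)

w = w - alpha * dw             # step against the gradient
b = b - alpha * db
```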

Don’t you mean the gradients of the cost function J(w, b)?

So starting with values for w^{[n]}, b^{[n]}, X^{(i)} and y^{(i)}; 1 \le n \le N

Step 1. J(w, b) is computed by forward prop, and its gradients with respect to w_j and b are computed by back prop - correct?

Step 2. The gradient descent algorithm can then be used to update w_j and b using estimates of \frac{\partial J(w, b)}{\partial w_j} and \frac{\partial J(w, b)}{\partial b}.

Then the first step above can be repeated, and gradient descent applied again.

Optimal values for the weight and bias parameters can then be determined by repeating the above two steps for a given number of epochs - 1. Correct?

Or am I wrong?

How are the parameters updated to optimal values for each layer in a 3 layer network like the MNIST handwritten digit example?

First, let’s agree on the notation. Let’s say we have a neural network with L layers, each layer has a weight matrix W^{[l]} and a bias vector b^{[l]}, 1 \le l \le L. We denote matrices with capital letters and vectors with lowercase letters. Each hidden layer l < L uses a non-linearity g (it can be ReLU, sigmoid, tanh, etc.). The inputs of the network are a vector x^{(i)} and a scalar y^{(i)}; X is a full / mini batch of training examples. The cost function J depends on W^{[l]}, b^{[l]} \forall l. We use {\mathcal L^{(i)}} to denote the loss for the i-th training example.
You are right, in the first step we compute J by forward prop.
Next we compute {\partial J \over \partial W^{[l]}} and {\partial J \over \partial b^{[l]}} for l = L, L-1, \dots, 1. The parameter update is part of the optimization algorithm, such as SGD, Adam, etc. All of the above steps are repeated for a fixed number of epochs, or until we reach a target value of the loss / accuracy.
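To make the loop concrete, here is a runnable toy sketch using the simplest possible case, logistic regression on synthetic data, rather than a deep network; the data, learning rate and epoch count are arbitrary choices of mine:

```python
import numpy as np

# Toy end-to-end loop: logistic regression (a 1-layer "network")
# trained with the forward -> backward -> update cycle described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # 100 examples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # synthetic labels

w, b, alpha = np.zeros(3), 0.0, 0.1
for epoch in range(200):
    z = X @ w + b
    a = 1.0 / (1.0 + np.exp(-z))           # forward prop: predictions
    J = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
    dz = (a - y) / len(y)                  # back prop for sigmoid + cross-entropy
    dw, db = X.T @ dz, dz.sum()            # dJ/dw, dJ/db
    w -= alpha * dw                        # gradient-descent update
    b -= alpha * db
```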

Ok, notation agreed.

So we compute the gradients for layer l = L first, then the previous layer l = L - 1?

How are the parameters updated to optimal values for each layer in a 3 layer network like the MNIST handwritten digit example?

Excellent question! I am going to address it tomorrow because it is late in my timezone.

Ok, no problem.

Thanks.

So we compute the gradients for layer l = L first, then the previous layer l = L − 1?

Yes, we begin at the last layer and work back to the first one. This is why it’s called backpropagation.

Ok, but when computing the gradients for the last layer, we need the output a^{[L - 1]} from the previous layer - yes? And that output a^{[L - 1]} is from forward prop - yes?

Let a^{[0]} = x^{(i)}, x^{(i)} \in {\mathbb R^{1 \times n}}, and {\bar y}^{(i)} \in \{0, 1\}^{1 \times N} is the one-hot target vector for N classes.
The multi-layer neural network forward pass can be written as
z^{[l]} = a^{[l - 1]}W^{[l]} + b^{[l]}, a^{[l]} = g(z^{[l]}), for l = 1, \dots, L - 1
z^{[L]} = a^{[L - 1]}W^{[L]} + b^{[L]}
\displaystyle a^{[L]} = {e^{z^{[L]}} \over \sum_{j = 1}^ N e^{z^{[L]}}_j} = {\hat y}^{(i)}, where e^{z^{[L]}} \in \mathbb R^{1 \times N} is the elementwise exponential
\displaystyle {\mathcal L}^{(i)} = - {\bar y}^{(i)} \cdot \log {\hat y}^{(i)}
\displaystyle J = {1 \over m} \sum_{i = 1}^m {\mathcal L}^{(i)}.
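As an illustration only, here is how this forward pass might look in NumPy for a single example, assuming ReLU hidden activations and made-up MNIST-style layer sizes (784 -> 128 -> 64 -> 10); all names and shapes here are my own choices:

```python
import numpy as np

# Forward pass matching the equations above (row-vector convention:
# z = a_prev @ W + b). Layer sizes are an illustrative choice for a
# 3-layer MNIST-style network: 784 -> 128 -> 64 -> 10.
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
Ws = [rng.normal(0, 0.01, (sizes[l], sizes[l + 1])) for l in range(3)]
bs = [np.zeros((1, sizes[l + 1])) for l in range(3)]

def forward(x):
    a = x                                  # a^[0] = x, shape (1, 784)
    for l in range(2):                     # hidden layers l = 1, 2
        a = np.maximum(a @ Ws[l] + bs[l], 0)   # a^[l] = ReLU(z^[l])
    z = a @ Ws[2] + bs[2]                  # output logits z^[L]
    e = np.exp(z - z.max())                # numerically stable softmax
    return e / e.sum()                     # a^[L] = y_hat

x = rng.normal(size=(1, 784))              # a fake "image"
y_hat = forward(x)
loss = -np.log(y_hat[0, 3])                # cross-entropy if the true class were 3
```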
The derivative of the loss with respect to z^{[L]} is:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)} \quad \in \mathbb{R}^{1 \times N}.

For l \le L - 1 we have \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l + 1]}} {W^{[l+1]}}^\top \right) \circ g'(z^{[l]}), where \circ denotes elementwise (Hadamard) product, and g' is the derivative of the activation function at layer l.

Then the gradients are:

\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} \quad \in \mathbb{R}^{n_{l-1} \times n_l},

\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} =\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} \quad \in \mathbb{R}^{1 \times n_l}.

For the cost J we have \frac{\partial J}{\partial W^{[l]}} = {1 \over m} \sum_{i = 1}^m \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}}, and \frac{\partial J}{\partial b^{[l]}} = {1 \over m} \sum_{i = 1}^m \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}}.
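Here is a minimal per-example sketch of these backward-pass formulas in NumPy, with made-up layer sizes (4 -> 5 -> 3) and a ReLU hidden layer; averaging the per-example gradients over the m examples would give the gradients of J:

```python
import numpy as np

# Backward pass implementing the gradient formulas above for a single
# example; sizes (4 -> 5 -> 3 classes) are illustrative only.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 0.1, (4, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(0, 0.1, (5, 3)), np.zeros((1, 3))
x = rng.normal(size=(1, 4))
y_bar = np.array([[0.0, 1.0, 0.0]])        # one-hot target

# forward
z1 = x @ W1 + b1
a1 = np.maximum(z1, 0)                     # ReLU
z2 = a1 @ W2 + b2
e = np.exp(z2 - z2.max())
y_hat = e / e.sum()                        # softmax

# backward, from layer L = 2 down to layer 1
dz2 = y_hat - y_bar                        # dL/dz^[L] = y_hat - y_bar
dW2 = a1.T @ dz2                           # dL/dW^[2] = a^[1]^T dz^[2]
db2 = dz2                                  # dL/db^[2] = dz^[2]
dz1 = (dz2 @ W2.T) * (z1 > 0)              # (dz^[2] W^[2]^T) ∘ ReLU'(z^[1])
dW1 = x.T @ dz1                            # dL/dW^[1]
db1 = dz1                                  # dL/db^[1]
```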

Does this math assume one unit per layer?

Can you take me through your working for how you got this?

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)} \quad \in \mathbb{R}^{1 \times N}

Here is a thread which covers that in the context of the way things are presented in DLS.

Note that MLS does not cover the details of back prop, but DLS does. One strategy here would be to “hold that thought” and then take DLS when you finish MLS.

@paulinpaloalto This thread that you cited applies to single-valued variables and not vectors. It also uses the sigmoid activation function and not the softmax activation function so it does not answer my question.

Here we assume that layer l has n_l units.

How does that tell me how many units, n_1, are in layer 1, for example?

Actually I suppose the value held in this variable is the number of units in layer 1.

It is up to us to decide how many units are in each layer.


Can you take me through your working for how you got this?

For now you can check this material, which discusses your question in great detail. I’ll walk through the derivation later in this thread as well.

The thread I linked applies to vectors as well: all the operations from z in the last layer to the cost are “elementwise”. That’s a good point about sigmoid versus softmax. It turns out the math works out the same as with sigmoid, but the derivation is a bit more complicated. @conscell has provided a link to the CS229 material. Here’s another derivation from StackExchange.


Taking into account the property of the one-hot vector (it selects the true-class entry), we expand the log of the softmax using the identity \log(a / b) = \log a - \log b:

{\mathcal L^{(i)}} = - \log {\hat y}^{(i)}_{y^{(i)}} = -\log {e^{z^{[L]}_{y^{(i)}}} \over \sum_{j = 1}^N e^{z^{[L]}}_j} = \log \sum_{j = 1}^N e^{z^{[L]}}_j - z^{[L]}_{y^{(i)}}.

Take the derivative w.r.t. each logit z^{[L]}_k:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}_k} = {\sum_{r = 1}^N {\partial \over \partial z^{[L]}_k}e^{z^{[L]}}_r \over \sum_{j = 1}^N e^{z^{[L]}}_j} - 𝟙\{k=y^{(i)}\} ={e^{z^{[L]}}_k \over \sum_{j = 1}^N e^{z^{[L]}}_j} - {\bar y}^{(i)}_k = {\hat y}^{(i)}_k - {\bar y}^{(i)}_k.

Thus, in vector notation:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)}.
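If it helps, this result is easy to verify numerically with a central finite-difference check; the logits and class index below are arbitrary:

```python
import numpy as np

# Numerical check of dL/dz = y_hat - y_bar for softmax + cross-entropy.
z = np.array([0.5, -1.0, 2.0])             # arbitrary logits, N = 3 classes
y_idx = 2                                  # arbitrary true class index

def loss(z):
    # log-sum-exp of the logits minus the true-class logit (as derived above)
    return np.log(np.exp(z).sum()) - z[y_idx]

y_hat = np.exp(z) / np.exp(z).sum()
y_bar = np.eye(3)[y_idx]
analytic = y_hat - y_bar                   # the closed-form gradient

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k]) - loss(z - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric))      # True
```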