Shape of the weights for backpropagation

I am trying to replicate the full implementation of C2_W1_Lab02_CoffeeRoasting_TF using numpy instead of TensorFlow. The backpropagation function has turned out to be the trickiest part, especially since it wasn’t discussed thoroughly in the course.

After understanding the intuition and the math behind it, I deduced the equations below (a small numpy sketch of how they fit together follows the list):

  1. \delta^{[L]} = (a^{[L]} - y^T)
  2. \delta^{[l]} = ((w^{[l+1]})^T \cdot \delta^{[l+1]}) * \sigma^\prime(z^{[l]})
  3. \frac{\partial J}{\partial w^{[l]}} = \delta^{[l]} \cdot (a^{[l-1]})^T (the matrix product already sums over the training examples)
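This is a minimal numpy sketch of how I would wire these three equations together, assuming sigmoid activations in every layer with a binary cross-entropy loss (so equation [1] holds), activations stored one column per example, and weights stored in the (S_{out}, S_{in}) layout; all names are illustrative, not the lab’s actual code, and I also average the gradient over the batch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, activations, zs, y):
    """Gradients dJ/dW for an all-sigmoid network (illustrative layout).

    weights[i]     : W for layer i+1, shape (units_{i+1}, units_i)  -- (S_out, S_in)
    activations[i] : a^[i], shape (units_i, m); activations[0] is X.T
    zs[i]          : z^[i+1], shape (units_{i+1}, m)
    y              : labels, shape (m, 1), hence the transpose in equation (1)
    """
    L = len(weights)
    m = y.shape[0]
    grads = [None] * L

    # Equation (1): error term at the output layer
    delta = activations[L] - y.T                      # (units_L, m)

    for l in range(L, 0, -1):
        # Equation (3): gradient w.r.t. this layer's weights, averaged over m
        grads[l - 1] = (delta @ activations[l - 1].T) / m
        if l > 1:
            # Equation (2): propagate the error term one layer back
            s = sigmoid(zs[l - 2])                    # sigma(z^[l-1])
            delta = (weights[l - 1].T @ delta) * s * (1 - s)
    return grads
```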

In equation [2], w^{[l+1]} is transposed, so I transposed it in my implementation as well, but it threw a ValueError because the operands of the dot product had incompatible shapes.
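For reference, this is a minimal reproduction of the kind of shape clash I mean, using made-up sizes (a layer of 3 units feeding a layer of 2 units, with a batch of 5 examples):

```python
import numpy as np

delta_next = np.ones((2, 5))       # delta^[l+1]: (units_{l+1}, m)

W_next_math = np.ones((2, 3))      # textbook layout: (S_out, S_in)
W_next_tf = W_next_math.T          # TensorFlow layout: (S_in, S_out)

# Equation (2) with the textbook layout works:
delta_prev = W_next_math.T @ delta_next          # (3, 2) @ (2, 5) -> (3, 5)

# The same formula with the TensorFlow layout does not:
try:
    W_next_tf.T @ delta_next                     # (2, 3) @ (2, 5) -> mismatch
except ValueError as err:
    print("ValueError:", err)
```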

After deeper investigation, I found that the shape of the weights w in the course content and in TensorFlow is (S_{in}, S_{out}), while it is given as (S_{out}, S_{in}) almost everywhere else, even in threads discussed here, such as the two below:

  1. The parameter w2 of layer 2 is of shape (layer2.number_of_units, layer1.number_of_units).
  2. Surprisingly, TensorFlow arranges weights in a matrix of shape (number of neurons (features) in the last layer, number of neurons in this layer).

That convention would justify the transpose perfectly.
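One quick way to confirm this (assuming TensorFlow is installed; the layer below is just a toy, not the lab’s model) is to inspect a Dense layer’s kernel directly:

```python
import tensorflow as tf

# A toy Dense layer: 3 input features -> 2 units
layer = tf.keras.layers.Dense(units=2, activation="sigmoid")
layer.build(input_shape=(None, 3))

kernel, bias = layer.get_weights()
print(kernel.shape)   # (3, 2) -> (S_in, S_out), the TensorFlow/course convention
print(bias.shape)     # (2,)
```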

So my question is: what is the standard for the shape of the weights?

For the sake of sharing knowledge, I studied backpropagation from:

  1. Backpropagation calculus | Chapter 4, Deep learning - YouTube
  2. Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018) - YouTube
  3. Neural networks and deep learning

Hello @Mahmad.Sharaf ,

There is no single standard: the shape of the weights can be determined by you, and it will work AS LONG AS YOU KEEP IT CONSISTENT AND ADJUST THE FORMULAS TO YOUR CHOSEN SHAPE across your entire model.

You can define W’s shape as (current_layer_units, previous_layer_units), or you can define it as (previous_layer_units, current_layer_units).

And moving forward, just make sure that the linear equation and all other formulas are consistent with your definition. For example, if you define W’s shape = (current_layer_units, previous_layer_units), the linear equation takes the form z = W · X.T + b (note that X, whose rows are the examples, is transposed here); with W’s shape = (previous_layer_units, current_layer_units), it becomes z = X · W + b, with no transpose.
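To make the equivalence concrete, here is a small numpy check under assumed toy sizes (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, prev_units, curr_units = 4, 3, 2          # toy sizes

X = rng.normal(size=(m, prev_units))         # rows are examples

# Convention A: W has shape (current_layer_units, previous_layer_units)
W_a = rng.normal(size=(curr_units, prev_units))
b_a = np.zeros((curr_units, 1))
z_a = W_a @ X.T + b_a                        # (curr_units, m)

# Convention B: W has shape (previous_layer_units, current_layer_units)
W_b = W_a.T                                  # same parameters, transposed storage
b_b = np.zeros((1, curr_units))
z_b = X @ W_b + b_b                          # (m, curr_units)

print(np.allclose(z_a, z_b.T))               # True: same numbers, different layout
```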

In fact, if you decide to follow this specialization with the Deep Learning Specialization, you’ll notice that Prof. Ng uses a different shape for W there than the one he uses in the Machine Learning Specialization you are taking.

Again: the key is to be consistent with your chosen shape.

You can see another response to this very same question HERE from one of our Super Mentors, @paulinpaloalto .

I hope this sheds light on your question.

Juan


I am glad that this is all it was about.

Thank you so much for the detailed response.


Thanks for the explanation! It was a bit confusing; it would be nice if the labs and docs in the ML Specialization didn’t have this transposition, as it seemed intentional based on how it’s written, and I spent time trying to understand why.