If I take a look at this slide, I can see that each layer inside a CNN has n[l]_c unique biases (b_1, b_2, …). So the number of parameters in each layer should be calculated by the following formula: f[l]_H x f[l]_W x n[l]_c + n[l]_c
Please correct me if I’m wrong. It seems like the formula above is not working correctly for the quiz in Week 1, and the reason seems to be the number of biases. So, is there only one bias in each layer being propagated to each filter, or are there n[l]_c unique biases?
There is one bias value for each output channel. So the number of trainable parameters in a layer that has a filter array W which is f x f x nC_{prev} x nC would be:
f * f * nC_{prev} * nC + nC
Or maybe it’s clearer to write it like this:
nC * (f * f * nC_{prev} + 1)
In other words, for each output channel, you’ve got a filter the same shape as the input plus one bias term.
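A quick way to sanity-check that count (just a sketch, assuming you have TensorFlow/Keras available; the specific sizes f = 3, nC_prev = 3, nC = 8 and the 64 x 64 input are only for illustration):

```python
import tensorflow as tf

# one conv layer with 8 filters of size 3 x 3 over a 3-channel (e.g. RGB) input
f, nC_prev, nC = 3, 3, 8
layer = tf.keras.layers.Conv2D(filters=nC, kernel_size=f)
layer.build(input_shape=(None, 64, 64, nC_prev))   # builds the kernel and bias variables

print(layer.count_params())          # 224, as reported by Keras
print(nC * (f * f * nC_prev + 1))    # 224 = 8 * (3*3*3 + 1), the formula above
```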
Why is the number of biases nC and not nC_{prev} * nC? I mean, for an RGB image, that would have allowed the user to give higher preference to a specific color channel over the others. While (I think) that can be achieved by adjusting the weights, wouldn’t a bias per color channel have been easier?
It’s a worthy question, but it turns out that doing that doesn’t really add anything other than complexity. You already have learnable parameters corresponding to the weight that gets multiplied by each color channel of the input. If you add a bias at that level, then you can rearrange the computation so that the single bias term is the sum of all those extra terms you just added. So you could have learned the same solution with just a single bias term. With your layout, we now have more parameters to learn in order to get to the same place. So why bother?
The other way to make this point is to say that if you want to give more weight to one of the color channels, the model as Prof Ng has defined it already has the ability to do that, right? Each filter already has separate w_{ij} values for each channel. So if giving more emphasis to (say) red colors in the image gives better results, the model can learn that already by just increasing the values it assigns to the w values that correspond to the red channel.
Of course this example is just that: an example. Note that it’s not really the right idea to think of the channels in the inner layers of the network as “colors”. Once you get past the input layer, they are just real numbers that can represent anything (the intensity of the signal that I think I see something that looks like a vertical edge or a cat’s ear or an elephant’s trunk) rather than something predefined like “colors”. See the lecture “What are Deep ConvNets Learning” in Week 4 for more information about that.
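Here is a minimal numpy sketch of that “rearrange the computation” argument (not something from the lectures; b_per_channel is a hypothetical per-input-channel bias, named just for illustration):

```python
import numpy as np

np.random.seed(0)
f, nC_prev = 3, 3
x_patch = np.random.randn(f, f, nC_prev)   # one f x f x nC_prev slice of the input
W = np.random.randn(f, f, nC_prev)         # one filter, i.e. one output channel

b_per_channel = np.random.randn(nC_prev)   # hypothetical bias for each input channel
b_single = b_per_channel.sum()             # the equivalent single bias

# sum over channels, adding a separate bias per channel ...
z_per_channel = sum(np.sum(W[:, :, c] * x_patch[:, :, c]) + b_per_channel[c]
                    for c in range(nC_prev))
# ... versus the standard formulation with one scalar bias
z_single = np.sum(W * x_patch) + b_single

print(np.isclose(z_per_channel, z_single))  # True: same output, fewer parameters
```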
Paul, thank you so much for your explanation. I am inspired by the concept that we don’t need a separate bias for each RGB channel. If we want to detect vertical edges, the W values are all the same no matter which of the R/G/B channels we are in.
Yet, why don’t we have a bias for each W inside an f x f matrix, as shown in my diagram?
In a filter of size 2 x 2, there are 4 distinct weight values.
In a fully connected setting, Z = W * X + b, where b is a vector that is the same size as Z.
Why is it that when we do convolution, we seem to “share one bias value among all the weight values inside a filter matrix”?
I addressed this point in my previous response, but perhaps what I said was too many words and should have included some formulas. Yes, you can do that, but you gain nothing from the 4 separate b values: you can just add them up into a single scalar value and it is mathematically equivalent. For a 2 x 2 filter applied at one position, a separate bias per weight gives:
(w_11 * x_11 + b_11) + (w_12 * x_12 + b_12) + (w_21 * x_21 + b_21) + (w_22 * x_22 + b_22)
= (w_11 * x_11 + w_12 * x_12 + w_21 * x_21 + w_22 * x_22) + b, where b = b_11 + b_12 + b_21 + b_22
So why learn 4 separate parameters when learning 1 is equivalent? Just learn the single scalar b and be done. If you take the channels into account, it’s actually 12 separate b values. Why learn 12 parameters, when it’s mathematically equivalent to a single parameter?
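The same point as a numpy sketch for the 2 x 2 x 3 case, with a hypothetical bias attached to every weight (12 values), just to show the equivalence numerically:

```python
import numpy as np

np.random.seed(1)
x_patch = np.random.randn(2, 2, 3)   # one 2 x 2 x 3 patch of the input
W = np.random.randn(2, 2, 3)         # one 2 x 2 x 3 filter (one output channel)
B = np.random.randn(2, 2, 3)         # hypothetical per-weight biases: 12 values

z_twelve_biases = np.sum(W * x_patch + B)        # a bias added to every product
z_single_bias = np.sum(W * x_patch) + np.sum(B)  # one scalar b = the sum of the 12

print(np.isclose(z_twelve_biases, z_single_bias))  # True
```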
Paul, you are really a hero. I got stuck for 2 weeks, checking tons of “other resources” such as YouTube explanations online. It turns out that many of those use a matrix to represent the bias term. The more I checked, the more confused I got. Truly, thanks for this crystal clear explanation.
And let me summarize the learning.
The number of bias terms depends on the number of filters (the W arrays), NOT on the number of scalar values in matrix Z. Though we could assign one bias to each weight in W, it doesn’t improve our learning algorithm’s performance, since the sum of several scalars is still a scalar!!!
Attached is a jpeg of an expanded version without using any summation terms, just to share with all learners. Happy learning.