In the third week of this course, I learned that the weights of the linear transformation part of the NN should not be initialized as zero matrices, because every parameter would then be updated in exactly the same way, and the units would lose their ability to learn separate features.

However, even if the linear transformation part (W) starts as a 0 matrix, wouldn't learning still be asymmetric if different initial values are set for the constant parameter (b) that is added afterwards?

Why do we care that the initial value of W is not a 0 matrix? If it's common to start b with a 0 vector, then that makes sense, but if so, why is b initially set to a 0 vector?

Think of it this way:

For simplicity, consider X = 1, W1 = W2 = W3 = 15, and b1 = b2 = b3 = 2.

For every Wx + b, the answer will be 17. What this means is, as you mentioned, there are no separating features.

Now consider X = 1, W1 = 3, W2 = 5, W3 = 15 and b1 = b2 = b3 = 2

The values now become: W1x + b1 = 5, W2x + b2 = 7, W3x + b3 = 17. Now as you can see, different values of W gave us different values, even when b was consistent.
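The two cases above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `unit_outputs` is made up for illustration and is not from the course.

```python
def unit_outputs(x, weights, biases):
    """Return w * x + b for each (w, b) pair, one value per unit."""
    return [w * x + b for w, b in zip(weights, biases)]

x = 1

# Case 1: identical W and identical b, so every unit computes the same value.
same_w = unit_outputs(x, [15, 15, 15], [2, 2, 2])

# Case 2: different W, identical b, so the units already differ.
diff_w = unit_outputs(x, [3, 5, 15], [2, 2, 2])

print(same_w)  # [17, 17, 17]
print(diff_w)  # [5, 7, 17]
```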

Now, W is a matrix, whereas each unit has only a single bias value b. When the Ws are initialised randomly, giving every unit the same value of b does not have much effect on them: the values of W still remain different from each other, which is what we aim to achieve.

This is why we care more about having random values of W and don't care much even if the b's are set to 0.

Hope I made sense,

Mubsi

Hi, Mubsi san,

Thank you for your reply!

I'm sorry that I couldn't convey the intent of the question well enough.

Indeed, even if the initial value of b is constant for each feature, the parameter update proceeds well if the initial value of W is random.

However, by the same reasoning, even if the initial value of W is constant, a random b makes the outputs asymmetric, so learning should still progress. That is the point of my question.

For example, X=1, W1=W2=0, b1=1, b2=2.

In other words, the fact that the initial value of W is not a 0 matrix is a sufficient condition for successful learning, but it is not a necessary condition. Why would you prefer a random starting value for W over a random starting value for b?

I don't think the reply I received just now is an answer to that question, but is my understanding insufficient?

Best,

Shiori

On Mon, Oct 24, 2022 at 18:18, Muhammad Mubashar via DeepLearning.AI <notifications@dlai.discoursemail.com> wrote:

Yes, you are correct that you can "break symmetry" by making the W values constant and the b values random. My guess is that the reason the common practice is to use W as the random values is that it must give better convergence in most cases. You can try some experiments and see if you can see any difference. Here's a thread from a while back that discusses Symmetry Breaking in more detail.
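As a starting point for such an experiment, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: a tiny network with 1 input, 2 tanh hidden units and 1 sigmoid output, a made-up three-point dataset, and an arbitrary learning rate. All weights start at zero and only the hidden biases are varied.

```python
import math

def train(b1, steps=5, lr=0.5):
    """Gradient descent on a 1-input, 2-hidden-unit (tanh), 1-output (sigmoid) net.

    All weights start at zero; only the initial hidden biases b1 are supplied.
    Returns the hidden-layer weights after training.
    """
    data = [(-1.0, 0.0), (0.5, 1.0), (2.0, 1.0)]  # toy (x, y) pairs, made up
    w1, b1 = [0.0, 0.0], list(b1)
    w2, b2 = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        dw1, db1, dw2, db2 = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 0.0
        for x, y in data:
            # Forward pass
            a = [math.tanh(w1[i] * x + b1[i]) for i in range(2)]
            z2 = w2[0] * a[0] + w2[1] * a[1] + b2
            y_hat = 1.0 / (1.0 + math.exp(-z2))
            # Backward pass (cross-entropy loss), averaged over the data
            dz2 = y_hat - y
            db2 += dz2 / n
            for i in range(2):
                dz1 = dz2 * w2[i] * (1.0 - a[i] ** 2)
                dw2[i] += dz2 * a[i] / n
                dw1[i] += dz1 * x / n
                db1[i] += dz1 / n
        # Parameter update
        for i in range(2):
            w1[i] -= lr * dw1[i]
            b1[i] -= lr * db1[i]
            w2[i] -= lr * dw2[i]
        b2 -= lr * db2
    return w1

w1_same_b = train(b1=[1.0, 1.0])  # zero W, identical biases
w1_diff_b = train(b1=[1.0, 2.0])  # zero W, different biases
print(w1_same_b)
print(w1_diff_b)
```

With identical biases the two hidden weights stay equal to each other after every step, while with different biases they drift apart after a couple of updates. This only shows that symmetry is broken on this toy setup; it says nothing about how well either scheme converges.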

Note that there are a number of different possible random initialization algorithms. They show us a very simple one in Week 3 and Week 4 of Course 1. But it turns out those straightforward algorithms do not always work very well. Prof Ng will show us some more sophisticated initialization algorithms and discuss these issues in more detail in Course 2, so stay tuned for that. I point this out to give some background on my comment that there may be a reason for not using the bias values for symmetry breaking. Initialization matters for how well training converges, and there is no single "silver bullet" solution that works best in all cases.

Thanks Paulinpaloalto san,

I'm looking forward to learning more about this in the next course and to reading your thread!

Best,

Shiori

On Tue, Oct 25, 2022 at 0:03, Paul Mielke via DeepLearning.AI <notifications@dlai.discoursemail.com> wrote: