Why do we have to multiply by 0.01 and not just take those random values as generated? Why do we also sample with `randn` and not just `rand`?
Small random values from a normal distribution are a good choice for the initial weight values. The 0.01 scaling keeps the weights small so that activations like sigmoid or tanh start out in their non-saturated region, where the gradients are large enough for learning to make progress.
Weight values can be either positive or negative, right? If you use `rand`, you get only positive values. A normal distribution is also a better model in general for "real world" statistical phenomena. If you start with all positive weights, but you need to learn some values that are negative, maybe it takes longer? Just an intuition, not a mathematical proof, of course. But the higher-level point is that everything is experimental here and there is no one universal right answer that works best in all cases.
So if you have a particular case, try both `randn` and `rand` with 0.01 scaling and see if you notice any difference in convergence and the accuracy of the resulting model. But as I mentioned, there is no universal answer, so even if you do find a case where `rand` happens to work better, I think what Prof Ng is saying is that you have a better chance of success if you start with `randn`. But you have to run the experiment to know for sure in a given case …