Course 1 Week 3

bgoyal · May 21, 2021, 7:05am

Why are the random values we use to initialise the W1 and W2 matrices divided by 100? Why can’t we just use the generated values as they are?

paulinpaloalto · June 27, 2021, 11:02pm

It turns out that smaller initial values are better for convergence of gradient descent. You can also have problems with large values going into the sigmoid function, because it “saturates”: its output gets very close to 0 or 1. Mathematically the output is never exactly 0 or 1, but in floating point it can round to exactly that value. If you get exactly 1 as the output of sigmoid, the cost function has to evaluate log(1 - 1) = log(0), and the cost can come out as NaN.
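Here is a minimal NumPy sketch of that saturation effect. It is not the assignment's code: the helper names `sigmoid` and `cross_entropy`, the array shapes, the seed, and the deliberately unscaled inputs `X` are all assumptions made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, written the naive way."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    """Mean binary cross-entropy; breaks if a is exactly 0 or 1."""
    # Silence the log(0) warnings so the NaN is easy to see.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)            # arbitrary seed
X = 50.0 * rng.standard_normal((2, 5))    # hypothetical unscaled inputs
W_large = rng.standard_normal((4, 2))     # raw random weights
W_small = W_large / 100                   # same weights divided by 100

# Smaller weights keep the linear outputs Z = W @ X closer to 0,
# where sigmoid still has a usable slope.
z_large = W_large @ X
z_small = W_small @ X
print("max |Z| with raw weights:   ", np.abs(z_large).max())
print("max |Z| with scaled weights:", np.abs(z_small).max())

# Slope of sigmoid is a * (1 - a): healthy near 0, gone when saturated.
for z in (0.4, 40.0):
    a_z = sigmoid(z)
    print(f"z = {z}: slope = {a_z * (1 - a_z)}")

# In float64, sigmoid(40) rounds to exactly 1.0, so log(1 - a) is log(0).
a = sigmoid(np.array([40.0]))
y = np.array([1.0])
print("a == 1.0 exactly:", a[0] == 1.0)
print("cost:", cross_entropy(a, y))       # 0 * (-inf) gives nan
```

With a label of 1 the second term of the cost is 0 * log(0), which is 0 * (-inf) and evaluates to NaN. With a label of 0 the same saturated output would give an infinite cost instead.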
