While doing Exercise 2 - initialize_parameters_random
in DLS C2W1 Initialization,
we are asked to initialize the weights to large random values (scaled by *10) and the biases to zero. The result of this is:
On the train set:
Accuracy: 0.83
On the test set:
Accuracy: 0.86
But when I try small random values (scaled by *0.01), the result turns bad:
On the train set:
Accuracy: 0.4633333333333333
On the test set:
Accuracy: 0.48
To figure out why it works like that, I also tried scaling by 1 and by 0.1.
Only a scale of 1 or larger gives correct classification.
Why?
I remember that in a previous exercise the weights were initialized with np.random.randn(shape) * 0.01.
Could someone help me figure out the reason? I would appreciate it very much.
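For reference, this is roughly the initialization pattern I mean, with the scale factor pulled out as a parameter so the *10 / *1 / *0.1 / *0.01 experiments are easy to reproduce (my own sketch, not the graded solution; layers_dims is assumed to be the list of layer sizes):

import numpy as np

def initialize_parameters_random(layers_dims, scale=10):
    # W[l]: standard normal values times the chosen scale; b[l]: zeros
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * scale
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters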
In this lab, they are trying to show the difference between different types of weight initialization. The weights change on every pass of training, and with enough iterations a good local optimum is hopefully reached.
But if the starting weights are bad, it can take many iterations, and hopefully the process doesn't get stuck in a "not so good" local optimum. That is why they suggest weights at 10 times the standard normal values: those weights probably converge faster to a good optimum.
I think the cost graph that is displayed is misleading.
PyPlot automatically rescales the plot vertically to show movement, but the y values (the cost) are probably barely moving at all. The value at the top left of the plot is the offset (the scale), I think.
Try this plotter:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

def plot2(costs, learning_rate):
    plt.plot(costs, marker='o', linestyle='-')
    # Keep the y-axis range extremely tight so even tiny movements in the cost are visible
    plt.ylim(min(costs) - 1e-10, max(costs) + 1e-10)
    plt.ticklabel_format(style='plain', axis='y')  # avoid scientific notation
    plt.ylabel("cost")
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    ax = plt.gca()
    ax.yaxis.set_major_formatter(ScalarFormatter(useOffset=False))  # no offset label at the top of the axis
    plt.grid(True)
    plt.show()
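For example, if costs is the list of costs recorded every hundred iterations by the model (the values below are hypothetical, nearly flat costs), you would call it like this:

costs = [0.6931471, 0.6931468, 0.6931466, 0.6931465]  # hypothetical values
plot2(costs, learning_rate=0.01)

With a nearly flat curve like this, the tight y-limits and the plain (non-offset) tick labels make it obvious that the cost is barely decreasing at all.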
Note that for small weights, the gradients will be small, since they are element-wise multiplied by the uniformly small Z's, several times over in the case of a deep NN (that is "the problem of vanishing gradients"). The activation derivatives will be ~1 (or either 0 or 1 in the case of ReLU) because the Z's are near 0, but that won't change anything. Nothing will move particularly fast. You will have to crank up the number of iterations or the learning rate.
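To see this numerically, here is a small toy sketch (my own example, not from the lab) comparing the typical magnitude of Z and of the ReLU activations for different weight scales:

import numpy as np

np.random.seed(1)
X = np.random.randn(10, 300)  # toy input: 10 features, 300 examples

for scale in (0.01, 1, 10):
    W = np.random.randn(5, 10) * scale  # one toy layer of 5 units
    Z = W @ X
    A = np.maximum(0, Z)  # ReLU activation
    print(f"scale={scale}: mean |Z| = {np.abs(Z).mean():.4f}, mean A = {A.mean():.4f}")

With a scale of 0.01 both Z and A (and therefore the gradients that depend on them) are tiny, so each update barely changes the parameters; with a scale of 1 or 10 they are orders of magnitude larger.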