Good day everyone!
I have a question regarding the initialization of weights for a neural net. I was experimenting with different random starting values for the weight parameters, and I noticed that the outcome of the model changed with different starting weights.
Does this imply that a fixed set of random initial weights should be used while tuning the hyperparameters? Or is there something I might be getting wrong?
During model development, it is important to separate changes due to your intervention (training inputs, hyperparameters, architecture…) from those due to randomness. You need reproducible results in order to make sense of your experiments. This entry from the Keras FAQ may stimulate thought:
How can I obtain reproducible results using Keras during development?
During development of a model, sometimes it is useful to be able to obtain reproducible results from run to run in order to determine if a change in performance is due to an actual model or data modification, or merely a result of a new random seed…
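In practice, that means seeding the relevant random number generators before each run. Here is a minimal NumPy sketch (the `seeded_init` helper is just an illustration, not course code; with TensorFlow/Keras you would also call `tf.random.set_seed`):

```python
import random
import numpy as np

def seeded_init(seed, shape=(3, 4)):
    """Draw an initial weight matrix after fixing the relevant seeds."""
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG
    return np.random.randn(*shape) * 0.01

# Two runs with the same seed produce identical initial weights ...
w1 = seeded_init(42)
w2 = seeded_init(42)
print(np.allclose(w1, w2))   # True

# ... while a different seed produces different ones.
w3 = seeded_init(7)
print(np.allclose(w1, w3))   # False
```

With the seed fixed, any change in the results between runs can be attributed to your modification rather than to a fresh random draw.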
So the takeaway is to use a seed to ensure reproducible results instead of varying random values…! Thanks a lot @ai_curious
It is an excellent point about seeds, but at a higher level it is also true that different random initialization routines produce different results. The choice of initialization algorithm is yet another “hyperparameter”. In Course 1, we only learn the relatively straightforward method of sampling from a Normal Distribution with \mu = 0 and \sigma = 1 and multiplying by 0.01. Prof Ng will show us some more sophisticated methods, like He and Xavier initialization, in Week 1 of Course 2. Unfortunately there is no one “silver bullet” choice that works best in all cases.
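For concreteness, the three schemes mentioned can be sketched in NumPy (the layer sizes are arbitrary illustrations, and this is the "normal" variant of Xavier/Glorot):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 500, 300   # fan-in / fan-out of an illustrative layer

# Course 1 method: small scaled standard normal
w_simple = rng.standard_normal((n_out, n_in)) * 0.01

# Xavier/Glorot (normal variant): variance 1 / n_in, suited to tanh
w_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

# He initialization: variance 2 / n_in, suited to ReLU
w_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

print(w_simple.std(), w_xavier.std(), w_he.std())
```

The key difference is that Xavier and He scale the variance by the fan-in of the layer, which helps keep activations from shrinking or exploding as depth grows.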
Keras provides a wide variety of choices. Here’s the top level page on their Initializer class.
Disclaimer: I had never heard of half the algorithms on that page before I looked at it just now. So some further research will be required to understand in which situations they may be applicable.
Thank you @paulinpaloalto … this was definitely very insightful!!
Another excellent paper to read: “The Lottery Ticket Hypothesis” by Frankle and Carbin
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the “lottery ticket hypothesis:” dense, randomly-initialized, feed-forward networks contain subnetworks (“winning tickets”) that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
So not only do random initial weights produce different model performance, it’s possible to “game the system”: you can identify “lucky” initial weights that produce benchmark-level performance on much smaller architectures.
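The pruning step Frankle and Carbin describe can be caricatured in a few lines of NumPy. This is only a sketch of magnitude pruning on a single weight matrix (the actual procedure trains, prunes, and rewinds iteratively, and `w_trained` here is a stand-in, not a real trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.standard_normal((4, 4)) * 0.1          # saved initial weights
w_trained = w_init + rng.standard_normal((4, 4))    # stand-in for "after training"

# Magnitude pruning: keep the fraction p of weights that are largest after training.
p = 0.25
threshold = np.quantile(np.abs(w_trained), 1 - p)
mask = np.abs(w_trained) >= threshold

# The "winning ticket": the surviving connections, rewound to their
# original initial values, ready to be retrained in isolation.
ticket = w_init * mask
print(mask.mean())   # fraction of weights kept
```

The crucial detail in the paper is that rewind: the pruned subnetwork is reset to its *original* initialization, which is what makes those particular initial weights the "lottery ticket".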
Wow!.. I have never thought of it this way… thank you @Kyle_Hale for this.
Another great insight @paulinpaloalto
As pointed out by Dr. Ng in the Week 2 lectures, weight initialization is not an issue for logistic regression. This is because logistic regression, coupled with a binary cross-entropy loss, results in a convex optimization problem, and for convex problems any local minimum is a global minimum.
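A quick way to see the convexity claim in code (a toy NumPy sketch on synthetic data, not part of the course assignments): gradient descent on logistic regression reaches essentially the same loss from two very different starting points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(w, X, y, lr=0.5, steps=5000):
    """Plain batch gradient descent on the binary cross-entropy loss."""
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
# Noisy labels so the optimum is finite and unique
y = (X @ true_w + rng.standard_normal(200) > 0).astype(float)

loss_a = train_logreg(np.zeros(3), X, y)                 # start at zero
loss_b = train_logreg(rng.standard_normal(3) * 5, X, y)  # start far away
print(abs(loss_a - loss_b))   # tiny: both starts reach the same optimum
```

Because the loss surface has a single basin, the starting point only affects how long convergence takes, not where it ends up.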
In the Week 3 lectures, a non-convex optimization problem results from “stacking” the Week 2 logistic regression components. Because non-convex functions can have numerous local minima, the starting point (i.e. the weight initialization) becomes important: a different local minimum can be reached depending on where you start. As mentioned previously, in order to reproduce a result, you need to employ the same seed, so that the same initial weights are used.
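The effect is easy to demonstrate on a toy one-dimensional non-convex function (an illustration only, not a neural-net loss): gradient descent lands in whichever basin contains the starting point.

```python
def grad_descent(w, grad, lr=0.01, steps=500):
    """Plain gradient descent on a scalar parameter."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# f(w) = (w**2 - 1)**2 has two minima, at w = -1 and w = +1.
grad = lambda w: 4 * w * (w**2 - 1)

w_left = grad_descent(-2.0, grad)    # negative start falls into the left basin
w_right = grad_descent(+2.0, grad)   # positive start falls into the right basin
print(w_left, w_right)               # ≈ -1.0 and ≈ +1.0
```

Same algorithm, same function, two different answers; only the initialization changed.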
Finally, although everything in the Deep Learning Specialization deals with Supervised Learning, it is worth pointing out that weight initialization is highly problematic in Reinforcement Learning. For example, suppose you conduct an experiment using 10 different random seeds, then average the results from the first 5 seeds and, separately, the results from the second 5 seeds. In practice, the two averages can differ enormously, often leading to incorrect conclusions about a deep reinforcement learning algorithm.
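That seed-averaging experiment can be sketched with entirely synthetic numbers (the `run_experiment` function below just draws a noisy score; it is a labeled stand-in for a real RL training run, not real results):

```python
import numpy as np

def run_experiment(seed):
    """Stand-in for one RL training run: returns a noisy final score."""
    rng = np.random.default_rng(seed)
    return 100.0 + 40.0 * rng.standard_normal()   # large run-to-run variance

scores = [run_experiment(s) for s in range(10)]
first_half = np.mean(scores[:5])    # "average over 5 seeds", batch 1
second_half = np.mean(scores[5:])   # "average over 5 seeds", batch 2
print(first_half, second_half)
```

With only 5 seeds per average and high per-run variance, the two batch averages can land far apart, which is exactly how spurious "algorithm A beats algorithm B" conclusions arise.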
The bigger problem - not just restricted to weight initialization - is discussed in the following video:
Thank you very much for this! @earl2020wong