Hi, I ran into some problems while doing the quizzes. Can someone help me?

The first question asks which techniques can help find parameter values that attain a small value of the cost function J (given that finding the parameters takes excessively long). I don't think (1) better random initialization of the weights or (2) normalizing the input data can achieve this.

The second question is about gamma and beta in Batch Norm. I think they relate to all the hidden units in that layer. So why is the statement "there is one global value gamma and one global value beta for each layer, which applies to all the hidden units in that layer" wrong? And why can't we tune them by random sampling?

Better random initialization and input normalization do both help here: normalizing the inputs makes the cost surface more symmetric, and good initialization avoids vanishing or exploding activations, so gradient descent converges faster. Both therefore help you reach a small value of J in less time, which is exactly what the question is asking about.
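To make this concrete, here's a minimal NumPy sketch of the two techniques (toy data and He-style initialization are my own assumptions, not the quiz's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # 100 examples, 4 features

# Input normalization: zero mean, unit variance per feature,
# so no single feature dominates the cost surface.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

# He initialization for a layer with n_in inputs and n_out units:
# scale weights by sqrt(2 / n_in) to keep activation variance stable.
n_in, n_out = 4, 8
W = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)

print(X_norm.mean(axis=0).round(6))  # ~0 for each feature
print(X_norm.std(axis=0).round(6))   # ~1 for each feature
```

Both tricks leave the optimum of J unchanged; they just make it reachable in fewer gradient-descent steps.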

The point is that gamma and beta are not "global values": they are vectors (one element per hidden unit), and they are different for each layer at which batch norm is applied. They also aren't tuned by random sampling: they are learned parameters, updated by gradient descent (or Adam, etc.) together with the weights. What *is* computed from the actual values flowing through the layer is the per-mini-batch mean and variance used to normalize the activations before gamma and beta are applied.
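A tiny forward-pass sketch may help make the shapes clear (variable names like Z, gamma, beta follow the course's convention; the rest is illustrative):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-norm forward pass for one layer; Z has shape (n_units, batch_size)."""
    # mu and var are COMPUTED from the current mini-batch...
    mu = Z.mean(axis=1, keepdims=True)   # shape (n_units, 1)
    var = Z.var(axis=1, keepdims=True)   # shape (n_units, 1)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    # ...while gamma and beta are LEARNED parameters: one value per
    # hidden unit in this layer (vectors, not single global scalars).
    return gamma * Z_norm + beta

n_units, batch_size = 5, 32
rng = np.random.default_rng(1)
Z = rng.normal(size=(n_units, batch_size))
gamma = np.ones((n_units, 1))   # one scale per hidden unit
beta = np.zeros((n_units, 1))   # one shift per hidden unit

Z_tilde = batchnorm_forward(Z, gamma, beta)
print(Z_tilde.shape)  # (5, 32)
```

Note that gamma and beta here have shape (n_units, 1), one entry per hidden unit, and a deeper network would have a separate such pair for every batch-normalized layer.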

Hello, could you please elaborate on why the quiz question emphasizes "a small value of the cost function"? I understand that optimization methods like Adam and better initialization help speed up gradient descent, but why is a small cost function relevant to the premise of the problem? Thanks!