Hyperparameter optimization

It seems to me that not all hyperparameters can be found by tuning each one individually in isolation; at least some of them depend on each other. Taking the XOR example from this course, the best learning rate, it seems to me, will depend a lot on the number of hidden layers and their respective sizes.

Therefore, I was wondering whether there is a well-known approach for optimizing the hyperparameters as a set rather than in isolation. One thought was to take a smaller random subset of the data, train with each combination of hyperparameters, compute a sort of cost (fitness) function on the results, and then apply something like a gradient descent algorithm over the hyperparameters themselves.
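For concreteness, here is a minimal sketch of the first part of that idea, jointly scoring hyperparameter combinations on a held-out split instead of tuning one knob at a time. It uses scikit-learn's MLPClassifier and a synthetic XOR-style dataset purely as stand-ins for the course's network and data (those are my assumptions, not the course code), and validation accuracy plays the role of the fitness function:

```python
import numpy as np
from itertools import product
from sklearn.neural_network import MLPClassifier

# Synthetic XOR-style data as a stand-in for the course dataset.
# With a large real dataset, one would subsample it first so each trial is cheap.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Hold out a validation split; hyperparameters are scored on it, not the training data.
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

# Candidate values for the hyperparameters that seem to interact.
learning_rates = [0.001, 0.01, 0.1, 1.0]
hidden_sizes = [(2,), (4,), (8,), (4, 4)]

best = None
for lr, hidden in product(learning_rates, hidden_sizes):
    model = MLPClassifier(hidden_layer_sizes=hidden,
                          learning_rate_init=lr,
                          max_iter=2000,
                          random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy as the "fitness"
    if best is None or score > best[0]:
        best = (score, lr, hidden)

print(f"best val accuracy {best[0]:.3f} with lr={best[1]}, hidden={best[2]}")
```

This enumerates the combinations rather than taking gradient steps, since a hyperparameter like the layer size is discrete, but the point is the same: each trial scores a whole set of hyperparameters together.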

Another idea, though more time-consuming, would be to use genetic algorithms, but I'm not sure whether that is feasible or just overkill.
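A rough sketch of what I have in mind, reusing the same stand-in dataset and model as above: the "genome" is just a (learning rate, hidden units) pair, with truncation selection and mutation only, no crossover. All the names and population settings here are illustrative assumptions, not anything from the course:

```python
import random
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

def fitness(genome):
    """Validation accuracy of a network defined by (learning_rate, hidden_units)."""
    lr, hidden = genome
    model = MLPClassifier(hidden_layer_sizes=(hidden,), learning_rate_init=lr,
                          max_iter=1000, random_state=1)
    model.fit(X_tr, y_tr)
    return model.score(X_val, y_val)

def mutate(genome):
    """Perturb the learning rate multiplicatively and the layer width by +/-1."""
    lr, hidden = genome
    lr = float(np.clip(lr * 10 ** random.uniform(-0.5, 0.5), 1e-4, 1.0))
    hidden = max(1, hidden + random.choice([-1, 0, 1]))
    return (lr, hidden)

# Initial random population of hyperparameter "genomes".
population = [(10 ** random.uniform(-4, 0), random.randint(1, 8)) for _ in range(6)]

for generation in range(5):
    scored = sorted(((fitness(g), g) for g in population),
                    key=lambda p: p[0], reverse=True)
    parents = [g for _, g in scored[:3]]                 # keep the fittest half
    children = [mutate(random.choice(parents)) for _ in range(3)]
    population = parents + children
    print(f"gen {generation}: best fitness {scored[0][0]:.3f}, genome {scored[0][1]}")
```

Even this toy version makes the cost obvious: every genome in every generation means a full training run, which is why I suspect it may be overkill here.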

This is an interesting set of questions and ideas. It turns out that there is just too much material to cover it all in one course, so the systematic approach to tuning hyperparameters is reserved for Week 1 of the next course in the series. My suggestion would be to “hold that thought” and stay tuned to hear what Prof. Ng says about these issues in Course 2.