It seems to me that not all hyperparameters can be found by tuning each one individually in isolation; at least some of them depend on each other. Taking the XOR example from this course, the best learning rate will likely depend a lot on the number of hidden layers and their respective sizes.
So I was wondering whether there is a well-known approach for optimizing the hyperparameters as a set rather than in isolation. My first thought was to take a random subset of the data, train with one full set of hyperparameters at a time, compute some kind of cost (fitness) function for each set, and then use something like gradient descent on that fitness.
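To make the "optimize them as a set" part concrete, here is a minimal sketch of a joint random search over whole configurations, assuming scikit-learn is available (the course's own XOR network and training loop could be substituted for `MLPClassifier`); the fitness function here is just accuracy on the XOR data:

```python
# Joint random search over (learning rate, hidden layer sizes):
# each candidate is a whole configuration, so interactions between
# the learning rate and the architecture are explored together.
import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.neural_network import MLPClassifier

# The XOR problem: 4 samples, 2 inputs, 1 binary output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

param_space = {
    "learning_rate_init": [0.001, 0.01, 0.1, 0.5],
    "hidden_layer_sizes": [(2,), (4,), (8,), (4, 4)],
}
configs = list(ParameterSampler(param_space, n_iter=10, random_state=0))

def fitness(params):
    """Score one full hyperparameter set (higher is better)."""
    model = MLPClassifier(max_iter=2000, random_state=0, **params)
    model.fit(X, y)
    return model.score(X, y)  # training accuracy; XOR has no separate test set

best = max(configs, key=fitness)
print("best configuration:", best)
```

On a real dataset the fitness would be measured on a held-out validation subset rather than the training points themselves.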
Another idea, though more time-consuming, would be to use genetic algorithms, but I'm not sure whether that is feasible or just overkill.
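For the genetic-algorithm idea, I imagine something roughly like the sketch below; the `evaluate` function is a hypothetical placeholder standing in for "train the network with these settings and return a validation score":

```python
# Rough sketch of a genetic algorithm over (learning rate, hidden units) tuples.
import random

def evaluate(lr, hidden_units):
    # Placeholder fitness: replace with real training + validation accuracy.
    # This toy version just prefers moderate learning rates and ~4 hidden units.
    return -((lr - 0.05) ** 2) - 0.01 * abs(hidden_units - 4)

def mutate(individual):
    lr, hidden_units = individual
    lr = max(1e-4, lr * random.uniform(0.5, 2.0))            # perturb learning rate
    hidden_units = max(1, hidden_units + random.choice([-1, 0, 1]))
    return (lr, hidden_units)

def crossover(a, b):
    # Mix the learning rate of one parent with the architecture of the other.
    return (a[0], b[1])

population = [(random.uniform(0.001, 0.5), random.randint(1, 8)) for _ in range(10)]

for generation in range(20):
    ranked = sorted(population, key=lambda ind: evaluate(*ind), reverse=True)
    parents = ranked[:4]                                      # keep the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

best = max(population, key=lambda ind: evaluate(*ind))
print("best (learning rate, hidden units):", best)
```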