Isn’t there a large problem with co-adaptations of neurons during backpropagation? Gradient descent moves each weight and bias in the direction that decreases the loss, but each partial derivative is computed as if all the other parameters stay fixed. Wouldn’t there be problems when several simultaneous changes cancel each other out or add up to too big of a change?

For example (a very cherry-picked one), let’s say a neuron takes in a value of 1 and has a weight of 0.1, there is no activation function, and the expected output of the neuron is -0.1. Since the neuron currently outputs 0.1, the weight tries to become negative to fix the problem (that’s how the partial derivative works), but the weight of the previous neuron also tries to change so that the input to the last neuron becomes negative. The weight could then become -0.1 and the input could become -1, because both are being changed to minimize the loss but they aren’t “communicating with each other” to ensure that the changes don’t cancel. The output would still be 0.1, and the weight and input could keep flipping signs.

I know this is a very cherry-picked example, but in deep neural nets with thousands of parameters, isn’t it likely that parameter changes cancel each other out or produce too drastic a change, for instance when many parameters each decrease by a small amount but their combined effect on the output is much larger?
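To make the toy example above concrete, here is a minimal sketch of how I would simulate it. It assumes a squared-error loss, plain gradient descent with one shared learning rate, and a "previous neuron" that is just a single weight `v` multiplying the input. The names `simulate`, `v`, `w`, and `lr` are my own, not from any library.

```python
# Toy two-weight chain: y = w * (v * x), with loss L = 0.5 * (y - target)**2.
# All names here are hypothetical and chosen only for this sketch.

def simulate(steps=200, lr=0.1, x=1.0, target=-0.1, v=1.0, w=0.1):
    """Plain gradient descent that updates v and w at the same time."""
    history = []
    for _ in range(steps):
        a = v * x              # input to the last neuron
        y = w * a              # output (no activation function)
        err = y - target       # dL/dy
        grad_w = err * a       # dL/dw, computed as if v were fixed
        grad_v = err * w * x   # dL/dv, computed as if w were fixed
        history.append((v, w, y, 0.5 * err ** 2))
        # Both gradients come from the same old values of v and w,
        # so neither update "knows" about the other one.
        w -= lr * grad_w
        v -= lr * grad_v
    return v, w, history


if __name__ == "__main__":
    v, w, history = simulate()
    for step in (0, 1, 2, 10, 199):
        print(step, history[step])   # (v, w, output, loss) before that step
    print("final v, w, output:", v, w, w * v)
```

In this sketch the gradient on `w` is `err * a` and the gradient on `v` is `err * w * x`, so at the starting values the step on `w` is ten times larger than the step on `v`. Changing `lr` is the easiest way to probe whether the signs ever start cycling.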

Basically, my question is: how do deep neural nets manage these complex co-adaptations so that updates don’t cancel each other out or lead to drastic changes? I know learning rates can be tweaked, but it doesn’t make sense to me that one learning rate can work well for the whole network, or that choosing a specific learning rate will completely eliminate these errors, especially for deep networks. For example, in a deep network, if each weight is changed by a small amount, the change in the final output could be very large, since each change is compounded through the layers.
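Here is a rough sketch of the compounding effect I have in mind. It assumes the simplest possible "deep network": a chain of linear neurons with one weight each, all starting at 1.0 and all nudged by the same small amount `eps`. The function name `chain_output` is made up for this illustration.

```python
def chain_output(depth, weight, x=1.0):
    """Output of a chain of `depth` linear neurons that all share one weight value."""
    for _ in range(depth):
        x = weight * x   # each layer multiplies the signal by its weight
    return x


if __name__ == "__main__":
    eps = 0.01   # the same small change applied to every weight
    for depth in (1, 10, 100):
        before = chain_output(depth, 1.0)
        after = chain_output(depth, 1.0 + eps)
        print(depth, before, after, after - before)
```

Under these assumptions the output after the change is `(1 + eps) ** depth`, so the effect of many small per-weight changes grows with depth instead of staying at the size of a single change.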

I know that dropout was designed to “reduce the overfitting effect of complex co-adaptations”, which leads me to my second question. I don’t see how these complex co-adaptations lead to overfitting; from my intuition, they just seem bad for optimization in general. So I don’t see why dropout counts as a regularization method. To me it seems more like a better optimization method, similar to Adam, RMSProp, or momentum.

Lastly, couldn’t this problem be solved by updating the weights of one layer at a time, instead of changing every weight at once? Do you think that would solve the problem, or would it just be a massive waste of time?
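To show what I mean by updating one layer at a time, here is a minimal sketch on a two-weight toy chain `y = w * (v * x)` with a squared-error loss. It alternates between the two "layers", and the gradient is recomputed before every single update. The names `train_alternating`, `v`, `w`, and `lr` are hypothetical, and this is only my guess at how such a scheme would look, not an established method from a library.

```python
def train_alternating(steps=400, lr=0.1, x=1.0, target=-0.1, v=1.0, w=0.1):
    """Gradient descent that updates only one of the two weights per step."""
    losses = []
    for step in range(steps):
        a = v * x                    # forward pass with the current weights
        err = w * a - target         # dL/dy for L = 0.5 * (y - target)**2
        losses.append(0.5 * err ** 2)
        if step % 2 == 0:
            w -= lr * err * a        # even steps: update only the last layer
        else:
            v -= lr * err * w * x    # odd steps: update only the previous layer
    return v, w, losses


if __name__ == "__main__":
    v, w, losses = train_alternating()
    print("final v, w, output:", v, w, w * v)
    print("first and last loss:", losses[0], losses[-1])
```

Note that in this sketch every step needs its own forward pass, so updating each of L layers once would take L passes instead of one. That is the time cost I am worried about.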