A similar question was asked here, but the answer explained why it works rather than the motivation: ReLU is used in hidden layers WHY?. My question is about the motivation for using ReLU in the hidden layers when the output is linear, as opposed to using linear activations in the hidden layers.
Never mind, the next lecture, “Why do we need activation functions”, addressed my question: a linear function of a linear function is still a linear function.
Exactly. Because of that mathematical fact, there is no point in adding layers to the network unless each hidden layer applies a non-linear activation: composing linear layers just gives W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is still a single linear function. You don’t get a more complex function without the non-linearity.
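To make that concrete, here is a minimal NumPy sketch (the weight names W1, W2, etc. are just illustrative, not from the lecture) showing that two stacked linear layers collapse into one linear layer, while putting a ReLU between them prevents the collapse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with purely linear activations:
#   layer 1: h = W1 @ x + b1
#   layer 2: y = W2 @ h + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through the two linear layers.
y_stacked = W2 @ (W1 @ x + b1) + b2

# The exact same mapping as one linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y_stacked, y_single))  # True: the two layers collapse into one

# Insert a ReLU between the layers and the collapse no longer happens,
# so the extra layer actually adds expressive power.
relu = lambda z: np.maximum(z, 0)
y_relu = W2 @ relu(W1 @ x + b1) + b2
print(np.allclose(y_relu, y_single))  # generally False
```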