Why use ReLU for hidden layers when output layer is linear?

A similar question was asked here, but the answer explained why it works rather than the motivation: ReLU is used in hidden layers WHY?. My question is about the motivation for using ReLU in the hidden layers when the output layer is linear, as opposed to using linear activations in the hidden layers.

Is it similar to the logistic regression question here (Don't use Linear activation in hidden layers), where you want to force some non-linearity into the hidden layers?

Some more info on why we want to do this would be great. It doesn’t seem to be addressed in the “Choosing Activation Functions” video.

Never mind, the next lecture, “Why do we need activation functions,” addressed my question: a linear function of a linear function is still a linear function.
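If it helps to see that concretely, here is a minimal NumPy sketch (the shapes and random weights are made up purely for illustration) showing that two stacked linear layers collapse into a single linear layer:

```python
import numpy as np

# Hypothetical weights for two stacked *linear* layers (no activation).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Two-layer "network" with linear activations only.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses to a single linear layer W x + b.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: the extra layer added nothing
```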

The hidden layers in an NN always need a non-linear activation function.

ReLU is about the simplest non-linear function you can use, and it has very low computational cost for both forward propagation and the gradient computation.
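For comparison, a quick sketch of ReLU and its gradient in NumPy; both are just elementwise comparisons, with no exponentials like sigmoid or tanh:

```python
import numpy as np

def relu(z):
    # Forward pass: elementwise max with zero.
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Gradient w.r.t. z: 1 where z > 0, else 0, just a boolean mask.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))       # [0.  0.  0.  1.5]
print(relu_grad(z))  # [0. 0. 0. 1.]
```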

Exactly. Because of that mathematical fact, there is literally no point in adding layers to the network unless each one of them is non-linear. You don’t get a more complex function without the non-linearity.
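To make that concrete, here is a toy sketch with hand-picked weights (purely illustrative): with a ReLU between the layers, a two-unit hidden layer can represent |x|, which no purely linear network of any depth can do.

```python
import numpy as np

# Hand-picked weights for a tiny ReLU network that computes |x|.
W1 = np.array([[1.0], [-1.0]])   # hidden layer produces [x, -x]
W2 = np.array([[1.0, 1.0]])      # output layer sums the two ReLU outputs

def net(x):
    h = np.maximum(W1 @ x, 0.0)  # ReLU(x), ReLU(-x)
    return W2 @ h                # = max(x, 0) + max(-x, 0) = |x|

for x in [-3.0, -1.0, 0.0, 2.0]:
    print(x, net(np.array([x]))[0])  # prints |x| for each input
```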