Similar to a linear function when used in a hidden layer: it just turns the layer into another linear layer and does not serve any purpose.
Isn't it the same with ReLU if all the input values are always positive?
Non-linearity comes into play when the output of the affine function (i.e. wx + b) is negative. Please keep in mind that the weights and bias can be any real number.
As you rightly observed, for a non-negative output of the affine function, linear and ReLU activations are the same.
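To make this concrete, here's a minimal NumPy sketch (the pre-activation values z are just made-up numbers for illustration): the identity activation and ReLU agree wherever z >= 0, and only differ when z is negative, which is exactly where ReLU's non-linearity comes from.

```python
import numpy as np

# Compare a linear (identity) activation with ReLU on a few
# pre-activation values z = w*x + b. They match for z >= 0 and
# differ only for negative z.

def linear(z):
    return z

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # mix of negative and positive pre-activations
print("z:     ", z)
print("linear:", linear(z))  # [-3.  -0.5  0.   0.5  3. ]
print("relu:  ", relu(z))    # [ 0.   0.   0.   0.5  3. ]
```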
I still don't understand the need for ReLU compared to tanh and sigmoid?
The advantage ReLU has over tanh and sigmoid is that it’s a lot faster to compute.
Please read this page on why ReLU is a good choice, in the Advantages section. That said, one problem that's worth noting on the same page is the dying ReLU problem. A variant of the ReLU function called Leaky ReLU can be used to mitigate this issue.
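For reference, here's a quick sketch of ReLU next to Leaky ReLU (the slope alpha below is an assumed small value, often around 0.01). Leaky ReLU keeps a small non-zero output (and gradient) for negative inputs, so a unit can't get permanently stuck at zero the way a dying ReLU can.

```python
import numpy as np

# ReLU zeroes out all negative inputs; Leaky ReLU lets a small
# fraction (alpha) of negative inputs through instead.

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print("relu:      ", relu(z))        # [ 0.     0.     0.     0.1    2.   ]
print("leaky_relu:", leaky_relu(z))  # [-0.02  -0.001  0.     0.1    2.   ]
```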
I don’t think anyone uses tanh as an activation function in intermediate dense layers since ReLU (and its variants) are the default choice nowadays.