As mentioned, the activation function should not be linear. Still, the ReLU activation function is only piecewise linear: its derivative is 1 for positive inputs and 0 for negative inputs. So why are we using it? Suppose a certain layer receives only positive input values; then ReLU acts as the identity, the layer behaves like a purely linear one, and it becomes redundant, only adding computation. Could you please tell me what could be done in such scenarios? (A small illustration of this concern is sketched below.)
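To make the concern concrete, here is a minimal NumPy sketch (illustrative only, not from the course materials) of ReLU and its derivative, showing that ReLU reduces to the identity when every input is positive:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 where z > 0, 0 where z <= 0."""
    return (z > 0).astype(float)

# Mixed-sign inputs: ReLU is non-linear (negatives are zeroed out).
z_mixed = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z_mixed))             # [0.  0.  0.5 2. ]
print(relu_derivative(z_mixed))  # [0. 0. 1. 1.]

# All-positive inputs: ReLU acts as the identity, so the layer is
# effectively linear for this batch.
z_pos = np.array([0.5, 1.0, 2.0])
print(np.allclose(relu(z_pos), z_pos))  # True
```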
There will be more to come on this topic in Course 2. As a brief preview, it is common practice to "normalize" not only the feature matrix (X = A^{[0]}), but also the inputs of subsequent layers. By normalization, I mean standardizing the inputs by subtracting the mean and dividing by the standard deviation. The result is a zero-mean input: positive and negative values guaranteed!
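A minimal sketch of that standardization step, assuming the (n_features, m_examples) layout used in the course; the `standardize` helper and the `eps` guard are illustrative names, not from the original post:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Standardize each feature (row) of X: subtract the mean and divide
    by the standard deviation, computed across the m training examples.
    eps avoids division by zero for constant features."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    return (X - mu) / (sigma + eps)

# Even if the raw features are strictly positive, the standardized
# features have zero mean, so both signs appear in the normalized inputs.
X = np.abs(np.random.randn(3, 5)) + 1.0     # strictly positive raw inputs
X_norm = standardize(X)
print(np.allclose(X_norm.mean(axis=1), 0))  # True: zero mean per feature
print((X_norm < 0).any())                   # True: negative values present
```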
Thank you for your help.