AFAIK, data normalization is highly recommended for the input features. A sigmoid layer produces values in the range 0 to 1, which is fine for the next layer. But what about linear & ReLU activations? Their outputs can be arbitrarily large, so a layer with one of these activations might generate outputs on very different scales, which is not ideal for the next layer. Do we have this problem in real-world tasks, and how can it be addressed?
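To make the scale issue concrete, here is a quick toy sketch of what I mean (plain numpy, made-up numbers, not from any particular network): sigmoid squashes everything into (0, 1), while ReLU and linear outputs keep whatever scale the pre-activations happen to have.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=50.0, size=1000)   # pre-activations with a large, arbitrary scale

sigmoid_out = 1.0 / (1.0 + np.exp(-z))           # bounded in (0, 1)
relu_out = np.maximum(0.0, z)                    # unbounded above
linear_out = z                                   # scale passed through unchanged

print("sigmoid range:", sigmoid_out.min(), sigmoid_out.max())
print("relu range:   ", relu_out.min(), relu_out.max())
print("linear range: ", linear_out.min(), linear_out.max())
```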
Hi @tenzink ,
Here's my two pennies' worth of thought:
If the input data has been normalized, then what gets fed into the first layer is already the normalized data.
ReLU is an activation function made up of two parts: positive values pass through the linear part unchanged, while negative values pass through the non-linear part and are squashed to zero.
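In code, the two parts look roughly like this (a minimal plain-Python sketch, just to show the piecewise behaviour):

```python
def relu(z):
    # Linear part: positive values pass through unchanged.
    if z > 0:
        return z
    # Non-linear part: negative values (and zero) are squashed to zero.
    return 0.0

print([relu(z) for z in [-3.0, -0.5, 0.0, 2.0, 7.0]])  # [0.0, 0.0, 0.0, 2.0, 7.0]
```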
An activation function in a hidden layer is chosen for a purpose, so we need to consider why a particular one is used. Sigmoid is often used in the output layer for making a binary decision, since its output falls between 0 and 1 and can be read as a probability. Because ReLU is simple and efficient to compute, it has become the popular choice for hidden layers.
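A minimal sketch of that usual placement, assuming tf.keras is available (the layer sizes and input dimension are placeholders, not from the course):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),              # 10 input features (placeholder)
    tf.keras.layers.Dense(32, activation="relu"),    # ReLU in the hidden layers: cheap to compute
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid in the output layer for a binary decision
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```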
Normalization of the input features is standard practice, regardless of which method you're using (linear regression, classification, NNs, or whatever).
The reason is that normalized features allow you to use a larger learning rate and fewer iterations, so gradient descent runs more efficiently.
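For example, a small sketch of standardizing input features (z-score) before training, assuming scikit-learn is available; X_train and X_test here are placeholder arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])  # features on very different scales
X_test = np.array([[1.5, 2500.0]])

scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)   # fit the mean/std on the training set only
X_test_norm = scaler.transform(X_test)         # reuse the same statistics for new data

print(X_train_norm.mean(axis=0))  # roughly zero per feature
print(X_train_norm.std(axis=0))   # roughly one per feature
```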