Based on the line that states that the sigmoid is best for on/off binary situations, I want to ask the following:
Let’s consider a scenario where all of our input features are binary (e.g. they are all 0 -> negative, 1 -> positive signals from different sampling techniques).
Should we then use the sigmoid for all layers (or alternatively just for the first and last layer)?
The output of the network should also be binary in my example (e.g. purchase or not purchase).
I can elaborate more if this is not clear.
Hello Kosmetsas,
If you want your output to be between 0 and 1, that alone is a sufficient reason to use sigmoid at the last layer (this echoes the statement “The sigmoid is best for on/off or binary situations”). If you want to transform your inputs to be between 0 and 1, it is sufficient to add a sigmoid right after your input layer; however, it’s better to do this as a feature engineering step, because then you only need to apply the sigmoid transformation once.
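For concreteness, here is a minimal sketch of that setup (assuming tf.keras and made-up layer sizes, since your actual feature count and architecture aren’t specified): the binary inputs go in as-is, the hidden layers use ReLU, and only the last layer uses sigmoid so the output lands in (0, 1) for the purchase / not-purchase decision.

```python
import tensorflow as tf

# Hypothetical: 10 binary input features, sizes chosen only for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                      # binary features fed in directly
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layers keep ReLU
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # sigmoid only at the output
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",             # matches the (0, 1) output
              metrics=["accuracy"])
```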
If your features are already binary, that by itself, in my opinion, isn’t a sufficient reason to use sigmoid in any of the layers. Otherwise, we would just min-max normalize any continuous features to the range between 0 and 1, stick with sigmoid forever, and perhaps never need to invent ReLU or other activations.
Bringing in non-linearity is an important reason to use ReLU, sigmoid, or other activation functions instead of a linear activation.
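To illustrate why the non-linearity matters, here is a small NumPy check (illustrative only, with random weights of made-up shapes): two stacked layers with no activation collapse into a single linear layer, so the extra layer adds no modelling power until something like ReLU is inserted between them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))                    # 4 samples, 3 features
W1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)

# Two layers with linear (identity) activation...
two_linear_layers = (x @ W1 + b1) @ W2 + b2
# ...are exactly one linear layer with combined weights.
single_linear_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear_layers, single_linear_layer))   # True

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, single_linear_layer))            # expected: False
```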
ReLU has its own significance, and I compared ReLU and sigmoid in this thread, including a reference to Professor Ng’s DLS video on activation functions. Let me know if any of the points there need more clarification.
Raymond