Hello everyone,
I have a general question about the model structure that is proposed in the assignment.
Since we are trying to predict a binary outcome, why do we use a final layer with 2 nodes and a softmax activation function? Would it be better to use a single node with a sigmoid activation function?
The official Tax documentation also has an example with a binary target, 2 nodes and softmax, so I assume there is a good reason. As I understand it, the sigmoid can be considered a special case of the softmax: with two logits, the softmax probability of class 1 depends only on the difference between the logits, and it equals the sigmoid of that difference. Am I thinking about it correctly? But even if so, why not use the simpler structure instead?
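To make concrete what I mean by "special case", here is a small sanity check in plain Python. It is only a sketch of the math, not code from the assignment or from Tax; the function names `softmax2` and `sigmoid` are my own.

```python
import math

def softmax2(z0, z1):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(z0, z1)
    e0 = math.exp(z0 - m)
    e1 = math.exp(z1 - m)
    total = e0 + e1
    return e0 / total, e1 / total

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# The class-1 softmax probability should match sigmoid(z1 - z0).
for z0, z1 in [(0.0, 0.0), (1.5, -0.3), (-2.0, 4.0)]:
    p0, p1 = softmax2(z0, z1)
    assert math.isclose(p1, sigmoid(z1 - z0), rel_tol=1e-12)
    print(f"z0={z0:+.1f} z1={z1:+.1f} -> softmax p1 and sigmoid(z1 - z0) agree")
```

If this is right, the 2-node softmax head has one redundant degree of freedom (adding the same constant to both logits changes nothing), which is why I would expect the two setups to be equivalent in what they can represent.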
Finally, I couldn’t find any examples or tutorials for Tax with sigmoid on the internet. I tried to implement it following the same code and framework as in the assignment (changing the final Dense layer to 1 unit, replacing softmax with sigmoid, and using binary cross-entropy loss), but I’m getting an error that I can’t debug. So, if anyone can point me to some code snippets, that would be much appreciated.
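To be clear about the loss I am aiming for, here is a framework-independent sketch in plain Python. This is not Tax code and not my failing code; the function names are hypothetical and only illustrate the math. It compares binary cross-entropy on a single logit with the usual cross-entropy on two logits.

```python
import math

def binary_cross_entropy_from_logit(logit, label):
    """Binary cross-entropy for one output node, computed from the raw logit.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    where label is 0 or 1.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def two_class_cross_entropy(z0, z1, label):
    """Cross-entropy for a 2-node softmax head: logsumexp(z) - z[label]."""
    m = max(z0, z1)
    log_norm = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
    return log_norm - (z1 if label == 1 else z0)

# With logit = z1 - z0, the two losses should agree for both labels.
for z0, z1 in [(0.2, 1.1), (3.0, -1.0)]:
    for label in (0, 1):
        single = binary_cross_entropy_from_logit(z1 - z0, label)
        double = two_class_cross_entropy(z0, z1, label)
        assert math.isclose(single, double, rel_tol=1e-9)
```

So on paper the single-node version should train toward the same objective; what I am missing is how to express it correctly in the framework.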
Thank you in advance!