Question - Model Structure

Hello everyone,

I have a general question about the model structure that is proposed in the assignment.

Since we are trying to predict a binary outcome, why do we use a final layer with 2 nodes and a softmax activation function? Would it be better to use a single node with a sigmoid activation function?

The official Trax documentation also has an example with a binary target, 2 nodes, and softmax, so I assume there is a good reason. As I understand it, the sigmoid can be considered a special case of the softmax. Am I thinking about this correctly? But even if so, why not use the simpler structure instead?

Finally, I couldn’t find any examples or tutorials for Trax with sigmoid on the internet. I tried to implement it following the same code/framework as in the assignment (changing the final Dense layer to 1 unit, replacing softmax with sigmoid, and using binary cross-entropy loss), but I’m getting an error that I can’t debug. So, if anyone can point me to some code snippets, that would be much appreciated.

Thank you in advance!

Hi dmslava,

Good questions.

My guess is that softmax is used because it makes it easier to generalize to multi-class classification. I would read the use of softmax rather than sigmoid in the Trax docs as indicating that any additional compute cost is negligible in practice.
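
On the "special case" point: a two-way softmax gives the same class-1 probability as a sigmoid applied to the difference of the two logits. Here is a minimal sketch in plain Python (standard library only, not Trax code; the function names softmax2 and sigmoid are just illustrative) that checks this numerically:

```python
import math

def softmax2(z0, z1):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    total = e0 + e1
    return e0 / total, e1 / total

def sigmoid(x):
    """Logistic sigmoid, branching on the sign of x to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# The class-1 softmax probability depends only on the logit difference z1 - z0.
for z0, z1 in [(0.0, 0.0), (1.5, -0.5), (-3.0, 2.0)]:
    p0, p1 = softmax2(z0, z1)
    assert math.isclose(p1, sigmoid(z1 - z0), rel_tol=1e-12)
    assert math.isclose(p0 + p1, 1.0, rel_tol=1e-12)
```

So the two-node softmax head carries one redundant degree of freedom compared with a single sigmoid node, but it represents the same family of predictions.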

On a related note, it’s difficult to find code examples that implement sigmoid with Trax for binary classification. One related code example is this one. As is done in the example, you could try using trax.fastmath.ops.expit. But you might as well stick with softmax if you run into too much trouble.
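
If you do go the single-node route, the loss should come out the same as with the two-node softmax. The sketch below is plain Python (standard library only, not Trax code; bce_from_logit and softmax_ce are hypothetical helper names) and compares binary cross-entropy on one logit x with categorical cross-entropy on the logit pair (0, x):

```python
import math

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from the raw logit.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    where label is 0 or 1.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def softmax_ce(logits, label):
    """Categorical cross-entropy for one example; label is a class index."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

# A single logit x with sigmoid matches the two-logit pair (0, x) with softmax.
for x in (-4.0, -0.3, 0.0, 2.5):
    for y in (0, 1):
        assert math.isclose(bce_from_logit(x, y), softmax_ce([0.0, x], y),
                            rel_tol=1e-12, abs_tol=1e-12)
```

Note that the single-node version expects one number per example while the softmax version expects two, so the output and target shapes differ between the two setups. That is one thing worth checking when the swap raises an error, though I can’t tell from the description whether it is the cause here.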