Output layer: why a linear activation function instead of a relu?

Hi all,

I noticed that in the course's "Improved implementation of softmax", a linear activation is used in the output layer, i.e.:

```python
model = tf.keras.Sequential([
    tf.keras.Input(shape=(XXX,)),
    Dense(25, activation='relu', name="L1"),
    Dense(15, activation='relu', name="L2"),
    Dense(10, activation='linear', name="L3"),
])

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

In the last layer, may I use "relu" instead of "linear"? The class doesn't explain why it has to be "linear"; is it because of convergence issues?
Thanks!

1 Like

In this case it's using linear because of this line: `loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`

You are using `from_logits=True`, which basically runs a sigmoid on the linear output!

2 Likes

Right! Here’s a thread which discusses the reasons for using `from_logits = True` mode and more about what that means. And here’s one from Raymond that does a much more complete explanation of the math behind this.

2 Likes

Hmm, I don't see a sigmoid here; we are using a SparseCategoricalCrossentropy, not a BinaryCrossentropy, or am I wrong?

1 Like

I think, in general (checking with ChatGPT), the reason the final Dense layer needs a linear activation rather than one restricted to positive values (like ReLU) is that the output would otherwise be biased and we would lose important information. But I guess this is true only if the output can take negative values?

2 Likes

ReLU is not used as an output-layer activation because of its output characteristic: it clips all negative values to zero.

Both SparseCategoricalCrossentropy and BinaryCrossentropy are used for classification. You can either use a sigmoid (or softmax) activation in the output layer, or a linear activation with `from_logits=True`.
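To see why the two setups are interchangeable, here's a minimal NumPy sketch (not from the course; the logit and label values are made up for illustration) showing that for the binary case, the loss computed from a sigmoid output equals the loss computed directly from the raw logit, which is what `from_logits=True` does internally in a numerically stable form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logit and label, just for illustration.
z, y = 2.0, 1.0

# Option 1: sigmoid in the output layer, loss computed from the probability.
p = sigmoid(z)
loss_from_probs = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Option 2: linear output layer, loss computed directly from the logit
# (the numerically stable form used when from_logits=True).
loss_from_logits = np.maximum(z, 0) - z * y + np.log1p(np.exp(-abs(z)))

print(np.isclose(loss_from_probs, loss_from_logits))  # True
```

The logit form never computes `log(p)` for a `p` that may have rounded to 0 or 1, which is where the stability benefit comes from.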

1 Like

I recommend you not do this. Language models aren’t reliable sources.

2 Likes

I understand, but this doesn't explain why we prefer "linear" over "relu" at the output layer as the input to the SCC loss.

1 Like

The point is that you are doing a multiclass classification in this case, so the output layer activation is softmax, not ReLU. There is no purpose in adding ReLU to the mix. It's possible you could get that to work in training, but it serves no purpose: you would effectively have two activations at the output layer (ReLU → softmax). What is the point of that?

Did you read the threads I linked? The point is that `from_logits = True` just says that the softmax (or sigmoid in the binary case) happens as part of the cost calculation.

But then your actual trained model has logits as the output, so you manually need to include the softmax in “predict” mode. Either that or just do argmax on the logit outputs, since softmax is monotonic.
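Here's a small NumPy sketch of that last point (the logit values are made up for illustration): because softmax is monotonic, taking `argmax` over the raw logits gives the same predicted class as taking it over the softmax probabilities, so you can skip the softmax entirely in "predict" mode if all you need is the class label:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

# Hypothetical logits from a linear output layer.
logits = np.array([1.2, -0.7, 3.1, 0.4])

probs = softmax(logits)
# softmax is monotonic, so the predicted class is the same either way.
print(np.argmax(logits) == np.argmax(probs))  # True
```

You only need the explicit softmax step when you want calibrated probabilities, not just the winning class.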

3 Likes

@paulinpaloalto yes, I read it. Using linear basically means you are not applying any activation function at the end; instead, the softmax is combined with the cost calculation for numerical stability? Thanks!

1 Like

Yes, that’s correct. Although I would phrase it differently: softmax is the activation, it’s just that you don’t explicitly include it in the output layer. It gets applied in the loss calculation and then as an extra step when executing the model in “predict” mode.
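To make the stability point concrete, here's a NumPy sketch (the large logit values are made up to force the failure mode): a naive softmax overflows on large logits, while the max-shifted form, which is the kind of rearrangement the fused loss-from-logits path performs internally, stays well defined:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])  # hypothetical large logits

# Naive softmax: exp() overflows to inf, and inf/inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.exp(logits).sum()
print(np.isnan(naive).any())  # True

# Stable softmax: subtract the max logit first; the result is unchanged
# mathematically but the exponentials stay in range.
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
print(np.isclose(stable.sum(), 1.0))  # True
```

This is why letting the loss consume raw logits is safer than computing an explicit softmax and then taking its log inside the cross-entropy.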

2 Likes

`from_logits=True` is used because it gives better numerical stability when computing the softmax. You could put softmax in the last layer and leave `from_logits=False`, but the more robust way is to set `from_logits=True` with a linear last layer, so the loss computes the softmax internally.

1 Like