Output layer: why a linear activation function instead of a relu?

Hi all,

I noticed that in the course, the “Improved implementation of softmax” uses a linear activation function in the output layer, i.e.:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    tf.keras.Input(shape=(XXX,)),
    Dense(25, activation='relu', name="L1"),
    Dense(15, activation='relu', name="L2"),
    Dense(10, activation='linear', name="L3"),   # linear output: raw logits
])

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
)

In the last layer, may I use “relu” instead of “linear”? The course doesn’t explain why it has to be “linear”. Is it because of convergence issues?
Thanks!

1 Like

In this case it’s using linear because of this line: loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

You are using from_logits=True, which basically runs a sigmoid on the linear output!

2 Likes

Right! Here’s a thread which discusses the reasons for using from_logits = True mode and more about what that means. And here’s one from Raymond that gives a much more complete explanation of the math behind this.

2 Likes

Mmh, I don’t see a sigmoid here. We are using a SparseCategoricalCrossentropy, not a BinaryCrossentropy, or am I wrong?

1 Like

I think, in general (checking with ChatGPT), the reason the final Dense layer needs to be linear rather than restricted to positive values (like with ReLU) is that the output would be biased and we would lose important information in the output. But I guess this is only true if the output can take negative values?

2 Likes

ReLU is not used as an output-layer activation because of its output characteristics.

Both SCC and BC are used for classification. You can either use the corresponding activation (softmax or sigmoid) in the output layer, or a linear activation with “from_logits = True”.
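For example, here is a minimal sketch of the two setups for a 10-class output (the input shape of 400 and the single Dense layer are just placeholders, not the course model):

import tensorflow as tf

# Option A: softmax activation in the output layer, loss receives probabilities
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model_a.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
)

# Option B (preferred): linear output, loss receives raw logits and applies softmax internally
model_b = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),
    tf.keras.layers.Dense(10, activation='linear'),
])
model_b.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
)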

1 Like

I recommend you not do this. Language models aren’t reliable sources.

2 Likes

I understand, but this doesn’t explain why we prefer a “linear” over a “relu” at the output layer as the input to the SCC loss.

1 Like

The point is that you are doing a multiclass classification in this case, so the output layer activation is softmax, not ReLU. There is no purpose in adding ReLU to the mix. It’s possible you could get that to work by training, but it serves no purpose: you effectively would have two activations at the output layer (ReLU → softmax). What is the point of that?

Did you read the threads I linked? The point is that from_logits = True just says that the softmax (or sigmoid in the binary case) happens as part of the cost calculation.

But then your actual trained model has logits as the output, so you need to manually apply the softmax in “predict” mode. Either that, or just take the argmax of the logit outputs, since softmax is monotonic.
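As a quick sketch of that “predict” step (here `model` is the trained model above and `X_new` is whatever batch you want to predict on):

import tensorflow as tf

logits = model.predict(X_new)               # raw logits, shape (m, 10), not probabilities
probs = tf.nn.softmax(logits).numpy()       # apply softmax explicitly to get probabilities
preds = tf.argmax(logits, axis=1).numpy()   # or just argmax the logits: same predicted classes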

3 Likes

@paulinpaloalto yes, I read it. By using linear you are basically saying that you are not applying an activation function at the end at all, and therefore the softmax is combined with the cost calculation for numerical stability? Thanks!

1 Like

Yes, that’s correct. Although I would phrase it differently: softmax is the activation, it’s just that you don’t explicitly include it in the output layer. It gets applied in the loss calculation and then as an extra step when executing the model in “predict” mode.
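To make that concrete, here is a toy check (my own numbers, not from the course) showing that the loss computed from raw logits with from_logits=True matches applying softmax yourself and then using from_logits=False:

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])   # one example, three classes (toy values)
labels = tf.constant([0])                  # true class index

loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(labels, logits)
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)(labels, tf.nn.softmax(logits))

print(float(loss_from_logits), float(loss_from_probs))   # essentially identical values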

2 Likes

from_logits=True is used because it gives better numerical stability when computing the softmax. You could put softmax in the last layer and set from_logits=False, but the more robust and numerically stable way is to set from_logits=True with a linear last layer, so that the loss computes the softmax internally.
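To see the stability difference, here is a toy example (the exaggerated logit values are chosen only to expose the issue):

import tensorflow as tf

logits = tf.constant([[1000.0, 0.0, -1000.0]])   # exaggerated logits
labels = tf.constant([1])                        # true class is the middle one

probs = tf.nn.softmax(logits)                    # underflows: the true class gets probability 0.0

scc_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
scc_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

print(float(scc_probs(labels, probs)))     # roughly 16, limited by internal clipping, far from the true loss
print(float(scc_logits(labels, logits)))   # 1000.0, the exact cross-entropy computed directly from the logits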

1 Like

Ok thanks for the answers!

1 Like