Sigmoid activation function issues

I am using a basic CNN to attempt binary classification on 1D data (think similar to ECG). I am using a sigmoid activation on my last fully connected layer. However, when I print the predicted output values, they are only in the range of (0.5, 0.75). My understanding is that the sigmoid activation is supposed to generate prediction values between (0, 1). I have altered my model in several different ways (more/fewer layers, different learning rates, regularizers…) but still end up with the same issue. Adding more training data doesn't seem to affect it either. Has anyone run into something similar, or know what might be causing it? Would appreciate any insight.


Yes, you’re right that the range of the sigmoid function is (0,1), but the output is going to depend on the input, right? Note that sigmoid(0) = 0.5, so it must be the case that the output of your last linear layer is all > 0. Why would that be? Did you also include a ReLU before the sigmoid?

Update: note that sigmoid(1) ≈ 0.731, so the other possibility is that you cascaded two sigmoids, or a softmax followed by a sigmoid. If 0 < z < 1, then 0.5 < sigmoid(z) < 0.75.
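
To see this concretely, here's a quick numeric check (plain NumPy, purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5
print(sigmoid(1.0))   # ~0.731

# If an earlier sigmoid/softmax squeezes z into (0, 1), a second sigmoid
# can only produce values in (0.5, ~0.731) -- exactly the range reported above.
print(sigmoid(sigmoid(np.array([-5.0, 0.0, 5.0]))))  # all between 0.5 and ~0.73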

What is your hidden layer activation?

Do you have an expectation that a plain NN can handle this data?

What preprocessing are you using?

Hi paulinpaloalto,
I have ReLU activations in each of my conv layers, as well as a ReLU activation in the fully connected layer before the sigmoid layer. No other softmax or sigmoid layer is present. Even if I change my last linear layer, I still get the same results, so I'm not sure what is causing those linear outputs to all fall between 0 and 1. Your explanation makes sense, thank you. If I reduce my model to its simplest form (1 conv layer, pooling layer, FC output layer), I still get the same results. Is it logical to investigate my preprocessing as the issue?
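
For reference, the reduced model I'm describing is roughly this shape (a sketch with made-up layer sizes; only the conv → pool → sigmoid output structure matches my actual model):

import tensorflow as tf

# Hypothetical layer sizes, just to show the overall structure.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(120, 1)),                  # 120-sample windows, 1 channel
    tf.keras.layers.Conv1D(16, kernel_size=5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # single sigmoid output for binary classification
])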

Hi Tmosh,

My hidden layer activations are all ReLU. I have previously tried a plain NN on this data and it wasn't yielding results as good as I had hoped, so I am now trying a CNN to see if it does better (I still haven't ruled out reverting to a plain NN). I am using z-score normalization on my data (which leads to a distribution roughly in the range (-4, 4)), and I feed 120 samples of my data at a time into the CNN.
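
For clarity, the preprocessing is essentially just this (a sketch; the random array stands in for my actual recording):

import numpy as np

# Stand-in for the real 1D signal; the actual data is the ECG-like recording.
raw = np.random.randn(1571 * 120).astype(np.float32)

# z-score normalization: zero mean, unit variance
normalized = (raw - raw.mean()) / raw.std()

# Split into non-overlapping windows of 120 samples (2 seconds of data),
# shaped (num_examples, 120, 1) for a Conv1D model.
num_examples = len(normalized) // 120
windows = normalized[: num_examples * 120].reshape(num_examples, 120, 1)
print(windows.shape)  # (1571, 120, 1)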

Even if the hidden layer activations are ReLU, that should not (in principle, anyway) cause your output layer's linear values to be all positive. It just depends on what coefficients are learned in that last FC layer. They could be negative if that helps give a better cost value. So another question is how you have configured your loss function. And I assume you've got samples with a "false" label, meaning that the loss will be high if you predict "true" for everything. It might help to know the size of your training dataset and the number of false and true samples. If the dataset is very unbalanced towards true samples, that might cause a problem like this, at least in theory.

Maybe it would help to actually see the summary of your model. E.g. try this and show us what you get:

print(model.summary())

[Screenshot of model.summary() output]

I am using binary_crossentropy as my loss function. I have 1571 samples in my training dataset, 938 true and 633 false. I recognize that this is unbalanced and have been working on getting more ‘false’ examples.

That's not badly unbalanced; I recommend you focus your attention on the model performance. You likely have enough data.

What is the format of the examples? Are they time-sequences? If so, how many samples are in each example?


At a glance, I think your model might be unnecessarily complicated for the task.


Yes, as Tom says, you've got plenty of negative samples, so it is still a mystery why your output sigmoid values are all > 0.5. It must be something with the output activation functions and the loss function; at least that's my theory. Sorry, I thought we'd be able to see the activations in the "summary" view. Maybe you need to show us your TF code for defining the last 3 dense layers and for how you "compile" your model.
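
One note: model.summary() doesn't display activations, but you can list them per layer with something like this (assuming a Keras model object named model):

# Layers that have an activation expose it via their `activation` attribute;
# layers without one (e.g. pooling, flatten) are reported as "(none)".
for layer in model.layers:
    activation = getattr(layer, 'activation', None)
    print(layer.name, activation.__name__ if activation is not None else '(none)')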

I’m also curious about any pre-processing of the data.

Actually, here’s a new theory that just occurs to me:

Maybe you manually included the sigmoid activation in the output layer, but also passed from_logits = True to the loss function, which would have the effect of including the sigmoid in the loss processing on top of the actual sigmoid that you already included.

I've never tried making that mistake, but I'd have to believe it would not work. In particular, it would make the training not work very well, because there's no way for the gradients to force a "false" answer: what the loss thinks are the "logits" are always > 0.
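
Here's a sketch of what I mean with a toy single-unit model (layer sizes are invented; only the activation/loss pairing matters):

import tensorflow as tf

# Mistake: sigmoid in the model AND from_logits=True in the loss.
# The loss then applies a second sigmoid to values already in (0, 1),
# so the quantities it treats as "logits" are always positive.
wrong = tf.keras.Sequential([
    tf.keras.Input(shape=(120, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
wrong.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Correct pairing (pick one): sigmoid output with from_logits=False (the default),
# or a linear output with from_logits=True.
right = tf.keras.Sequential([
    tf.keras.Input(shape=(120, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
right.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy())  # from_logits=False by default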

Yes, each example has 120 samples (2 seconds of data).

Here is a snapshot of my model and how it's compiled. The learning rate is small, but over 1000 epochs it results in training and CV accuracy of 98%, with both losses around 0.05.

If you’re getting 98% accuracy after training, what are we discussing here?

In terms of preprocessing, I basically just take the raw dataset and perform the z-score normalization. I don't perform any other pre-processing beyond splitting it into the 1571 examples of 120 samples.

Currently it has 75% accuracy on larger testing datasets. I am mostly concerned about whether it is actually learning or not, and the skew of the sigmoid layer outputs concerns me since I have to adjust my threshold to compensate. Instead of the "normal" 0.5 threshold, I have it pushed to 0.625, since the sigmoid outputs values between (0.5, 0.75). I am fairly new to this realm, and although taking this step yields positive results, it doesn't seem like I am building a proper model.

Do not adjust the decision threshold. It’s not necessary.

Getting a 50/50 split in the training set is not necessary.

I’ve seen plenty of systems that work just fine with even a 10/90 split (given you have enough data).

How do you implement that in TF? But as Tom says, I think that's the wrong way to solve the problem: we need to understand why your outputs are only between 0.5 and 0.75. The code you showed above doesn't tell us how you defined the loss function. Normally it is in a "compile" statement like this:

unet.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

But note that you do not want to use from_logits=True as they do in that example.

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

This is the only compile statement I have in my code; it is at the very bottom of the previous snapshot as well.