Hi guys,
For the homework in “Transfer_learning_with_MobileNet_v1”, Exercise 2 - alpaca_model, the last layer is:
```python
# use a prediction layer with one neuron (as a binary classifier only needs one)
outputs = tf.keras.layers.Dense(1)(x)
```
I’m not quite following this. So this time we don’t need a sigmoid? And why just one neuron? I checked tensorflow.org, and it says:
> Apply a `tf.keras.layers.Dense` layer to convert these features into a single prediction per image. You don’t need an activation function here because this prediction will be treated as a logit, or a raw prediction value. Positive numbers predict class 1, negative numbers predict class 0.
but I don’t understand it. Maybe I forgot something. If there is a link about this, please share it and I’ll read it as well. Thank you!
Check the documentation for the loss functions and look at what the `from_logits` argument does. It turns out that it is both more efficient and more “numerically stable” to implement the sigmoid or softmax activation as an internal part of the cost computation, instead of doing it as a separate step. That is why Prof Ng has always done it this way, ever since we switched to TF in C2 W3. Remember that’s how it worked in the `compute_cost` function there.
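Here’s a minimal sketch of how that looks in practice. The pooling and dropout layers below are just a stand-in for the MobileNetV2 base in the assignment; the point is simply where `from_logits=True` goes in the compile step:

```python
import tensorflow as tf

# Stand-in for the MobileNetV2 feature extractor in the assignment;
# only the last Dense layer and the loss setup are the point here.
inputs = tf.keras.Input(shape=(160, 160, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(inputs)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1)(x)   # one neuron, no activation -> raw logit
model = tf.keras.Model(inputs, outputs)

# from_logits=True tells the loss to apply sigmoid internally, fused
# with the cross-entropy, instead of as a separate activation layer.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```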
The term “numerically stable” is not just some handwaving. It is actually a well-defined “term of art” in the field of Numerical Analysis. It turns out that when you are working with finite representations like floating point, instead of the abstract beauty of $\mathbb{R}$, there can be different ways to express the same mathematical formula which have better or worse properties in terms of rounding errors. “Numerical stability” is a precise way to reason about that phenomenon.
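To make that concrete, here is a small float32 illustration with an extreme logit (the values are just for demonstration): computing sigmoid and the log loss as two separate steps overflows and blows up, while the fused op that `from_logits=True` uses rearranges the same math so that nothing overflows:

```python
import tensorflow as tf

z = tf.constant([-200.0])   # an extreme logit, float32
y = tf.constant([1.0])      # true label

# Naive two-step version: tf.exp(200.0) overflows float32 to inf, so the
# sigmoid underflows to exactly 0.0, and log(0) then produces -inf.
p = 1.0 / (1.0 + tf.exp(-z))                                      # -> [0.0]
naive = -(y * tf.math.log(p) + (1.0 - y) * tf.math.log(1.0 - p))  # -> [inf]

# Fused version: algebraically the same loss, rearranged internally so
# nothing overflows. Returns the correct value, ~200.0.
stable = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z)
```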
The reason you only need a single output is precisely that we’re back to doing binary classification, instead of multi-class. The answer is “yes/no”, so you only need one bit to express that, although we end up spending a full floating point value on it (32 bits in TF’s default float32, but still a single number).
Also note that when you use the resulting trained model in “inference” mode to make predictions, you will need to apply sigmoid manually to convert each prediction to a probability of “yes” (it’s an alpaca). Or you can use the fact that sigmoid is monotonic and sigmoid(0) = 0.5, so the prediction is “yes” if the logit output value is > 0.
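For example (a sketch: `model` stands in for your trained alpaca_model, and `image_batch` for a batch of preprocessed images from the assignment):

```python
import tensorflow as tf

# `image_batch` here is random data just to make the sketch runnable;
# in the assignment it would come from your preprocessed dataset.
image_batch = tf.random.uniform((4, 160, 160, 3))
logits = model(image_batch)        # raw logits, shape (4, 1)

# Option 1: convert to probabilities explicitly, then threshold at 0.5.
probs = tf.sigmoid(logits)
pred_from_probs = probs > 0.5

# Option 2: skip sigmoid entirely; since sigmoid is monotonic and
# sigmoid(0) = 0.5, thresholding the logit at 0 gives the same answer.
pred_from_logits = logits > 0.0
```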