Hello. I had two questions while completing this assignment:
Why is the validation accuracy consistently higher than the training accuracy? I believe we did not discuss this case much in the lectures. Are we using too much regularization? Why would this be desirable?
The activation of the output layer is linear. Why is that? I would have expected sigmoid. It seems to be in line with the fact that we are supposed to choose the binary cross entropy loss function with ‘from_logits=True’, but at the same time it looks to me like the labels of the training data are probabilities?!
Given the small size of the datasets used in the exercise, and that we’re not training for very long, the results are pretty good. You can’t read too much into the numbers if the dataset doesn’t have good statistics.
If you have only a single output, you can use a linear output for predicting classifications just fine. Consider that sigmoid() is a monotonic function, so it doesn’t change where the relative threshold between false and true falls. You just have to use the correct threshold: with a sigmoid output, the threshold would be >= 0.5, which is exactly the same as using a linear output with a threshold of >= 0.
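To make that concrete, here is a minimal sketch with made-up logit values (not from the assignment) showing that thresholding the sigmoid output at 0.5 picks exactly the same labels as thresholding the raw linear output at 0:

```python
import numpy as np

# Hypothetical raw outputs (logits) of a single-unit linear output layer.
logits = np.array([-2.3, -0.1, 0.0, 0.4, 3.7])

sigmoid = 1.0 / (1.0 + np.exp(-logits))   # probabilities in (0, 1)
labels_from_probs = sigmoid >= 0.5         # threshold on the sigmoid output
labels_from_logits = logits >= 0.0         # threshold on the linear output

# Because sigmoid is monotonic and sigmoid(0) == 0.5, both rules agree.
assert np.array_equal(labels_from_probs, labels_from_logits)
print(labels_from_logits)   # [False False  True  True  True]
```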
Hey, thanks a lot for your reply! The explanation of the linear activation makes perfect sense. I am not sure it quite answers my first question, though!?
I was not bothered about the absolute levels of accuracy. My confusion comes from the fact that the model apparently performs systematically better on the validation set than on the training set. In my mind, that’s not what should happen!? Are you saying it is just a statistical fluke given that the validation set is small (and might accidentally contain a lot of “easy” to classify images)? Or does it perhaps come from using dropout, which limits training set performance? Or something else?
In short: what are the possible reasons why a model might perform better on the validation set than on the training set? (I feel like this is a question that should have been discussed in course #3 but wasn’t.)
Since you don’t know how the training and validation sets were selected, you can’t really draw any useful conclusions about small differences in performance.
But in general:
If the training, validation, and test sets don’t give similar performance, the reasons could be:
poor statistics due to not enough data.
badly randomized data sets
phase of the moon or the position of the planets. It’s a statistical process, sometimes weird stuff happens. Roll the dice, re-arrange the subsets, and try again.
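If you want to see how much the split alone can move things, here is a rough sketch of “re-arranging the subsets” (purely synthetic data and a simple scikit-learn model, not the assignment’s network): re-split the same data with a few different random seeds and compare the train/validation accuracies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (hypothetical): 300 samples, 10 features,
# label driven by the first feature plus some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Re-split with different seeds and watch the train/validation gap move
# around purely because of which samples land in which subset.
for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    print(f"seed {seed}: train={model.score(X_tr, y_tr):.3f} "
          f"val={model.score(X_val, y_val):.3f}")
```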
Note that when you use the binary cross entropy loss function with from_logits=True, it applies the sigmoid internally. It is done that way because it is a) more efficient (one less call) and b) more numerically stable (e.g., it can handle sigmoid saturation more easily). This is all covered in the documentation for BCE loss.
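For reference, here is a minimal Keras sketch of that pattern (the layer sizes are made up, not the assignment’s architecture): the output layer has no activation, and the sigmoid lives inside the loss.

```python
import tensorflow as tf

# Linear (no-activation) output layer paired with BCE(from_logits=True),
# so the sigmoid is applied inside the loss function.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),   # linear output: raw logits
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    # threshold=0.0 because the model outputs logits, not probabilities
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
)

# At prediction time, apply the sigmoid yourself if you want probabilities,
# or just threshold the logits at 0 for class labels:
# probs = tf.sigmoid(model(x_batch))
```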