I hope I’m not going too far into the weeds here, but there are some more things to say based on the above:
You can add logic to your cost implementation to avoid the NaN issue that occurs when the sigmoid output rounds to exactly 1 or 0 in floating point, which makes the cost evaluate log(0). Here’s a thread about that.
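One common way to do that is to clip the activations slightly away from 0 and 1 before taking the logs. This is only a minimal sketch: the function name `safe_cross_entropy_cost`, the `eps` value, and the assumption that `A` and `Y` are NumPy arrays of the same shape are mine, not something from the thread mentioned above.

```python
import numpy as np

def safe_cross_entropy_cost(A, Y, eps=1e-15):
    """Binary cross-entropy cost that stays finite when A saturates.

    A: sigmoid outputs, Y: 0/1 labels, same shape as A (assumed).
    eps: hypothetical clipping margin; pick it to suit your dtype.
    """
    # Keep every activation inside [eps, 1 - eps] so log() never sees 0.
    A = np.clip(A, eps, 1.0 - eps)
    # Average the per-example loss over all examples.
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Activations that have rounded to exactly 1.0 and 0.0.
A = np.array([[1.0, 0.0, 0.5]])
Y = np.array([[1, 0, 1]])
cost = safe_cross_entropy_cost(A, Y)
```

The clipping changes the cost by a negligible amount for well-behaved activations, but it does cap the penalty for a saturated wrong prediction at roughly -log(eps).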
There are also other common normalization methods that are sometimes used with RGB images:
- Instead of just dividing by 255, you can rescale and shift so that the range is [-1, 1], centered on 0.
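- You could do “mean normalization” (standardization): subtract the mean and divide by the standard deviation, so that the result has \mu = 0 and \sigma = 1. Note that this does not make the distribution Gaussian; it only shifts and rescales it.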
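To make the comparison concrete, here is a rough sketch of the three options in NumPy. The function names are made up for illustration, and the standardization below uses the statistics of the whole array it is given; in practice you would probably compute the mean and standard deviation once on the training set (possibly per channel) and reuse them for the test set.

```python
import numpy as np

def scale_unit(x):
    """Divide by 255 so that values land in [0, 1]."""
    return x.astype(np.float64) / 255.0

def scale_symmetric(x):
    """Rescale and shift so that values land in [-1, 1]."""
    return x.astype(np.float64) / 127.5 - 1.0

def standardize(x, eps=1e-8):
    """Subtract the mean and divide by the standard deviation.

    Uses the statistics of x itself (an assumption for this sketch);
    eps guards against division by zero for constant images.
    """
    x = x.astype(np.float64)
    return (x - x.mean()) / (x.std() + eps)

# Fake batch of 4 RGB images, 8 x 8 pixels, uint8 values in 0..255.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

unit = scale_unit(images)
symmetric = scale_symmetric(images)
standardized = standardize(images)
```

Each of these could be dropped into the preprocessing step of your experiments in place of the plain division by 255.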
You could add those to your suite of experiments and see the results. But the overall point is that simply dividing by 255 gives results as good as any of the others, and it is also the simplest code to write and the cheapest in terms of compute cost. So why not go with the simplest and cheapest?