Activation functions vs optimization method

If you are doing binary classification, the combination of sigmoid and cross-entropy loss is the only reasonable choice. There are simple ways to deal with the “saturation” issue. In exact arithmetic, the output of the sigmoid is never exactly 0 or 1; the problem is the pathetic limitations of floating-point representation. Here’s a thread that shows a couple of simple techniques for avoiding Inf or NaN values in the loss.
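
To make the point concrete, here is a minimal sketch (assuming NumPy; the helper name `stable_bce_with_logits` is my own) of the usual trick: compute the loss directly from the logits with a `log1p` rearrangement, so `exp()` never sees a large positive argument and the loss never blows up to Inf or NaN even when the sigmoid saturates.

```python
import numpy as np

def stable_bce_with_logits(logits, targets):
    """Binary cross-entropy computed from logits rather than probabilities.

    Uses the algebraically equivalent form
        max(z, 0) - z * y + log(1 + exp(-|z|))
    of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))], which keeps the
    argument of exp() non-positive and so cannot overflow to Inf.
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# Example: a logit of 1000 would make a naive log(sigmoid(z)) return -inf,
# but the stable form stays finite.
print(stable_bce_with_logits([1000.0, -1000.0], [1.0, 0.0]))  # ~0.0
print(stable_bce_with_logits([1000.0, -1000.0], [0.0, 1.0]))  # ~1000.0
```

The other common technique, if you already have probabilities, is simply to clip them into `[eps, 1 - eps]` before taking logs; working in logit space as above is the cleaner of the two.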