Interesting questions. I don’t have complete answers for all of them, but here are some thoughts for further discussion:
- There are lots of choices for activation function in the hidden layers. Here’s another recent thread about that. But it is an interesting question whether it would ever make sense to use different activation functions at different hidden layers of the network, e.g. ReLU at the earlier layers and then more expensive functions like tanh later. In all the examples I have seen in the DLS courses, Prof Ng uses the same hidden-layer activation throughout any given network, so I don’t have any experience with that idea. But this is an experimental science! You could try some experiments with this idea (see the first sketch after this list) and see if you learn anything interesting. Please let us know if you try that and what you see!
- I’m not sure what you mean here. I guess there is no reason to believe that you couldn’t come up with a scenario where two different activation functions give pretty similar results. Here again, I don’t know any specific examples from experience. If you try any experiments, let us know.
- Prof Ng presents Logistic Regression first as a “trivial” Neural Network: the output layer of a binary classifier is identical to LR. But the point is that LR is only capable of linear decision boundaries: the solution is a hyperplane in the input space that does the best job of separating the “yes” and “no” answers. So you would expect in principle that a NN can do a better job, because it is capable of non-linear decision boundaries (see the second sketch below). With an NN the cost function is no longer convex, so you have the issue that you may end up in different local minima, but since the output layer by itself is just LR, in principle the NN should do at least as well as LR once you get your hyperparameters nailed down.
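To make the mixed-activation experiment concrete, here is a minimal sketch using the Keras Sequential API. This is just my own illustration, not anything from the course notebooks; the layer sizes and the 20-feature input are placeholder assumptions you would replace with your own setup:

```python
# Sketch of the mixed-activation idea: ReLU in the earlier hidden layers,
# tanh in a later one, sigmoid output for binary classification.
# Layer sizes and input dimension are arbitrary assumptions.
import tensorflow as tf

mixed_model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # assume 20 input features
    tf.keras.layers.Dense(64, activation="relu"),    # cheaper ReLU early
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="tanh"),    # more expensive tanh later
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classifier output
])

mixed_model.compile(optimizer="adam",
                    loss="binary_crossentropy",
                    metrics=["accuracy"])

# mixed_model.fit(X_train, y_train, epochs=20, validation_split=0.2)
# For a fair experiment, train an identical architecture that uses ReLU
# (or tanh) in every hidden layer and compare validation curves.
```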
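And for the last point, here is a quick sketch of the linear vs. non-linear decision boundary comparison. It uses scikit-learn rather than the course code just to keep it short, and the “two moons” dataset and hidden layer size are arbitrary choices on my part:

```python
# Logistic Regression (linear boundary) vs. a small neural network
# (non-linear boundary) on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression().fit(X_train, y_train)           # linear boundary
nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                   random_state=0).fit(X_train, y_train)  # non-linear boundary

print("Logistic Regression accuracy:", lr.score(X_test, y_test))
print("Small NN accuracy:           ", nn.score(X_test, y_test))

# On this kind of data the NN typically wins clearly, because no single
# hyperplane can separate the two interleaved "moons".
```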