In the Week 4 video, there is an explanation of the intuition that different layers might detect different parts of the image. Is there a difference in activation function performance for different stages of face recognition? E.g. ReLU might perform better for separating the face from the rest of the image, while tanh might be better for finding the shape of the face.
Generally, my questions are:
Does it make sense to use different activation functions for the hidden layers within one model?
If we use a more complex function as the activation, might it end up acting like some simpler function?
E.g. if we use Leaky ReLU for logistic regression, will optimization give us weights such that there is effectively no difference between Leaky ReLU and ReLU?
If I have a problem that can be solved perfectly with logistic regression, does that mean that solving it with deep learning will give the same accuracy but require more computational power?
Interesting questions. I don’t have complete answers for all of them, but here are some thoughts for further discussion:
There are lots of choices for activation function in the hidden layers. Here’s another recent thread about that. But it is an interesting question of whether it would ever make sense to use different activation functions at different hidden layers of the network, e.g. ReLU at the earlier layers and then more expensive functions like tanh later. In all the examples I have seen in the DLS courses, Prof Ng is always consistent in the use of hidden layer activations in any given network. So I don’t have any experience with that idea. But this is an experimental science! You could try some experiments with this idea and see if you learn anything interesting. Please let us know if you try that and what you see!
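For example, here is a minimal sketch of that kind of experiment (my own illustration, using TensorFlow/Keras rather than the raw numpy code from the course): a binary classifier that uses ReLU in the earlier hidden layers and tanh in a later one. You could train it and an otherwise identical all-ReLU model on the same data and compare the results.

```python
# Hypothetical sketch, not from the course: a small Keras model that mixes
# hidden-layer activations -- ReLU early, tanh later -- for comparison
# against an all-ReLU baseline trained on the same data.
import tensorflow as tf

def build_mixed_activation_model(input_dim):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),    # earlier layer: ReLU
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="tanh"),    # later layer: tanh
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
    ])

model = build_mixed_activation_model(input_dim=12288)  # e.g. flattened 64x64x3 images
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20)  # then compare against an all-ReLU twin
```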
I’m not sure what you mean here. I guess there is no reason to believe that you could not come up with a scenario where two different functions give pretty similar results. Here again, I don’t know any specific examples from experience. If you try any experiments, let us know.
Prof Ng presents Logistic Regression first as a “trivial” Neural Network. The output layer of a binary classifier is identical to LR. But the point is that LR is only capable of linear decision boundaries: the solution is a hyperplane in the input space that does the best job of separating the “yes” and “no” answers. So you would expect in principle that a NN can do a better job, because it is capable of non-linear decision boundaries. But with an NN, the cost function is no longer convex, so you have the issue that you may find different local minima. So in principle the NN should do at least as well as LR once you get your hyperparameters nailed.
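To make the linear vs. non-linear point concrete, here is a toy comparison (my own example using scikit-learn, not anything from the course): on a dataset that is not linearly separable, LR is stuck with a linear decision boundary, while even a small NN can fit the curved one.

```python
# Toy illustration: logistic regression vs. a small NN on non-linearly
# separable data. The exact accuracies will vary; the point is the gap.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

lr = LogisticRegression().fit(X, y)
nn = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000,
                   random_state=0).fit(X, y)

print("Logistic regression accuracy:", lr.score(X, y))  # limited to a linear boundary
print("Small NN accuracy:", nn.score(X, y))             # can fit the curved boundary
```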
Hi @Nick2, to your first question, yes, we can use different activation functions for different neurons in the same layer. Different activation functions introduce different non-linearities, which may suit a particular problem better. Otherwise, the hidden layers generally all use the same activation function. In this course, Prof Ng mostly uses the ReLU activation function, as it is the most common choice for hidden layers: it is cheap to compute and avoids some of the limitations of other activation functions, such as the vanishing gradients you get with sigmoid and tanh.
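Just to illustrate what mixing activations within a single layer could look like mechanically (purely my own sketch with the Keras functional API, not something taken from a paper or from the course), you could split the layer's units into groups with different activations and concatenate the results:

```python
# Illustrative sketch only: one "hidden layer" whose units are split
# between ReLU and tanh activations, then concatenated.
import tensorflow as tf

inputs = tf.keras.Input(shape=(20,))
relu_part = tf.keras.layers.Dense(8, activation="relu")(inputs)
tanh_part = tf.keras.layers.Dense(8, activation="tanh")(inputs)
hidden = tf.keras.layers.Concatenate()([relu_part, tanh_part])  # 16 mixed units
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
model = tf.keras.Model(inputs, outputs)
```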
The activation function used in the hidden layers is mainly chosen based on the kind of NN architecture. Modern networks with common architectures such as MLPs (Multilayer Perceptrons) and CNNs generally use ReLU, whereas RNNs tend to use tanh or sigmoid activations.
The output layer typically uses a different activation function from the hidden layers, depending on the kind of predictions the model needs to make. As a general rule, either the sigmoid or the softmax activation function is used there.
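For example (again just a Keras sketch of that usual convention, not code from the course):

```python
import tensorflow as tf

# Binary classifier head: one sigmoid unit, paired with binary cross-entropy.
binary_head = tf.keras.layers.Dense(1, activation="sigmoid")

# Multi-class head: one softmax unit per class, paired with categorical cross-entropy.
multiclass_head = tf.keras.layers.Dense(10, activation="softmax")
```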
@Rashmi, for starters that is not the question Nick2 asked. The question was “do you need to use the same activation function in all the hidden layers or can you use different functions in different layers?” It is not whether you can use different functions within the same layer. And I have never heard of anyone doing the latter. Do you have a reference to any paper in which they discuss that?
After going through your fantastic explanations, I tried to understand this problem more comprehensively through a web search. The articles I found discussed the importance of using different activations in the same or different layers to get the required output, although there is no paper that describes this case specifically. I just wanted to share what I read there and how it affects the results.
Nick’s first question prompted me to look into the underlying question of how using different activations in the same layer or in different layers can change the behaviour of the neural network architecture.
I couldn’t justify the reasons more comprehensively, but I tried to bring a few facts into focus that could help the NN architecture produce the desired results.