Every neuron in a neural network has an activation function.
I don’t understand why, when building a neural network model, the activation function is chosen per layer and not per neuron.
Why can’t I have different activation functions in the same layer?
Hi @KiraDiShira,
Let’s take a step back: if you run a model with no activation functions, you essentially get a linear regression model, where the data passes through the nodes and layers as the linear function ax + b. This means the output depends linearly on the input; it is almost as if the entire network had collapsed into a single layer.
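A minimal NumPy sketch of that collapse (the weight shapes are made up for illustration): two stacked linear layers compose into a single linear map, so without activations the "deep" network is no more expressive than one layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map collapses into one linear layer: W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```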
Activation functions introduce ‘non-linearity’ to the output of the neurons of each layer, allowing the network to learn complex patterns.
Regarding the ‘same’ activation function: you can define different activation functions per layer. In fact, it is quite common to use, say, ReLU for the hidden layers and a sigmoid for the output layer in a binary classifier.
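As a toy forward pass (sizes are made up), this is what "one activation per layer" looks like: every unit in a layer gets the same non-linearity, and different layers can use different ones.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))                      # one input example
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)  # hidden layer weights
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)  # output layer weights

h = relu(W1 @ x + b1)      # hidden layer: ReLU applied to every unit
y = sigmoid(W2 @ h + b2)   # output layer: sigmoid for binary classification

print(0.0 < y[0] < 1.0)  # True — the sigmoid output is a probability-like score
```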
When using a framework like Keras, you define the activation function per layer. I have not seen a model that uses different activation functions per neuron inside a layer, though I suppose you could custom-program something like that. I don’t think the added complexity would make much difference. But again, at the layer level, you can define different activation functions.
@rmwkwok would you like to add something on this?
What would be the point of using different activation functions within the same layer? It presupposes that you know in advance that different neurons should behave differently somehow. But we start with random initializations of the weights and then run back propagation, and there is no guarantee that the same neuron will learn the same behavior if you run the training again with a different initialization. In other words, the same things will most likely be learned, but you don’t know a priori which neuron will learn them.
From a purely theoretical standpoint, you could do it your way with a selection of activation functions per neuron, but the computations become more complicated. Don’t forget that you need to track this for back propagation as well. Or maybe the way to think of this is to have some layers that are “split”, producing two outputs that are then merged into the next layer. You’ll see architectures like that when we get to Residual Nets and MobileNet in Course 4. But even there, Prof Ng doesn’t say anything about using different activation functions within a given layer.
But this is an experimental science. Maybe your idea actually has potential and no-one else has been clever enough to think of it or brave enough to try it before. You can try implementing that and then construct some experiments to see how it works. If it is really better at least in some cases, publish the paper and it will be “Your Name in Lights!”
Hello @KiraDiShira, and Juan,
Of course we can do that. For example, we can create one Dense layer that uses ReLU and another Dense layer that uses LeakyReLU, and then concatenate them to form “one layer”.
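A minimal NumPy sketch of that split-and-concatenate idea (shapes and the leaky slope are arbitrary): two parallel "Dense" blocks receive the same input, apply different activations, and their outputs are concatenated into what downstream layers see as one layer with mixed activations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.3):
    return np.where(z > 0, z, alpha * z)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))  # batch of 5 examples, 8 features

# Two parallel "Dense" blocks fed the same input, 4 units each
W_a, b_a = rng.normal(size=(8, 4)), np.zeros(4)
W_b, b_b = rng.normal(size=(8, 4)), np.zeros(4)
h_a = relu(x @ W_a + b_a)        # first block: ReLU
h_b = leaky_relu(x @ W_b + b_b)  # second block: LeakyReLU

# ...concatenated into "one layer" of 8 units with mixed activations
h = np.concatenate([h_a, h_b], axis=1)
print(h.shape)  # (5, 8)
```

In Keras the same wiring would use two `Dense` layers and a `Concatenate` layer in the functional API.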
However, I think the question is whether we want to do that. The most reliable way to find out is, of course, to try it, evaluate, and compare. However, I don’t recall any evidence/research that supports such an idea. Also, a group of ReLUs (or its variants) is already a very good candidate for fitting any curve in a piecewise-linear way, so we probably don’t need other activations to chip in for that purpose. Having said that, if I had some insight that supported the idea, I would definitely give it a try, but this is really problem-specific and we can’t discuss it in general.
Anyway, we can
Amazing - thanks @rmwkwok - very clever