It’s great to hear that you find Prof Ng’s course(s) well structured and useful. I can claim no credit for that: the mentors are just fellow students who volunteer their time to answer questions here on the forums, but we had nothing to do with the creation or content of the courses.
I think you are trying too hard here. What is “a carry” in your sense? Like a “carry” in successive addition? Before we “go there”, let’s consider the behavior and implications of some of the commonly used activation functions that Prof Ng has shown us so far.
Consider the following 4 examples: ReLU, Leaky ReLU, tanh and sigmoid. Look at their graphs. What can we conclude from them?
They are all monotonic functions. If z_2 \geq z_1, then you can conclude that g(z_2) \geq g(z_1). But note that monotonicity is not strictly required of an activation: swish, for example, is not monotonic.
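To see that concretely, here is a quick numpy sketch (the function definitions are mine, written from the standard formulas rather than taken from the course notebooks) that checks monotonicity numerically on a grid:

```python
import numpy as np

# Standard formulas for the activations discussed above (illustrative only)
def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):            # slope is the hyperparameter you choose
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

z = np.linspace(-6, 6, 1001)
for name, g in [("ReLU", relu), ("Leaky ReLU", leaky_relu),
                ("tanh", np.tanh), ("sigmoid", sigmoid), ("swish", swish)]:
    print(name, "monotonic on [-6, 6]:", bool(np.all(np.diff(g(z)) >= 0)))
# The first four print True; swish prints False, because it dips down to a
# minimum around z ≈ -1.28 before rising again.
```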
Beyond that, they have somewhat different behaviors:
ReLU acts like a “high pass” filter: it just drops all negative input values and replaces them with 0.
Leaky ReLU doesn’t drop negative values, but reduces their absolute values to some degree based on the slope you choose (a hyperparameter).
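To make that contrast concrete, here is a tiny sketch (the input values are arbitrary, and the 0.01 slope is just one possible choice of the hyperparameter):

```python
import numpy as np

z = np.array([-3.0, -1.0, -0.1, 0.0, 0.5, 2.0])

relu_out  = np.maximum(0, z)              # negatives are dropped to 0
leaky_out = np.where(z > 0, z, 0.01 * z)  # negatives are scaled down by the slope

print(relu_out)    # -> 0, 0, 0, 0, 0.5, 2
print(leaky_out)   # -> -0.03, -0.01, -0.001, 0, 0.5, 2
```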
Both tanh and sigmoid have a very similar shape (and in fact are quite closely related mathematically: tanh(z) = 2 \cdot sigmoid(2z) - 1): they have “flat tails” as |z| \rightarrow \infty, so they make large values more or less interchangeable with each other above some threshold of |z|. They also “clamp” all the values between -1 and 1 (for tanh) or 0 and 1 (for sigmoid). So we can interpret the output of sigmoid as a probability and say that it is predicting “True” if the output is > 0.5.
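Here is a small numerical illustration of both points, the flat tails and the thresholding at 0.5 (the input values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-20.0, -4.0, 0.0, 4.0, 6.0, 20.0])
a = sigmoid(z)

print(np.round(a, 4))   # -> [0.     0.018  0.5    0.982  0.9975 1.    ]
print(a > 0.5)          # -> predicts "True" exactly where z > 0

# Flat tails: z = 6 and z = 20 produce almost identical outputs, even though
# the inputs differ by a factor of more than 3.
```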
For the output layer of a network, the choice of activation is fixed by the purpose of your network: if it is a binary classifier (“Yes/No”, “Cat/Not a cat”), then you always use sigmoid. If it is a multiclass classifier (identifying one of a number of animals or objects or …), then you use softmax, which is the multiclass version of sigmoid. It gives you a probability distribution over the possible output classes.
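As a quick sketch of what that means in practice (the raw scores here are made up, and this softmax implementation is just the standard formula, not copied from the course):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, -1.0, 0.5])   # raw scores for 4 hypothetical classes
a = softmax(z)

print(np.round(a, 3))   # -> roughly [0.61  0.224 0.03  0.136]
print(a.sum())          # -> 1.0 (up to floating point): a probability distribution
print(np.argmax(a))     # -> 0: the predicted class is the most probable one
```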
Finally getting back to your actual question:
How about this as a way to describe what you are getting at there: the activation function provides a form of “interpretation” of the input. How that interpretation looks depends on the function. In the case of ReLU, the “interpretation” is: only positive values are interesting. In the case of Leaky ReLU, the interpretation is “negative values should have less effect than positive ones”: instead of just dropping them as ReLU does, Leaky ReLU just “tones them down”. In the case of tanh and sigmoid, the “interpretation” is clamping the values to a fixed range, but in a monotonic way. The net effect of that is that the differences in values become a lot less significant the farther away from the origin you are. In the particular case of sigmoid, that gives you the final level of “interpretation”: mapping the input to a probability.