General Question

It’s an interesting question. Trig functions are most likely not useful. Using a periodic function means that lots of different inputs, possibly far apart, produce exactly the same output. How would that be useful in this case? Notice that most of the activation functions we have seen are monotonic non-decreasing. I don’t think that is strictly necessary, though, since swish is commonly used and has one region where it decreases a bit (see the sketch below).
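Just to make that swish point concrete, here’s a quick NumPy sketch (my own illustration, not course code) that checks the dip numerically: swish(x) = x * sigmoid(x) has a shallow minimum slightly below zero around x ≈ -1.28, so it is not monotonic non-decreasing, even though it works well in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # swish(x) = x * sigmoid(x); has a shallow dip for negative x
    return x * sigmoid(x)

x = np.linspace(-5.0, 5.0, 1001)
y = swish(x)

# The minimum sits around x ≈ -1.28, a bit below zero, so the function
# decreases on part of the negative axis before rising again.
i = np.argmin(y)
print(f"swish minimum ≈ {y[i]:.4f} at x ≈ {x[i]:.2f}")
print("monotonic non-decreasing?", bool(np.all(np.diff(y) >= 0)))  # False
```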

Note that log is monotonically increasing, but it can’t handle negative inputs. That points out another characteristic we need: since the pre-activation z = Wx + b can be any real number, the domain of the activation function needs to be all of $(-\infty, \infty)$.
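Here’s a tiny check of that domain point (again just my own NumPy sketch, nothing from the course): log gives nan (or -inf) for non-positive inputs, while the activations we actually use accept any real-valued input.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5])   # pre-activations can be any real number

with np.errstate(invalid="ignore", divide="ignore"):
    print("log:    ", np.log(z))         # nan for negatives, -inf at 0
print("tanh:   ", np.tanh(z))            # defined for every real input
print("sigmoid:", 1.0 / (1.0 + np.exp(-z)))
print("relu:   ", np.maximum(0.0, z))
```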

Also note that the activation for the output layer is well defined: for a binary classifier, we need sigmoid, because a) we need the output to look like the probability of “yes” and b) sigmoid and the cross-entropy loss function are tied together. For multi-class classifiers we use softmax, which you can think of as the generalization of sigmoid to more than two classes; the cross-entropy loss works with softmax as well.
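If it helps, here’s a small NumPy sketch (my own, not from any assignment) of the “softmax generalizes sigmoid” point: a two-class softmax over the logits [z, 0] gives exactly the same “yes” probability as sigmoid(z), and the cross-entropy loss is then computed directly from that probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift logits for numerical stability
    return e / e.sum()

z = 0.7                         # a single logit for the "yes" class

# Two-class softmax over [z, 0] reproduces sigmoid(z) for class "yes":
p_softmax = softmax(np.array([z, 0.0]))[0]
p_sigmoid = sigmoid(z)
print(p_softmax, p_sigmoid)     # both ≈ 0.668

# Cross-entropy loss for a true label y in {0, 1}, using the sigmoid output:
y = 1
loss = -(y * np.log(p_sigmoid) + (1 - y) * np.log(1.0 - p_sigmoid))
print("binary cross-entropy:", loss)
```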

In the hidden layers, we can choose whatever works from experience. That is the high-level point: what we’re seeing is the result of many years of experimentation, and these are the functions that have been found to work well. But this is an experimental science: if you have some new ideas, give them a try and see what happens. Maybe you’ll find something new that works even better. Write the paper and it’ll be your name in lights! :nerd_face:

Here’s a thread which talks about how the choice of hidden layer activation works.

Just on the general topic, here’s a thread about the fact that tanh and sigmoid are actually quite closely related mathematically.
