It’s an interesting question. Trig functions are most likely not useful here. Using a periodic function means you’re saying that many different inputs, possibly far apart, produce the same output. How would that be useful in this case? Notice that most of the activation functions we have seen are monotonic non-decreasing. I don’t think that is strictly necessary, since *swish* is commonly used and has one region where it decreases a bit.
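To make both points concrete, here is a small sketch using only the Python standard library. It assumes the common form of *swish* with the scaling parameter fixed at 1, i.e. x times sigmoid(x); the helper names are my own for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Assumption: swish with beta = 1, i.e. x * sigmoid(x).
    return x * sigmoid(x)

# Periodic function: two inputs a full period apart give the same output,
# so the next layer cannot tell them apart.
a = math.sin(0.5)
b = math.sin(0.5 + 2 * math.pi)
print(f"sin(0.5) = {a:.6f}, sin(0.5 + 2*pi) = {b:.6f}")

# Swish: the value falls as x moves from -3 toward roughly -1.28,
# then rises from there on.
for x in (-3.0, -2.0, -1.0, 0.0, 1.0):
    print(f"swish({x:+.1f}) = {swish(x):+.4f}")
```

The dip in *swish* is small and confined to one region of negative inputs, which is quite different from a sine wave repeating the same values forever.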

Note that *log* is monotonic increasing, but it is not defined for zero or negative inputs. So that points out another characteristic that we need: the domain of the function needs to be $(-\infty, \infty)$.
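Here is a quick way to see the domain problem, sketched with the standard library. `safe_log` is a hypothetical helper I made up for the demonstration, and *ReLU* is included only as a contrast because it accepts any real input.

```python
import math

def relu(x):
    # Defined for every real input.
    return max(0.0, x)

def safe_log(x):
    # Hypothetical helper: returns None where log is undefined.
    try:
        return math.log(x)
    except ValueError:
        return None

# A linear layer can produce any real number, including negatives and zero.
for z in (-2.0, 0.0, 3.0):
    print(f"z = {z:+.1f}  relu = {relu(z):.4f}  log = {safe_log(z)}")
```

Since the linear part of a layer can output any real value, an activation that fails on part of the real line would break the forward pass on some inputs.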

Also note that the activation for the output layer is well defined: for a binary classifier, we need *sigmoid*, because a) we need the output to look like the probability of “yes” and b) *sigmoid* and the cross entropy loss function are tied together. For multi-class classifiers we use *softmax*, which you can think of as the generalization of *sigmoid*, and the cross entropy loss function works with *softmax* as well.
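Both claims can be checked numerically. The sketch below is my own illustration, not code from the course: it shows that a two-class *softmax* over the logits [z, 0] reproduces *sigmoid*(z), and that the derivative of the binary cross entropy loss with respect to z collapses to *sigmoid*(z) minus the label, which is one way to read "tied together". The function names and the sample values of `z` and `y` are assumptions chosen for the demo.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bce_loss(z, y):
    # Binary cross entropy applied to a sigmoid output.
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y = 0.7, 1.0  # arbitrary sample logit and label

# Softmax over [z, 0] gives the same "yes" probability as sigmoid(z).
two_class = softmax([z, 0.0])[0]

# Central finite difference of the loss versus the closed form sigmoid(z) - y.
eps = 1e-6
numeric_grad = (bce_loss(z + eps, y) - bce_loss(z - eps, y)) / (2 * eps)
analytic_grad = sigmoid(z) - y

print(two_class, sigmoid(z))
print(numeric_grad, analytic_grad)
```

The simple form of that gradient is the practical payoff of pairing *sigmoid* with cross entropy, and the same pattern holds for *softmax* with the multi-class version of the loss.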

In the hidden layers, we can choose whatever works from experience. That is the high level point: what we’re seeing is the result of a lot of years of experimentation and these are the functions that have been found to work. But this is an experimental science: if you have some new ideas, give them a try and see what happens. Maybe you’ll find something new that works even better. Write the paper and it’ll be your name in lights!

Here’s a thread which talks about how the choice of hidden layer activation works.

Just on the general topic, here’s a thread about the fact that *tanh* and *sigmoid* are actually quite closely related mathematically.
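I can't reproduce the thread here, but the standard identity behind that relationship is tanh(x) = 2 * sigmoid(2x) - 1, meaning *tanh* is a rescaled and shifted *sigmoid*. A minimal check, with helper names of my own choosing:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh written as a scaled and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

samples = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in samples)
print(f"largest difference over the samples: {max_gap:.2e}")
```

The difference should be at the level of floating point rounding, which is what you would expect if the two expressions are the same function.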