Significance of sigmoid in an update gate of LSTM cell

Looking at the update gate of an LSTM cell, I cannot grasp the reason for its existence in the first place. I have read several explanations from different sources, and they all boil down to the sigmoid output being the filter/multiplier for the tanh output. This is fine from an intuitive standpoint; however, it doesn’t explain why a sole tanh activation wouldn’t produce the same output.

In other words, why do we have a tanh activation multiplied by sigmoid activation, and not just tanh activation? Is it easier for tanh * sigmoid to learn than a single tanh?

Hi @gokturk.gezer

You might want to take a look at this thread. In short:

The tanh mask, with its -1 to +1 outputs, determines whether to decrement or increment items in the cell state. The sigmoid mask determines whether an item should be updated at all, similar to the forget gate.
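To make the two roles concrete, here is a tiny NumPy sketch of the update-gate part of the cell-state update. The variable names and pre-activation values are purely illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations for one time step (illustrative values).
z_input = np.array([-4.0, 0.5, 6.0])   # pre-activation of the input/update gate
z_cand  = np.array([ 2.0, -3.0, 1.0])  # pre-activation of the candidate values

i_t = sigmoid(z_input)  # "should this item be updated at all?" -> range (0, 1)
g_t = np.tanh(z_cand)   # "in which direction, and how far?"   -> range (-1, 1)

update = i_t * g_t      # what actually gets added to the cell state
```

Because `i_t` stays in (0, 1), the product can only shrink (or pass through) the candidate values, never flip their sign; the sign comes from `g_t` alone.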



I also found this was asked here previously: LSTM architecture - #2 by anon57530071

It seems the combination of tanh and sigmoid is questioned by others, although from a different angle. I was curious why we need sigmoid, whereas the others question why tanh is needed. Unfortunately, I’m still confused after reading explanations on these threads.

To clarify where I stand, I understand the fundamental function of the sigmoid gate and realize that both activations have different weights. I’m also aware that the forget gate has a similar computation, so my question stands for both.

My trouble stems from the fact that multiplying tanh output with a sigmoid output doesn’t change the output range of the initial tanh function. So why couldn’t a tanh function alone learn to output the same? I understand that would fundamentally change the LSTM so that’s why I’m interested in understanding the mathematical significance of sigmoid, and not its designed purpose.

  • Would it take many more iterations to train a single tanh to learn to output the same state?
  • Would a single tanh, without being multiplied by a sigmoid output, not exhibit the abstract generalization power of LSTMs and end up just overfitting the training set?

Let me offer an overly simplistic analogy:

  • sigmoid - like a normal door - open or closed (1 or 0) - you can go through at the same speed or stop;
  • tanh - like a revolving door - change direction or do not change direction (-1 or 1) - you can go through at the same speed or go backwards at the same speed;

It is “hard” for sigmoid outputs to linger at 0.5, and it is “hard” for tanh outputs to linger at 0; values tend to saturate toward the extremes, hence the different properties of these functions that are useful.
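A quick NumPy check of this saturation behavior, evaluating both functions on a few hypothetical pre-activation values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative pre-activations, from strongly negative to strongly positive.
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

s = sigmoid(z)   # saturates toward 0 or 1; exactly 0.5 only at z = 0
t = np.tanh(z)   # saturates toward -1 or +1; exactly 0 only at z = 0
```

Even modest pre-activations like ±2 already push sigmoid close to its 0/1 extremes and tanh close to ±1, which is why the gates behave like (mostly) open or closed doors in practice.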

I’m not a big fan of ML community names, so below I provide some very simple calculations to make things concrete.

If we take the forget gate (and the “long” memory C_t), then sigmoid is the better choice, since we either want to forget or not forget (0 or 1; closed or open door).
For example, if at the current time step t the combination of input (X_t) and hidden state (H_t) “tells” us that we need to “forget” some values, we use sigmoid. As an overly simple concrete example, at step 3, if F_3 is [-20, -15, 10, 20], then after sigmoid we get approximately [0, 0, 1, 1]. If our C_2 was [-0.7, -0.11, -0.07, 1.06], we would continue calculations with [0, 0, -0.07, 1.06] instead of the whole C_2.
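The forget-gate example above can be reproduced in a couple of lines of NumPy (same illustrative numbers as in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F_3 = np.array([-20.0, -15.0, 10.0, 20.0])  # forget-gate pre-activation at step 3
C_2 = np.array([-0.7, -0.11, -0.07, 1.06])  # previous cell state

gate   = sigmoid(F_3)   # ~[0, 0, 1, 1]: which items to keep
C_kept = gate * C_2     # ~[0, 0, -0.07, 1.06]: the surviving cell state
```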

If we take the output gate (used for “short” memory or hidden state H_t), then we use the combination of the two:
Whenever the output gate is close to 1, we allow the “long memory” (C_t) to impact the subsequent layers uninhibited, whereas for output gate values close to 0, we prevent the current memory from impacting other layers of the network at the current time step. Note that a memory cell can accrue information across many time steps without impacting the rest of the network (so long as the output gate takes values close to 0), and then suddenly impact the network at a subsequent time step as soon as the output gate flips from values close to 0 to values close to 1.
Note that tanh is the better choice for the “long” memory interaction, since it allows for a more “interesting” multiplication of the two (the subsequent hidden state can have values from -1 to 1, instead of from 0 to 1).
Continuing the overly simple concrete example, let’s calculate H_3 and assume the output gate (O_3) is, for example, [0, 1, 0, 1]. By now C_3 has accumulated values from the input gate and can contain a variety of values (not just from 0 to 1), so C_3 could have become [-35, -48, -35, 12]. Applying tanh results in [-1, -1, -1, 1], while sigmoid would have squashed the result more (to [0, 0, 0, 1]). The resulting vector can therefore have a more “interesting” interaction with the output gate, with the resulting H_3 being [0, -1, 0, 1] (instead of the sigmoid version’s [0, 0, 0, 1]).
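The same output-gate comparison in NumPy, using the illustrative numbers from the text, shows the difference between squashing C_3 with tanh versus sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

O_3 = np.array([0.0, 1.0, 0.0, 1.0])         # output gate (already after sigmoid)
C_3 = np.array([-35.0, -48.0, -35.0, 12.0])  # cell state after the input-gate update

H_3_tanh    = O_3 * np.tanh(C_3)   # ~[0, -1, 0, 1]: sign information survives
H_3_sigmoid = O_3 * sigmoid(C_3)   # ~[0,  0, 0, 1]: negative values are lost
```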

In reality values do not go that extreme and allow a “flow” of information instead of just closed or open, or flipped or not flipped.

So in the end to answer your questions concretely:

On whether a single tanh would take many more iterations to learn the same state: most probably, or it might even never learn it.

I’m not sure I understand how overfitting is related here: the parameter count and your dataset are the main driving forces behind overfitting, not the activation functions. Or am I missing some relation here?


Thank you @arvyzukai.
This makes sense. It especially helped to visualize that tanh and sigmoid each move away from certain values faster than others. Additionally, the activation could be something other than tanh, so having a sigmoid gate makes sense as a generic way to keep or throw away certain states.

As for my comment about overfitting, I think it contradicts itself. I believe that when you throw away the sigmoid, you’d be reducing the parameter count of the model, hence decreasing the chances of overfitting.


I’m happy to help @gokturk.gezer because these are good questions :+1:

I want to clarify this point - “I believe when you throw away the sigmoid, you’d be reducing the parameter count of the model” - that is not true. Changing or removing an activation function does not change the parameter count: the activation function is applied element-wise to the outputs of the network’s layers and does not affect the number of weights or biases.
Changing the activation function may affect the behavior of the model, potentially leading to different learning dynamics and performance. However, it does not alter the number of weights or biases and therefore does not directly impact the parameter count of the model.
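A small sketch to illustrate the point. A single gate computes activation(W·x + U·h + b); the sizes of W, U, and b, and hence the parameter count, are fixed before any activation is chosen (the dimensions below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3  # illustrative sizes

# One gate's parameters; these exist regardless of the activation used.
W = rng.normal(size=(n_hidden, n_in))
U = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def gate(x, h, activation):
    # The activation is applied element-wise to the same affine output.
    return activation(W @ x + U @ h + b)

x = rng.normal(size=n_in)
h = np.zeros(n_hidden)

out_tanh = gate(x, h, np.tanh)
out_sig  = gate(x, h, lambda z: 1.0 / (1.0 + np.exp(-z)))

n_params = W.size + U.size + b.size  # identical for either activation
```

Swapping `np.tanh` for a sigmoid changes the outputs but touches none of W, U, or b, so `n_params` is the same either way.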


When converting tanh * sigmoid to a single tanh, wouldn’t we get rid of the weights associated with the sigmoid gate and thus reduce the internal parameter count the model needs to learn?