I’m still in week 1 of the Deep Learning Specialization’s first course, and I have a short question about why we now prefer ReLU activation functions over sigmoid (I don’t know yet whether this gets answered later in the course). From the lecture, I understood that sigmoid leads to vanishing gradients (very small gradients), which makes learning slower and more costly, unlike ReLU. My question is: how?

A second question: aren’t these two functions semantically different? Sigmoid is usually used to predict probabilities (as in logistic regression), unlike ReLU. Why compare these two at all?
Advantage of ReLU:
ReLU is extremely cheap to compute: if the input is positive, the output is the input and the slope is 1; if it’s negative, the output and slope are both 0. So ReLU units are quite efficient. Sigmoid() requires computing an exponential, so it takes more math cycles.
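To make that concrete, here’s a minimal NumPy sketch (just an illustration, not code from the course) comparing the two activations and their derivatives. It also hints at the vanishing-gradient question above: the sigmoid derivative σ(z)(1 − σ(z)) is at most 0.25, so multiplying such factors backward through many sigmoid layers shrinks the gradient, whereas ReLU passes a gradient of exactly 1 on any positive input.

```python
import numpy as np

def sigmoid(z):
    # sigmoid needs an exponential: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative sigma(z) * (1 - sigma(z)); its maximum is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    # ReLU is just a comparison: max(0, z)
    return np.maximum(0.0, z)

def relu_grad(z):
    # slope is 1 for positive inputs, 0 for negative inputs
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("sigmoid grad:", sigmoid_grad(z))   # never larger than 0.25
print("relu grad:   ", relu_grad(z))      # exactly 1 wherever z > 0
```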
Disadvantage of ReLU:
Unfortunately, ReLU outputs zero for every negative input (and its gradient is zero there too). So you typically need more ReLU units to do the job a single sigmoid() unit can do, because some of the units have to learn negative weights so that, between them, they still produce an output for negative inputs.
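A toy example of that last point (again, just a sketch, not course code): one ReLU unit is silent on the whole negative half of its input, but two ReLU units with opposite-sign weights can cover both sides between them, which is roughly what a single sigmoid() unit manages on its own.

```python
import numpy as np

z = np.linspace(-2, 2, 5)   # [-2, -1, 0, 1, 2]

# one ReLU unit with weight +1: silent for all negative inputs
single = np.maximum(0.0, z)

# two ReLU units with weights +1 and -1, summed by the next layer:
# together they respond on both sides of zero
pair = np.maximum(0.0, z) + np.maximum(0.0, -z)   # equals |z|

print(single)  # [0. 0. 0. 1. 2.]
print(pair)    # [2. 1. 0. 1. 2.]
```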
In addition to Tom’s points, note that ReLU and sigmoid are not the only choices for activation functions in the hidden layers of a neural network. In the assignment in C1 W3, you’ll also see tanh used. At the output layer, your choices are constrained by what the output of the network means: if you are implementing a classifier, then the output layer will use either sigmoid (for binary classification) or softmax (for multi-class classification).
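As a small sketch of the output-layer point (made-up layer outputs, not the assignment’s code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# binary classifier: one output unit, sigmoid gives P(y = 1)
z_binary = 0.8
print(sigmoid(z_binary))      # a single probability in (0, 1)

# 4-class classifier: one unit per class, softmax gives a distribution
z_multi = np.array([1.2, -0.3, 0.5, 2.0])
print(softmax(z_multi))       # four probabilities that sum to 1
```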
Please “hold that thought” and Prof Ng will discuss these issues more as you complete DLS C1 and in the later DLS courses (C2, C4 and C5).