I tried the other alternatives in Week 3 by swapping the activation function of the hidden layer for a sigmoid and a ReLU. Both performed horribly. Here is the output of the ReLU version:
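For anyone who wants to try the same experiment: the swap amounts to something like the following in the Week 3 forward pass. This is only a sketch assuming the notebook’s W1/b1/W2/b2 naming; the hidden_activation switch is my own illustration, not the exact assignment code.

```python
import numpy as np

def forward_propagation(X, parameters, hidden_activation="tanh"):
    """Forward pass with a configurable hidden-layer activation (sketch)."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]

    Z1 = np.dot(W1, X) + b1
    if hidden_activation == "tanh":
        A1 = np.tanh(Z1)
    elif hidden_activation == "sigmoid":
        A1 = 1 / (1 + np.exp(-Z1))
    else:                                 # "relu"
        A1 = np.maximum(0, Z1)

    Z2 = np.dot(W2, A1) + b2
    A2 = 1 / (1 + np.exp(-Z2))            # output layer stays sigmoid for binary labels

    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache
```

(The matching derivative in backprop needs to change as well.)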
Hi @whnr. Your extracurricular experimentation is commendable and great for building insight and understanding. Keep it up.
To be sure that I understand the nature of your experiment: you wanted to see how the performance (accuracy) would change if you substituted the hyperbolic tangent (tanh) activation with two alternatives: sigmoid, and then ReLU.
First off, when changing architectures, one must be mindful of the hyperparameter settings, in this case the learning rate. You want to make sure that the cost (with respect to the number of iterations) is decreasing, more or less monotonically, to something very close to the minimum in all cases. We need a clean comparison. Let’s assume that you took care of that.
Let’s further assume that your math on the backprop is correct. What is going on here? First, you may have learned why using the sigmoid function in hidden layers is ill-advised. (You can review the “Activation functions” video for a bit more on that.) I agree, the performance with the smaller hidden layers is abysmal. That said, notice how the sigmoid hidden activation performed with 50 hidden units. Pretty well! But the tanh NN only required 4 units to get comparable performance! Why does it do so much better? Given that their shapes are qualitatively similar (“S-like”), and by implication their gradients, one is led to think about how the actual range of values associated with the two functions is different. How might the presence of negative activations help out?
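One quick way to see the range difference concretely (plain numpy, nothing assignment-specific): sigmoid squashes every input into (0, 1), tanh spans (-1, 1), and tanh is really just a shifted and rescaled sigmoid.

```python
import numpy as np

z = np.linspace(-5, 5, 11)
sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)

print(sigmoid.min(), sigmoid.max())   # stays strictly inside (0, 1)
print(tanh.min(), tanh.max())         # spans (-1, 1), so activations can be negative

# tanh is a shifted/rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(tanh, 2 / (1 + np.exp(-2 * z)) - 1))   # True
```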
Now for the ReLU comparison. I am surprised, too. That’s good, so thanks for sharing! It would be nice to see how it did with 50 hidden units as a direct comparison to the sigmoid case! I would only note that the nature of the classification problem here is atypical in that the decision boundary is quite complex and sharply defined. Speculation. That said, my guess is that a deeper network (e.g. one extra hidden layer) may tilt things more toward a ReLU preference. More speculation. You might want to revisit that question after Week 3.
Thank you for your reply @kenb! Week 4 and the beginning of Course 2 already shed more light on the topic of hyperparameter selection/tuning.
On the point of tanh vs. sigmoid, here is my takeaway:
A sigmoid will always have a positive activation, no matter what Z you put in, so it is impossible to have any negative activations. If gradient descent wants to “turn off” a hidden unit, it has to push Z to a very negative value, where the gradient is very close to zero, which slows down learning.
tanh, on the other hand, can achieve an activation of zero right where its gradient is largest, so it is easy for gradient descent to invert or completely turn off a hidden unit.
The disadvantage of both vs. ReLU is that they require significantly more computation than the simple comparison max(0, z) that a ReLU requires.
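To put rough numbers on that takeaway (plain numpy, purely illustrative): both derivatives peak at z = 0, but sigmoid’s “off” state only exists out in the saturated region where its gradient has essentially vanished, while tanh’s “off” state (activation = 0) sits exactly where its gradient is maximal.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [0.0, -2.0, -5.0]:
    s = sigmoid(z)
    d_sigmoid = s * (1 - s)          # peaks at 0.25 when z = 0
    d_tanh = 1 - np.tanh(z) ** 2     # peaks at 1.0 when z = 0
    print(f"z = {z:5.1f}  sigmoid' = {d_sigmoid:.4f}  tanh' = {d_tanh:.4f}")

# z =   0.0  sigmoid' = 0.2500  tanh' = 1.0000
# z =  -2.0  sigmoid' = 0.1050  tanh' = 0.0707
# z =  -5.0  sigmoid' = 0.0066  tanh' = 0.0002
```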
As @kenb says, it is admirable that you are doing this kind of experimentation! You always learn something interesting when you take the course material and try this type of experiment.
That said, I think there is a problem in your backprop logic for the sigmoid case. The derivative of sigmoid is:
A1 * (1 - A1)
Not
sigmoid(A1) * (1 - sigmoid(A1))
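Concretely, in the Week 3 backward_propagation the only backprop line that changes with the hidden activation is the one applying g′(Z1). Here is a sketch assuming the notebook’s W2, dZ2, A1, Z1 names (the helper function itself is hypothetical, just to show the three cases side by side):

```python
import numpy as np

def hidden_layer_dZ1(W2, dZ2, A1, Z1=None, hidden_activation="tanh"):
    """Hypothetical helper: the one backprop line that depends on the hidden activation."""
    dA1 = np.dot(W2.T, dZ2)
    if hidden_activation == "tanh":
        return dA1 * (1 - A1 ** 2)       # tanh'(Z1) = 1 - tanh(Z1)^2 = 1 - A1^2
    if hidden_activation == "sigmoid":
        return dA1 * A1 * (1 - A1)       # sigmoid'(Z1) = A1 * (1 - A1); A1 is already sigmoid(Z1)
    if hidden_activation == "relu":
        return dA1 * (Z1 > 0)            # ReLU'(Z1) = 1 where Z1 > 0, else 0
```

The bug is applying sigmoid a second time: A1 is already sigmoid(Z1), so the derivative is A1 * (1 - A1), not sigmoid(A1) * (1 - sigmoid(A1)).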
I just converted to using sigmoid in the hidden layer and ran the standard training with the n = 4 hidden layer. With no other changes or hyperparameter tuning it works almost as well as tanh, with 89% accuracy instead of 90%:
I ran the training on all the various layer sizes, and with the correct backprop code for sigmoid in the hidden layer it looks like you hit “good enough” performance at n = 3. Beyond that it’s just costing you more compute, but is not generating any better performance.
I did ReLU as well, and there my results look exactly like yours. The one observation is that on the standard n = 4 case, the learning is really slow. So maybe fiddling with more iterations or a higher learning rate would be worth it, although the LR actually defaults to 1.2 here, which is pretty high. But then I just ran the 1 to 50 hidden unit test cases with no hyperparameter changes, and the n = 50 case actually works pretty well with 85% accuracy:
But the training takes forever. It might be worth a bit more fiddling to see if there’s a sweet spot between n = 20 and n = 50 where we could get good accuracy with reasonable compute cost. The other thing to notice there is that the n = 50 ReLU case actually takes a fundamentally different approach than either sigmoid or tanh at trying to discriminate that cluster of red dots right at the origin. So it looks like maybe it actually does give a qualitatively different solution. Maybe we could get it to do something similar with the blue dots in the upper center of the picture if we gave it even more neurons to work with. Or as @kenb said, maybe the smarter thing would be to try 2 hidden layers. We could probably get much greater complexity with fewer than 50 total neurons. Worth another try after we finish Week 4 and learn how to build the fully general case!
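Just to make the 2-hidden-layer idea concrete before we actually build it in Week 4, here is a purely speculative forward-pass sketch with hypothetical parameter names (not the notebook’s code):

```python
import numpy as np

def two_hidden_layer_forward(X, params):
    """Sketch: two ReLU hidden layers feeding a sigmoid output unit."""
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    W3, b3 = params["W3"], params["b3"]

    A1 = np.maximum(0, np.dot(W1, X) + b1)           # first ReLU hidden layer
    A2 = np.maximum(0, np.dot(W2, A1) + b2)          # second ReLU hidden layer
    A3 = 1 / (1 + np.exp(-(np.dot(W3, A2) + b3)))    # sigmoid output for binary labels
    return A3
```

Whether that actually beats 50 ReLU units in a single layer is exactly the experiment to run once the general implementation is in hand.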
For the ReLU case, I managed to get 87% with 100 hidden neurons and a 0.2 learning rate. I cannot get more with either more neurons or a higher learning rate (in fact, a learning rate larger than 0.3 leads to “bouncing” behavior and the cost never converges).
The take home message is that activation functions do have significant impact on training. Although they all contribute non-linearity to the model, they act in distinct ways. Maybe someone familiar with convex optimization will shed more light on this.
Yes, Albert Zhang, you are right that activation functions do play a significant role, in one way or another, in training a model.
In this course, Prof Ng generally uses ReLU as the activation function, as it is the most common choice for hidden layers. That is because it is easy to implement and good at overcoming the limitations of other activation functions. In the later courses, we come across the implementation of tanh as well.
The activation function used in the hidden layers is mainly chosen on the basis of the kind of NN architecture. Modern NNs with common architectures such as the MLP (Multilayer Perceptron) and CNN will generally use ReLU, whereas RNNs make use of tanh or sigmoid activations.
The output layer typically uses a different activation function from the hidden layers. It depends on the kind of predictions required by the model. As a general principle, either the sigmoid or the softmax activation function is used.
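For reference, a minimal sketch of those two output activations (plain numpy; the function names are just illustrative):

```python
import numpy as np

def sigmoid_output(z):
    """Binary classification: one output unit, probability of the positive class."""
    return 1 / (1 + np.exp(-z))

def softmax_output(z):
    """Multi-class classification: one unit per class, probabilities sum to 1."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))   # shift for numerical stability
    return e / np.sum(e, axis=0, keepdims=True)
```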