Hi guys, I’m trying to replicate the ReLU → Sigmoid binary classifier from the Week 4 programming assignment on my own data, and I’ve run into a number of problems.
The ReLU derivative implementation is supplied in the assignment, but is it correct that ReLU(a * x)’ = a?
After a number of iterations (2 or 3) I’m getting a very high ReLU output (10.0+), which makes the sigmoid output very close to 1, which subsequently makes log(1-y) = -inf in the loss function. Is there something wrong with my backprop, or should I somehow limit ReLU?
As you say, they gave you all the logic to handle backprop for ReLU in the hidden layers and sigmoid at the output layer. Of course, if you alter the architecture as Tom suggests and use sigmoid in the hidden layers as well, then you need to make sure to adjust the backprop logic accordingly.
The derivative of ReLU is most compactly expressed as:
g'(Z) = (Z > 0)
So you get 1 for all values of Z > 0 and 0 for all values of Z <= 0, although technically the derivative of ReLU is undefined at Z = 0.
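In numpy that boolean comparison does the work elementwise, so a minimal sketch of the derivative looks like this (variable names are mine, not the assignment’s):

```python
import numpy as np

Z = np.array([-2.0, 0.0, 3.5])
g_prime = (Z > 0).astype(float)  # -> [0., 0., 1.]; 0 is assigned at Z == 0 by convention
```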
I already tried adjusting the learning rate; it didn’t help. Are there special cases where ReLU is not a good choice? I thought ReLU was almost always a good choice, except for the output layer in a classification task.
I’m not using code from the lab per se; I’m just trying to replicate the algorithm as I’ve understood it. Thanks for the tip. I guess I just used the linear function’s derivative instead of ReLU’s, so I’ll adjust my code and see if that helps.
You can also examine their code for relu_backward and notice that they are rolling several computations into that one function, not just the derivative of ReLU.
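Roughly, such a function folds the chain-rule step dZ = dA * g'(Z) into one pass. A sketch of the idea (not their exact code, and I’m assuming Z is what gets cached):

```python
import numpy as np

def relu_backward(dA, Z):
    # Fold the chain rule dZ = dA * g'(Z) into one step:
    # copy the upstream gradient and zero it out where Z <= 0.
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0.0
    return dZ
```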
I don’t think you can make that general a statement. But the standard practice is to try ReLU as your first choice for the hidden layer activations. That is because it is by far the cheapest to compute of any activation function. So if it works, it’s a big win. Why wouldn’t you choose that if it works? But it doesn’t always work. If not, then you try Leaky ReLU, which is almost as cheap as ReLU and doesn’t have the “dead neuron” problem. If that also doesn’t give you good results, then you graduate to more expensive functions like tanh, sigmoid and swish.
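For concreteness, Leaky ReLU and its derivative are nearly as cheap as ReLU. A quick sketch (alpha = 0.01 is just a common default, not a prescribed value):

```python
import numpy as np

def leaky_relu(Z, alpha=0.01):
    # The small slope on the negative side keeps some gradient flowing,
    # which avoids the "dead neuron" problem.
    return np.where(Z > 0, Z, alpha * Z)

def leaky_relu_backward(dA, Z, alpha=0.01):
    # The derivative is 1 for Z > 0 and alpha otherwise.
    return dA * np.where(Z > 0, 1.0, alpha)
```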
To find out whether something’s wrong with your backprop, you need to inspect the intermediate variables along the way and check that they all match your expectations. Here is how I would do it:

1. Use TensorFlow and see if it reproduces a similar trend. Make sure everything (architecture, initialization, hyperparameters) is as close as possible to your implementation; see the sketch below.

2. Inspecting intermediate variables is laborious work, but to make it lighter, I would simplify the architecture as much as possible, verify that the simple architecture still shows the same problem, and then start the inspection.

If your backprop is OK, you will know what to expect from the two checks above.
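For step 1, a minimal Keras reference could look like this (layer sizes, learning rate, and the random data are all placeholders; match them to your own setup):

```python
import numpy as np
import tensorflow as tf

# Tiny reference model: one ReLU hidden layer -> sigmoid output,
# mirroring the ReLU -> Sigmoid architecture from the assignment.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")

X = np.random.randn(32, 3).astype("float32")
y = (np.random.rand(32, 1) > 0.5).astype("float32")
model.fit(X, y, epochs=3)  # compare this loss trend with your own implementation
```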
If I were you, I would not consider any other activation function until I had made sure my backprop was OK, because I would want to build on a solid foundation.