Trying to replicate Week 4 deep network

Hi guys, I’m trying to replicate the ReLU → sigmoid binary classifier from the Week 4 programming assignment on my own data, and I’ve run into a couple of problems.

  1. The ReLU derivative implementation is supplied in the assignment, but is it correct that ReLU(a * x)’ = a?
  2. After a few iterations (2 or 3) I’m getting very high ReLU outputs (10.0+), which makes the sigmoid output very close to 1, which subsequently makes log(1-y) = -inf in the loss function. Is something wrong with my backprop, or should I somehow limit ReLU?
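To make the problem concrete, here is the failure mode in isolation (a minimal sketch with a made-up pre-activation value, not my actual code):

```python
import numpy as np

# Once the output-layer pre-activation gets large enough, sigmoid rounds to
# exactly 1.0 in float64 and the cross-entropy term log(1 - a) becomes -inf.
z = 40.0                   # made-up large pre-activation feeding the output sigmoid
a = 1 / (1 + np.exp(-z))   # rounds to exactly 1.0
print(a)                   # 1.0
print(np.log(1 - a))       # -inf (NumPy also emits a divide-by-zero warning)
```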

Thanks in advance!

  1. I don’t understand the question.
  2. Try normalizing the data set.
  1. The derivative of f(a * x) is a * dx, where a is a constant and f is ReLU.
  2. The data set is normalized, and the weights are initialized to small random values.

Perhaps ReLU is not a good choice of activation function for that data set.
Try using sigmoid() in the hidden layer instead.

Or, perhaps your learning rate is too high.

As you say, they gave you all the logic to deal with backprop for ReLU in the hidden layers and sigmoid at the output layer. Of course, if you alter the architecture as Tom suggests and use sigmoid in the hidden layers as well, then you need to make sure to adjust the backprop logic.
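To be concrete about what changes, it is the dZ computation for the hidden layers; the rest of the chain stays the same. A rough sketch (not the assignment's code; Z and dA are stand-ins for a layer's pre-activations and the gradient flowing back into it):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 3))    # stand-in for a hidden layer's pre-activations
dA = rng.standard_normal((4, 3))   # stand-in for the gradient flowing back into the layer

# ReLU hidden layer: g'(Z) is 1 where Z > 0 and 0 elsewhere
dZ_relu = dA * (Z > 0)

# Sigmoid hidden layer: g'(Z) = A * (1 - A) with A = sigmoid(Z)
A = 1 / (1 + np.exp(-Z))
dZ_sigmoid = dA * A * (1 - A)
```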

The derivative of ReLU is most compactly expressed as:

g'(Z) = (Z > 0)

So you get 1 for all values of Z > 0 and 0 for all values of Z ≤ 0, although technically the derivative of ReLU is undefined at Z = 0.
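In NumPy that is essentially a one-liner (a sketch, not the assignment's code):

```python
import numpy as np

def relu_derivative(Z):
    # 1.0 where Z > 0, and 0.0 where Z <= 0 (we just pick 0 at the undefined point Z = 0)
    return (Z > 0).astype(float)

print(relu_derivative(np.array([-2.0, 0.0, 3.5])))  # [0. 0. 1.]
```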

I’ve already tried adjusting the learning rate; it didn’t help. Are there special cases where ReLU is not a good choice? I thought ReLU was almost always a good choice, except for the output layer in a classification task.

I’m not using the code from the lab per se; I’m just trying to replicate the algorithm as I’ve understood it. Thanks for the tip. I guess I used the derivative of a linear function instead of ReLU’s, so I’ll adjust my code and see if that helps.

You can also examine their code for relu_backward and notice that they are rolling several computations into that one function, not just the derivative of ReLU.
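Their exact code isn't reproduced here, but a typical relu_backward rolls the derivative together with the chain-rule multiplication by the upstream gradient, roughly like this:

```python
import numpy as np

def relu_backward(dA, Z):
    # Two things in one function: apply the ReLU derivative (1 where Z > 0, else 0)
    # and multiply element-wise by the upstream gradient dA, i.e. dZ = dA * g'(Z).
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0.0
    return dZ

print(relu_backward(np.ones((1, 3)), np.array([[-1.0, 0.0, 2.0]])))  # [[0. 0. 1.]]
```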


ReLU isn’t a very good activation function, because its non-linearity is mostly trivial.

It’s also prone to getting stuck, because its output is 0 for all negative inputs.

The only advantage ReLU has is that the gradients are very easy to compute.

But because of the “dead unit” syndrome, you need a lot more ReLU units to do the job of a single sigmoid() or tanh() unit.
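A tiny illustration of the “dead unit” effect with made-up numbers: if a unit’s pre-activation is negative for every training example, its gradient is zero everywhere, so gradient descent can never revive it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))          # hypothetical normalized inputs
w, b = np.array([0.5, -0.3]), -10.0        # a unit whose bias has drifted far negative

Z = X @ w + b                              # every entry is well below zero
dZ_mask = (Z > 0).astype(float)            # ReLU derivative: zero for every example
print(dZ_mask.sum())                       # 0.0 -> no gradient reaches w or b; the unit is dead
```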


I don’t think you can make that general a statement. But the standard practice is to try ReLU as your first choice for the hidden layer activations. That is because it is by far the cheapest to compute of any activation function. So if it works, it’s a big win. Why wouldn’t you choose that if it works? But it doesn’t always work. If not, then you try Leaky ReLU, which is almost as cheap as ReLU and doesn’t have the “dead neuron” problem. If that also doesn’t give you good results, then you graduate to more expensive functions like tanh, sigmoid and swish.
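For reference, Leaky ReLU keeps a small slope below zero instead of going completely flat, so its gradient never vanishes and units cannot get permanently stuck. A quick sketch (the 0.01 slope is a common default, not a requirement):

```python
import numpy as np

def leaky_relu(Z, alpha=0.01):
    return np.where(Z > 0, Z, alpha * Z)

def leaky_relu_derivative(Z, alpha=0.01):
    # Never exactly zero, unlike plain ReLU for Z <= 0
    return np.where(Z > 0, 1.0, alpha)

print(leaky_relu(np.array([-3.0, 2.0])))             # [-0.03  2.  ]
print(leaky_relu_derivative(np.array([-3.0, 2.0])))  # [0.01 1.  ]
```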

Hi @ales.veshtort

Yes, when x is positive.
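Spelling out the chain rule (assuming a > 0, so that a * x > 0 exactly when x > 0):

d/dx ReLU(a * x) = a * ReLU'(a * x), which is a when a * x > 0 and 0 when a * x < 0 (undefined at a * x = 0, just like plain ReLU).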

To find out if something’s wrong with your backprop, you need to inspect all those intermediate variables along the way and see if they all match your expectations. Here is how I would do it:

  1. Use TensorFlow and see if it reproduces a similar trend. Make the setup as close to yours as possible (a minimal sketch of such a reference model follows below).

  2. Inspecting intermediate variables is laborious work, but to make it lighter, I would simplify my architecture as much as possible, verify that the simplified architecture still has the same problem, and then start the inspection.

If your backprop is OK, then you know what to expect from the two steps above.
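For step 1, a minimal TensorFlow reference model might look like this (a sketch; the placeholder data, layer sizes, learning rate, and epoch count are assumptions you would replace to match your own implementation):

```python
import numpy as np
import tensorflow as tf

# Replace these placeholders with the same normalized data you feed your own network.
X_train = np.random.randn(200, 10).astype("float32")
y_train = (np.random.rand(200, 1) > 0.5).astype("float32")

# Placeholder architecture: one ReLU hidden layer, sigmoid output, binary cross-entropy.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")
history = model.fit(X_train, y_train, epochs=50, verbose=0)
print(history.history["loss"][:5])  # compare this loss trend against your implementation
```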

If I were you, I would not consider any other activation until I had made sure my backprop was OK, because I would want to build on a solid foundation.

Good luck!
Raymond


Thanks, everyone! I’ll try the approaches you suggested. Thanks again for the quick support!