I’m working on the Week 3 Programming Assignment of Course 1 in Deep Learning and encountering a numerical stability issue when experimenting with ReLU and Leaky ReLU activation functions in a neural network. Below are my custom implementations of these functions:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def leakyRelu(Z):
    # Z for positive inputs, 0.01*Z for negative inputs
    return np.maximum(0.01*Z, Z)

When I use these functions in my network for the planar dataset, I observe that all variables (such as A1, Z1, A2, Z2, dW1, etc.) eventually turn into NaN values. The problem seems to emerge during the backward propagation, specifically at this line:

I did not encounter this issue when using sigmoid or tanh functions. My network implementation seems correct overall, as I received full marks in the assignment submission. This issue arose while I was trying to extend the assignment with different activation functions.

Could anyone help me understand why this is happening with ReLU/Leaky ReLU and suggest possible solutions to fix this issue?

The attached image shows the differences between tanh, sigmoid, and ReLU. As you can see, with ReLU the output is either 0 or a positive value. In your case, applying ReLU to Z1 masks out every negative value.

Thank you for your prompt reply and the explanation regarding the behavior of the ReLU function.

I understand that ReLU outputs either 0 or a positive value, and this characteristic can lead to large positive values in A1 if Z1 is large. My concern is that these large values in A1 seem to be contributing to the NaN issue in my network, particularly during the computation of dZ1 in the backward propagation:

Limiting ReLU Output: Is there a way to modify or limit the ReLU function so that A1 does not become excessively large? That would no longer be the ReLU function, though.

Alternative Approaches: If the issue is not solely with the ReLU function, are there other strategies I should consider to prevent these NaN occurrences?

I’m trying to understand the root cause of the problem and explore potential solutions to prevent NaN values in the network’s computations.
In the videos, I was told that ReLU should give better results, but I can’t make it work.
Any further guidance or suggestions would be greatly appreciated.

This is a shallow neural network with 1 hidden layer and 1 output layer, and the hidden layer has only 4 neurons/units. If ReLU is used as the activation function on the hidden layer, any negative input contributes nothing to the output of the hidden layer, which reduces the network's overall capacity to learn. The NaN values suggest that some neurons may have died because of this.

The job of a neural network is to learn the complexity of the input data. The choice of activation function depends on the problem you are trying to solve. In Course 2, Prof. Ng discusses various techniques used in training a network model. I highly recommend Course 2 for building a firm understanding and foundation.

In addition to other replies on this thread, here are two thoughts:

When you change the activation functions, you also need to change how the gradients are calculated.

When you use ReLU instead of sigmoid or tanh, you need to use a lot more units, because many of the ReLU units are going to become dead (since they give zero or very small outputs for all negative values).
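To make the first point concrete, here is a minimal sketch of the derivatives that would take the place of the tanh or sigmoid derivative in backprop. The names `relu_grad` and `leaky_relu_grad` and the `alpha` parameter are hypothetical, not from the assignment, and the derivative at exactly Z = 0 is set to the negative-side value purely by convention.

```python
import numpy as np

def relu_grad(Z):
    # g'(Z) for ReLU: 1 where Z > 0, otherwise 0
    # (the value at Z == 0 is a convention, chosen here as 0)
    return (Z > 0).astype(float)

def leaky_relu_grad(Z, alpha=0.01):
    # g'(Z) for leaky ReLU: 1 where Z > 0, otherwise alpha
    # (alpha=0.01 is an assumed slope matching the 0.01 used in this thread)
    return np.where(Z > 0, 1.0, alpha)

Z = np.array([[-2.0, 0.0, 3.0]])
print(relu_grad(Z))
print(leaky_relu_grad(Z))
```

Both functions take Z (the pre-activation), not A, so Z1 has to be kept from forward prop for use in the backward pass.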

It’s great that you are doing these experiments, but one thing to note is that the activation function comes into play in two ways: during forward prop and also during back prop. Remember that’s what that g'(Z) term is in the backprop calculation. So it looks like you are using ReLU in forward prop, but the derivative of tanh in backprop. That doesn’t make any sense, right?
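Here is a rough sketch of that mismatch for the hidden-layer step, assuming the usual one-hidden-layer setup and the variable names used in this thread (Z1, A1, W2, dZ2). The shapes and random values are made up for illustration, and I am assuming the tanh derivative is written as 1 - A1^2. With tanh, A1 is bounded, so that factor stays between 0 and 1. With ReLU, A1 is unbounded, so the same expression can become a large negative number, while the actual ReLU derivative is only ever 0 or 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 hidden units, 1 output unit, 5 training examples
Z1 = rng.normal(scale=10.0, size=(4, 5))
A1 = np.maximum(0, Z1)        # ReLU used in forward prop
W2 = rng.normal(size=(1, 4))
dZ2 = rng.normal(size=(1, 5))

upstream = W2.T @ dZ2         # gradient arriving at the hidden layer, shape (4, 5)

# Mismatched: tanh derivative (1 - A1^2) applied to ReLU activations
dZ1_mismatched = upstream * (1 - A1 ** 2)

# Matched: ReLU derivative, 1 where Z1 > 0 and 0 elsewhere
dZ1_matched = upstream * (Z1 > 0)

print("largest |dZ1| with tanh derivative:", np.abs(dZ1_mismatched).max())
print("largest |dZ1| with ReLU derivative:", np.abs(dZ1_matched).max())
```

In the matched version the gradient is never scaled up by the activation itself, which is the property the mismatched version loses.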