W3_A1_Implementing ReLU in NN

Hello everyone,

I’m working on the Week 3 Programming Assignment of Course 1 of the Deep Learning Specialization and I’m encountering a numerical stability issue when experimenting with ReLU and Leaky ReLU activation functions in a neural network. Below are my custom implementations of these functions:

import numpy as np

def relu(Z):
    # Standard ReLU: element-wise max(0, Z)
    return np.maximum(0, Z)

def leakyRelu(Z):
    # Leaky ReLU: Z where Z > 0, 0.01 * Z otherwise
    return np.maximum(0.01 * Z, Z)

When I use these functions in my network for the planar dataset, I observe that all variables (such as A1, Z1, A2, Z2, dW1, etc.) eventually turn into NaN values. The problem seems to emerge during the backward propagation, specifically at this line:

dZ1 = np.multiply(np.dot(W2.T, dZ2), (1 - np.power(A1, 2)))

This is likely due to np.power(A1, 2) where A1 is calculated using the ReLU function as shown:

Z1 = np.dot(W1, X) + b1
A1 = relu(Z1) # or A1 = leakyRelu(Z1)

I did not encounter this issue when using sigmoid or tanh functions. My network implementation seems correct overall, as I received full marks in the assignment submission. This issue arose while I was trying to extend the assignment with different activation functions.

Could anyone help me understand why this is happening with ReLU/Leaky ReLU and suggest possible solutions to fix this issue?

Thank you in advance!

1 Like

Hi @ph.ai ,

The attached image shows the differences between tanh, sigmoid, and ReLU. As you can see, if you are using ReLU, the output is either 0 or a positive value. In your case, you have masked out every negative value when applying ReLU to Z1.

Thank you for your prompt reply and the explanation regarding the behavior of the ReLU function.

I understand that ReLU outputs either 0 or a positive value, and this characteristic can lead to large positive values in A1 if Z1 is large. My concern is that these large values in A1 seem to be contributing to the NaN issue in my network, particularly during the computation of dZ1 in the backward propagation:

dZ1 = np.multiply(np.dot(W2.T, dZ2), (1 - np.power(A1, 2)))

Given this, I have a couple of questions:

  1. Limiting ReLU output: Is there a way to modify or cap the ReLU function so that A1 does not become excessively large (see the sketch after this list)? Then again, it would no longer really be the ReLU function.
  2. Alternative approaches: If the issue is not solely with the ReLU function, what other strategies should I consider to prevent these NaN occurrences?
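
For question 1, this is roughly what I mean by “limiting” the output. It is just a sketch with an arbitrary cap (similar in spirit to the ReLU6 variant), and, as noted, it is no longer plain ReLU:

import numpy as np

def capped_relu(Z, cap=6.0):
    # Like ReLU, but the output is clipped to [0, cap] so A1 cannot grow without bound.
    return np.clip(Z, 0, cap)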

I’m trying to understand the root cause of the problem and explore potential solutions to prevent NaN values in the network’s computations.
In the lecture videos, I was told that ReLU should give better results, but I can’t make it work.
Any further guidance or suggestions would be greatly appreciated.

2 Likes

Hi @ph.ai ,

This is a shallow neural network with 1 hidden layer and 1 output layer, where the hidden layer has only 4 neurons/units. If ReLU is used as the activation function on the hidden layer, then any negative input will not contribute to the output of the hidden layer, reducing the network’s overall capacity to learn. The NaN values suggest that some neurons may have died because of this.
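
If you want to check that empirically, here is a small sketch. The shapes just mirror the assignment’s convention of one row per unit and one column per example; the numbers are made up:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

# Toy forward pass: 4 hidden units, 2 input features, 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
W1 = rng.standard_normal((4, 2))
b1 = np.zeros((4, 1))

Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

# A hidden unit is "dead" on this batch if its ReLU output is zero for every example.
dead_mask = np.all(A1 == 0, axis=1)
print(f"dead hidden units: {dead_mask.sum()} / {A1.shape[0]}")

# NaNs can also be caught early, before they spread through backprop:
print("any NaN in A1:", bool(np.isnan(A1).any()))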

The job of a neural network is to learn the complexity of the input data, and the choice of activation function depends on the problem you are trying to solve. In Course 2, Prof. Ng discusses various techniques used in training a network model. I highly recommend Course 2 for building a firm understanding and foundation.

1 Like

In addition to other replies on this thread, here are two thoughts:

  • When you change the activation functions, you also need to change how the gradients are calculated (see the sketch after this list).
  • When you use ReLU instead of sigmoid or tanh, you need to use a lot more units, because many of the ReLU units are going to become dead (since they give zero or very small outputs for all negative inputs).
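
As a minimal sketch (these helper names are my own, not the assignment’s), the derivative terms for the three activations could look like this:

import numpy as np

def relu_grad(Z):
    # d/dZ of max(0, Z): 1 where Z > 0, 0 elsewhere
    return (Z > 0).astype(float)

def leaky_relu_grad(Z, alpha=0.01):
    # d/dZ of leaky ReLU: 1 where Z > 0, alpha elsewhere
    return np.where(Z > 0, 1.0, alpha)

def tanh_grad(A):
    # For tanh, the derivative is usually written in terms of the activation A = tanh(Z)
    return 1 - np.power(A, 2)
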
1 Like

It’s great that you are doing these experiments, but one thing to note is that the activation function comes into play in two places: during forward prop and also during back prop. Remember, that’s what the g'(Z) term is in the backprop calculation. So it looks like you are using ReLU in forward prop but the derivative of tanh in backprop. That doesn’t make any sense, right?
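
Here is a self-contained sketch of what I mean, using toy values with the assignment’s shapes (4 hidden units, 1 output unit, 5 examples); the (Z1 > 0) mask is exactly the g'(Z1) term for ReLU:

import numpy as np

rng = np.random.default_rng(1)
Z1 = rng.standard_normal((4, 5))        # hidden-layer pre-activation
A1 = np.maximum(0, Z1)                  # forward prop uses ReLU here...
W2 = rng.standard_normal((1, 4))
dZ2 = rng.standard_normal((1, 5))

# ...so backprop must use ReLU's derivative, not tanh's:
dZ1_wrong = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))  # only valid if A1 = tanh(Z1)
dZ1_relu = np.dot(W2.T, dZ2) * (Z1 > 0)                 # g'(Z1) for ReLU: 1 where Z1 > 0, else 0

print(dZ1_relu.shape)   # (4, 5)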

1 Like