Adding layers triggers high bias

Hi

Thanks for the great content. I am currently trying to understand bias and variance in my own data. From the course, I understood that bias can be counteracted by enlarging the neural network (adding layers and units).

I am currently working on a neural network that predicts whether something should trigger an alarm or not. An NN with layer dimensions (8, 7, 7, 1) has an accuracy of approximately 96% on both the training and test sets.

I was wondering whether one could increase the accuracy by making the model more complex. Hence, I started adding additional layers. However, if one or more layers are added, the accuracy on both the training and test sets drops to 50%. Basically, no learning is taking place.

Can you explain, or provide some resources on, why adding an additional layer can affect the model output that drastically?

Cheers
Victoria

1 Like

Maybe plotting the training and validation (test) losses will answer your question.
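
For example, a minimal matplotlib sketch (costs_train and costs_val are placeholder names for lists of costs collected during training):

import matplotlib.pyplot as plt

# costs recorded during training (placeholder names)
plt.plot(costs_train, label="train")
plt.plot(costs_val, label="validation (test)")
plt.xlabel("iteration")
plt.ylabel("cost")
plt.legend()
plt.show()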

1 Like

Thank you for the feedback; I did this already. The costs stay more or less constant. I am very much wondering how such behaviour can be triggered by adding layers.

1 Like

I recommend you only add a 2nd hidden layer after you’ve tried everything to use just one.

Typically you can fix a bias problem by adding more units to the hidden layer, as that lets it learn a more complex model.

You also may need to try a different activation function.

The more hidden layers you add, the more difficult training becomes. You may need a lot more training data, or to train for a lot more epochs, use a different learning rate, etc.
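
For example, using the layers_dims convention from the course (the unit counts below are just illustrative, and L_layer_model is assumed to be the course's training driver):

# two hidden layers: the current architecture
layers_dims = (8, 7, 7, 1)

# one wider hidden layer, worth trying before adding depth
layers_dims = (8, 14, 1)
parameters, costs = L_layer_model(X_train, y_train, layers_dims, num_iterations=5000, print_cost=True)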

Can you post some images of your result plots?

1 Like

Thank you very much for the reply.

These are my cost graphs: the first with layer_dims = (8, 7, 7, 1) and the second with layer_dims = (8, 7, 7, 3, 1).
[cost plot for layer_dims = (8, 7, 7, 1)]
[cost plot for layer_dims = (8, 7, 7, 3, 1)]

Thank you!!

Thanks for the plots.

How confident are you that your model with three hidden layers is working correctly? I.e., is backpropagation correctly implemented? You have not said much about what methods you are using (hand-coded, TensorFlow, etc.).

How large is your training set?

Hi.

This was my thought, too. Maybe something else is wrong, and it only shows up when more layers are added. I will try to implement gradient checking and come back here if everything seems fine.

It is really helpful to receive feedback like this.

Thank you!

@Victoria_Schroeder, can you provide some information on this?

Hi

Sure, it is hand-coded. I applied an adapted version of the last assignment of “Neural Networks and Deep Learning”. The hidden layers use a ReLU activation function and the last layer uses a sigmoid activation function.

Thanks!

Hi

So I implemented gradient checking. I took the code from the assignment of this specialisation (week 1), that is, the functions forward_propagation_n(X, Y, parameters), backward_propagation_n(X, Y, cache) and gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7, print_msg=False), and changed the input to fit my model dimensions.

That is, instead of a 3-layer NN with layer_dims = (4, 5, 3, 1) and the parameters:
parameters = {}
parameters["W1"] = theta[:20].reshape((5, 4))
parameters["b1"] = theta[20:25].reshape((5, 1))
parameters["W2"] = theta[25:40].reshape((3, 5))
parameters["b2"] = theta[40:43].reshape((3, 1))
parameters["W3"] = theta[43:46].reshape((1, 3))
parameters["b3"] = theta[46:47].reshape((1, 1))

I adjusted it to a simpler version of my model with 6 input features, hence layer_dims = (6, 5, 3, 1), and the parameters:
parameters = {}
parameters["W1"] = theta[:30].reshape((5, 6))
parameters["b1"] = theta[30:35].reshape((5, 1))
parameters["W2"] = theta[35:50].reshape((3, 5))
parameters["b2"] = theta[50:53].reshape((3, 1))
parameters["W3"] = theta[53:56].reshape((1, 3))
parameters["b3"] = theta[56:57].reshape((1, 1))

The input parameters were taken from the function initialize_parameters_deep(layers_dims) from the previous section of this specialisation.
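
In case it helps, the slices above can also be derived from layer_dims automatically. A sketch of a hypothetical helper (not from the assignment; theta is assumed to be a NumPy column vector):

def vector_to_parameters(theta, layer_dims):
    # walk through layer_dims = (n_x, n_h1, ..., n_y) and slice the unrolled
    # vector theta back into the W and b arrays, layer by layer
    parameters = {}
    start = 0
    for l in range(1, len(layer_dims)):
        rows, cols = layer_dims[l], layer_dims[l - 1]
        parameters["W" + str(l)] = theta[start:start + rows * cols].reshape((rows, cols))
        start += rows * cols
        parameters["b" + str(l)] = theta[start:start + rows].reshape((rows, 1))
        start += rows
    return parameters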

According to gradient checking, there is a problem in the code. Hence, I am checking the difference between grad and gradapprox more carefully with:
for l in range(0, len(grad) - 1):
    print(str(grad[l] - gradapprox[l]) + " this is " + str(l))

The difference grad[l] - gradapprox[l] is usually <= 1e-10. Only the values corresponding to b2 differ by about 1e-5.
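
(For reference, the single number that gradient_check_n reports is a normalized difference over the whole unrolled gradient, roughly:)

import numpy as np

# grad: gradients from backpropagation, gradapprox: finite-difference estimates
numerator = np.linalg.norm(grad - gradapprox)
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
difference = numerator / denominator  # a difference well below ~1e-7 is the goal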

Since I took the code from the assignment (where the 2 errors were removed), I don't expect that the mistake lies in the code from the assignment (also, I passed the assignment with 100%). Hence, the mistake can only be due to the initialisation.

I am quite stuck here, since I would have expected that the correct code from the assignment would produce a gradient check result < 1e-7.

It is a pity that one cannot paste any code here. It would be very interesting to understand what the difference is between the assignment and my current example.

Any ideas?

Can you explain a bit about this theta variable?

1 Like

Also, when you say you got the initialization code from “the previous section of this specialization”, what does that mean?

The Deep Learning Specialization is five courses, of 3 or 4 weeks each, and each week has multiple programming assignments. So, please be specific.

1 Like

Sure.

However, first I need to ask a quick question. I just did the same analysis, that is:

parameters, costs = L_layer_model(X_train, y_train, layers_dims, num_iterations=5000, print_cost=False)
cost, cache = forward_propagation_n(X_train, y_train, parameters)
gradients = backward_propagation_n(X_train, y_train, cache)
difference, grad, gradapprox = gradient_check_n(parameters, gradients, X_train, y_train, 1e-7, True)

So I took the parameters that are used to create theta (via dictionary_to_vector(parameters)) from the output of my model instead of from the output of the parameter initialisation. Now gradient checking works fine; the result is:
Your backward propagation works perfectly fine! difference = 7.462864222724888e-09

Hence, it was wrong to use the parameters from the parameter initialisation? I think that I did not understand the gradient checking properly. In other words, why does the gradient checking depend on the values of the parameters?

1 Like

You can use any version of your parameters for gradient checking, but note that if you use the results of initialization, then all the bias parameters (b^{[l]}) are initialized to zero, right? That’s not a good idea. You could just use random values for all the parameters and get valid results. Gradient checking does not depend on the parameters themselves being good values or not: it’s just checking the math of your gradient calculations. But you need all the parameters to be non-zero. Using the results of training should be fine, since the bias values will no longer be zero.
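
For example, a minimal sketch (assuming the parameters dictionary from the course code) that overwrites every parameter with small non-zero random values before running the check:

import numpy as np

rng = np.random.default_rng(1)
for key in parameters:
    # small random values for the W and b parameters alike, so nothing is zero
    parameters[key] = rng.standard_normal(parameters[key].shape) * 0.1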

1 Like

If you want to share your code to get a “second opinion” on it, we could create a DM thread with you, Tom, Saif and me.

Check your DMs. You can recognize the difference by the little “envelope” icon that Discourse uses for a DM thread.

1 Like

I think Victoria means the theta that we use in the DLS C2 W1 Gradient Checking assignment to “unroll” all the parameters into a single vector for the purposes of “tweaking” each individual parameter by +\epsilon and -\epsilon to do the finite difference approximation of the gradient.
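
In a minimal sketch, that loop looks roughly like this (cost_fn stands in for forward propagation plus the cost computation on the unrolled vector):

import numpy as np

def finite_difference_grad(cost_fn, theta, epsilon=1e-7):
    # theta: unrolled parameter vector; approximate dJ/d(theta_i) with a
    # centered difference, one component at a time
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        gradapprox[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    return gradapprox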

1 Like

Hello @Victoria_Schroeder,

I am late to this thread and haven’t yet quite followed everything that happened, but the moment I saw something like the above, I would print all of the gradient values to debug my algorithm. My reasoning is: no cost change may be due to no weights updating, so perhaps some gradients were mistakenly zeroed out, or the update was not right. Printing the gradients is only the first step; it should lead me to printing more until I have a visual on how the update process went wrong or what led the cost to stay unchanged.
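
As a first step, a minimal sketch (assuming grads is the dictionary your backward propagation returns, with keys like "dW1" and "db1"):

import numpy as np

def print_grad_summary(grads):
    # one line per gradient: overall norm and largest absolute entry
    for name in sorted(grads):
        g = grads[name]
        print(f"{name}: norm = {np.linalg.norm(g):.3e}, max|g| = {np.abs(g).max():.3e}")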

If a top-down inspection of the code doesn’t get you out, a bottom-up inspection of the process and the values might make a difference :wink:

Cheers,
Raymond

1 Like

This reply is fantastic. It just made my day. Thank you!

The failing gradient checking was due to the fact that the bias parameters (b^{[l]}) were initialized to zero.

Since I used the code from the DLS C2 W1 Gradient Checking assignment and just used different parameters and X and Y inputs, I would now like to build the gradient checking into my original code. However, the math is similar, so it should not change the results.
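
Roughly, I expect the wiring to look like this (function names are from the course assignments; my exact code may differ slightly):

# forward / backward pass from my own model, then hand the results to the checker
AL, caches = L_model_forward(X_train, parameters)
cost = compute_cost(AL, y_train)
grads = L_model_backward(AL, y_train, caches)
difference, grad, gradapprox = gradient_check_n(parameters, grads, X_train, y_train, 1e-7, True)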

If I then get the expected result that gradient checking went well, I am still left with the question of why adding a layer to the model stops the model from learning.

Thank you very much for helping me out. Please enjoy your holidays, and if the issue still remains, I will get in contact.

1 Like

Thank you very much for the reply. I will try to do that, and hopefully this will give some insight into why no learning is taking place when adding more than 2 hidden layers to the model.

1 Like

Adding layers should not stop the model from learning. Assuming your implementation is correct, learning with multiple hidden layers may take a lot longer; you might need to jiggle the initial weight values, adjust the learning rate, etc.
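
On the initial weight values: He initialization often helps deeper ReLU stacks. A minimal sketch (an assumption on my part; it is worth checking whether your initialize_parameters_deep already scales this way):

import numpy as np

def initialize_he(layer_dims, seed=1):
    # weights scaled by sqrt(2 / n_prev), a common choice for ReLU layers
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_l = layer_dims[l - 1], layer_dims[l]
        parameters["W" + str(l)] = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / n_prev)
        parameters["b" + str(l)] = np.zeros((n_l, 1))
    return parameters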

I wonder if you would consider implementing your model in TensorFlow, as a comparison. TensorFlow will handle the gradients, backpropagation, and cost calculations automatically. There is a lot less to go wrong.

This might provide some clues as to whether the issue is your model implementation, or maybe some complexity with your dataset.
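
For example, a minimal Keras sketch of the deeper architecture from your earlier posts (an assumption: X_train has samples as rows, shape (m, 8), and y_train holds 0/1 labels):

import tensorflow as tf

# layer_dims = (8, 7, 7, 3, 1): three ReLU hidden layers, one sigmoid output
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(7, activation="relu"),
    tf.keras.layers.Dense(7, activation="relu"),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=100, validation_split=0.2)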

1 Like