Adding layers triggers biasedness


Thanks for the great content. I am currently trying to understand bias and variance in my own data. From the course, I understood that bias can be counteracted by enlarging the neural network (adding layers and units).

I am currently working on a neural network that predicts whether an alarm should be raised or not. A NN with the dimensions 8, 7, 7, 1 has an accuracy of approx. 96% on both the training and the test set.

I was wondering whether one could increase the accuracy by making the model more complex. Hence, I started adding additional layers. However, if one or more layers are added, the accuracy on both the training and the test set drops to 50%. Basically, no learning is taking place.

Can you explain, or point me to some resources on, why adding an additional layer can affect the model output so drastically?



Maybe plotting the training and validation (test) losses will answer your question.


Thank you for the feedback. I did this already. The costs stay more or less constant. I am very much wondering how such behaviour can be triggered by adding layers.


I recommend you only add a 2nd hidden layer after you’ve tried everything to use just one.

Typically you can fix a bias problem by adding more units to the hidden layer, as that lets it learn a more complex model.

You also may need to try a different activation function.

The more hidden layers you add, the more difficult the training becomes. You may need a lot more training data, or need to train for a lot more epochs, use a different learning rate, etc.

Can you post some images of your result plots?


Thank you very much for the reply.

These are my cost graphs: the first with layer dims = 8,7,7,1 and the second with layer dims = 8,7,7,3,1.

Thank you!!

Thanks for the plots.

How confident are you that your model with three hidden layers is working correctly? I.e., is backpropagation correctly implemented? You have not said much about what methods you are using (hand-coded, TensorFlow, etc.).

How large is your training set?

Hi .

This was my thought, too. Maybe something else is wrong and it only shows up when more layers are added. I will try to implement gradient checking and come back here if everything seems fine.

It is really helpful to receive feedback like this.

Thank you!

@Victoria_Schroeder, can you provide some information on this?


Sure, it is hand-coded. I applied an adapted version of the last assignment of “Neural Networks and Deep Learning”. The hidden layers have a ReLU activation function and the last layer has a sigmoid activation function.



So I implemented gradient checking. I took the code from the assignment of this specialisation (week 1), that is, the functions forward_propagation_n(X, Y, parameters), backward_propagation_n(X, Y, cache) and gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7, print_msg=False), and changed the input to fit my model dimensions.

That is, instead of a 3-layer NN with layer_dims = 4,5,3,1 and the parameters:
parameters = {}
parameters["W1"] = theta[: 20].reshape((5, 4))
parameters["b1"] = theta[20: 25].reshape((5, 1))
parameters["W2"] = theta[25: 40].reshape((3, 5))
parameters["b2"] = theta[40: 43].reshape((3, 1))
parameters["W3"] = theta[43: 46].reshape((1, 3))
parameters["b3"] = theta[46: 47].reshape((1, 1))

I adjusted it to a simpler version of my model, where the input X has 6 features and hence layer dims = (6,5,3,1), with the parameters:
parameters = {}
parameters["W1"] = theta[: 30].reshape((5, 6))
parameters["b1"] = theta[30: 35].reshape((5, 1))
parameters["W2"] = theta[35: 50].reshape((3, 5))
parameters["b2"] = theta[50: 53].reshape((3, 1))
parameters["W3"] = theta[53: 56].reshape((1, 3))
parameters["b3"] = theta[56: 57].reshape((1, 1))

The input parameters were taken from the function initialize_parameters_deep(layers_dims) from the previous section of this specialisation.
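The hand-written slice offsets above (30, 35, 50, ...) are easy to get wrong when the layer sizes change. As a rough sketch only, the offsets could be derived from the layer sizes instead. The function name `vector_to_dictionary_general` and its `layer_dims` argument are hypothetical and are not part of the course assignment; the W-then-b ordering simply mirrors the slices posted above.

```python
import numpy as np

def vector_to_dictionary_general(theta, layer_dims):
    """Slice a flat vector into W1, b1, ..., WL, bL (hypothetical helper).

    layer_dims lists the layer sizes with the input size first,
    e.g. (6, 5, 3, 1). The order W1, b1, W2, b2, ... is an assumption
    that matches the hand-written slices in the post.
    """
    theta = np.asarray(theta).reshape(-1)
    parameters = {}
    start = 0
    for l in range(1, len(layer_dims)):
        n_out, n_in = layer_dims[l], layer_dims[l - 1]
        # W_l has shape (n_out, n_in) and comes first.
        end = start + n_out * n_in
        parameters["W" + str(l)] = theta[start:end].reshape((n_out, n_in))
        start = end
        # b_l has shape (n_out, 1) and follows W_l.
        end = start + n_out
        parameters["b" + str(l)] = theta[start:end].reshape((n_out, 1))
        start = end
    if start != theta.size:
        raise ValueError(
            f"theta has {theta.size} entries but layer_dims needs {start}"
        )
    return parameters

# Layer sizes from the post: 6 inputs, hidden layers of 5 and 3, 1 output.
params = vector_to_dictionary_general(np.arange(57.0), (6, 5, 3, 1))
```

Filling theta with 0..56 makes it easy to confirm that each block starts at the same offset as the manual version, for example that b1 starts at index 30 and b3 holds index 56.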

According to gradient checking there is a problem in the code. Hence, I am checking the difference between grad and gradapprox more carefully with:
for l in range(0, len(grad)-1):
    print(str(grad[l] - gradapprox[l]) + " this is " + str(l))

The difference grad[l] - gradapprox[l] is usually on the order of 1e-10 or smaller. Only the values corresponding to b2 differ by about 1e-5.

Since I took the code from the assignment (where the 2 errors were removed), I don't expect the mistake to lie in the assignment code (I also passed the assignment with 100%). Hence, the mistake can only be due to the initialisation.

I am quite stuck here, since I would have expected the correct code from the assignment to produce a gradient check result < 1e-7.

It is a pity that one cannot paste any code here. It would be very interesting to understand the difference between the assignment and my current example.

Any ideas?

Can you explain a bit about this theta variable?


Also, when you say you got the initialization code from “the previous section of this specialization”, what does that mean?

The Deep Learning Specialization is five courses, of 3 or 4 weeks each, and each week has multiple programming assignments. So, please be specific.


Sure.

However, first I need to ask a quick question. I just did the same analysis, that is:

parameters, costs = L_layer_model(X_train, y_train, layers_dims, num_iterations = 5000, print_cost = False)
cost, cache = forward_propagation_n(X_train, y_train, parameters)
gradients = backward_propagation_n(X_train, y_train, cache)
difference, grad, gradapprox = gradient_check_n(parameters, gradients, X_train, y_train, 1e-7, True)

So I took the parameters that are used to create theta (via dictionary_to_vector(parameters)) from the output of my trained model instead of from the output of the parameter initialisation. Now gradient checking works fine. That is, the result is:
Your backward propagation works perfectly fine! difference = 7.462864222724888e-09

Hence, was it wrong to use the parameters from the parameter initialisation? I think that I did not understand gradient checking properly. In other words, why does gradient checking depend on the values of the parameters?


You can use any version of your parameters for gradient checking, but note that if you use the results of initialization then all the bias parameters (b^{[l]}) are initialized to zero, right? That’s not a good idea. You could just use random values for all the parameters and get valid results. Gradient checking does not depend on the parameters themselves being good values or not: it’s just checking the math of your gradient calculations. But you need all the parameters to be non-zero. Using the results of training should be fine, since the bias values will no longer be zero.
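One way to follow that suggestion, sketched here with a hypothetical helper `randomize_parameters` (not part of the course code), is to keep the shapes from the initialization but overwrite every entry, biases included, with random values before running the check:

```python
import numpy as np

def randomize_parameters(parameters, seed=0):
    """Return a copy of the parameter dictionary with random values.

    Hypothetical helper: every array, including the biases, is replaced
    by standard-normal samples of the same shape, so nothing is exactly
    zero when the gradient check runs.
    """
    rng = np.random.default_rng(seed)
    return {
        name: rng.standard_normal(np.shape(value))
        for name, value in parameters.items()
    }

# Shapes as produced by an initialization for layer dims (6, 5, 3, 1),
# with zero biases as described in the thread.
initial = {
    "W1": np.ones((5, 6)), "b1": np.zeros((5, 1)),
    "W2": np.ones((3, 5)), "b2": np.zeros((3, 1)),
    "W3": np.ones((1, 3)), "b3": np.zeros((1, 1)),
}
check_params = randomize_parameters(initial)
```

The values are only meant for checking the gradient math, not for training.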


If you want to share your code to get a “second opinion” on it, we could create a DM thread with you, Tom, Saif and me.

Check your DMs. You can recognize the difference by the little “envelope” icon that Discourse uses for a DM thread.


I think Victoria means the theta that we use in the DLS C2 W1 Gradient Checking assignment to “unroll” all the parameters into a single vector for the purposes of “tweaking” each individual parameter by +\epsilon and -\epsilon to do the finite difference approximation of the gradient.
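For readers who have not seen that assignment, the idea can be sketched on a toy problem. This is not the assignment's code: `gradient_check`, `cost` and `cost_grad` are illustrative names, theta is assumed to be a 1-D vector, and the cost is a simple cubic rather than a neural network cost.

```python
import numpy as np

def gradient_check(cost_fn, grad, theta, epsilon=1e-7):
    """Compare an analytic gradient with a two-sided finite difference.

    cost_fn maps a 1-D parameter vector to a scalar cost, and grad is the
    analytic gradient evaluated at theta. Returns the relative difference
    and the numerical gradient.
    """
    theta = np.asarray(theta, dtype=float)
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        # Nudge only the i-th parameter up and down by epsilon.
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        gradapprox[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    # Relative difference: ||grad - gradapprox|| / (||grad|| + ||gradapprox||)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator, gradapprox

def cost(theta):
    # Toy cost, chosen only because its gradient is easy to verify by hand.
    return np.sum(theta ** 3)

def cost_grad(theta):
    return 3 * theta ** 2

theta = np.array([0.5, -1.2, 2.0])
difference, gradapprox = gradient_check(cost, cost_grad(theta), theta)
# A deliberately wrong gradient should produce a large relative difference.
wrong_difference, _ = gradient_check(cost, 2 * theta ** 2, theta)
```

With a correct gradient the relative difference should be tiny, while the deliberately wrong gradient gives a value many orders of magnitude larger.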


Hello @Victoria_Schroeder,

I am late to this thread and haven’t yet quite followed everything that happened, but the moment I saw something like the above, I would print all of the gradient values to debug my algorithm. My reasoning is that an unchanging cost may be due to the weights not being updated, so perhaps some gradients were mistakenly zeroed out, or the update step was not right. Printing the gradients is only the first step. It should lead me to print more until I can see how the update process went wrong or what led the cost to stay unchanged.

If a top-down inspection of the code doesn’t get you out, a bottom-up inspection of the process and the values might make a difference :wink:
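A first step along those lines might look like the sketch below. It assumes the gradients live in a dictionary with keys such as "dW1" and "db1", as in the course's hand-coded backward pass; the helper name `summarize_gradients` is hypothetical.

```python
import numpy as np

def summarize_gradients(grads):
    """Print and return the norm and zero fraction of every gradient array.

    A norm of zero, or a zero fraction of 100%, suggests that the layer
    is receiving no learning signal at all.
    """
    summary = {}
    for name in sorted(grads):
        g = np.asarray(grads[name], dtype=float)
        norm = float(np.linalg.norm(g))
        zero_fraction = float(np.mean(g == 0))
        summary[name] = {"norm": norm, "zero_fraction": zero_fraction}
        print(f"{name}: norm={norm:.3e}, zeros={zero_fraction:.0%}")
    return summary

# Made-up values: dW1 is entirely zero, db1 is not.
example_grads = {
    "dW1": np.zeros((3, 2)),
    "db1": np.array([[3.0], [4.0]]),
}
summary = summarize_gradients(example_grads)
```

Calling this every few hundred iterations would show whether some layer's gradients are zero from the start or collapse to zero during training.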



This reply is fantastic. It just made my day. Thank you!

The failing gradient check was due to the fact that the bias parameters (b[l]) were initialized to zero.

Since I used the code from the DLS C2 W1 Gradient Checking assignment and only changed the parameters and the X and Y inputs, I would now like to build gradient checking into my original code. However, the math is similar, so it should not change the results.

If gradient checking then passes, as I expect, I am still left with the question of why adding a layer stops the model from learning.

Thank you very much for helping me out. Please enjoy your holidays. If the issue remains, I will get in contact.


Thank you very much for the reply. I will try to do that, and hopefully it will give some insight into why no learning takes place when more than 2 hidden layers are added to the model.


Adding layers should not stop the model from learning. Assuming your implementation is correct, learning with multiple hidden layers may take a lot longer, and you might need to jiggle the initial weight values, adjust the learning rate, etc.
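One common way to adjust the initial weights for ReLU hidden layers is He initialization, which scales each weight matrix by sqrt(2 / n_in). The sketch below is only an illustration of that technique. It is not known which scaling the model in this thread uses, and the function name `initialize_parameters_he` is an assumption.

```python
import numpy as np

def initialize_parameters_he(layer_dims, seed=1):
    """He initialization for a fully connected network (illustrative).

    layer_dims lists the layer sizes with the input size first,
    e.g. (8, 7, 7, 3, 1). Weights are drawn from a standard normal and
    scaled by sqrt(2 / n_in). Biases start at zero, which is the usual
    choice for training.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_out, n_in = layer_dims[l], layer_dims[l - 1]
        scale = np.sqrt(2.0 / n_in)
        parameters["W" + str(l)] = rng.standard_normal((n_out, n_in)) * scale
        parameters["b" + str(l)] = np.zeros((n_out, 1))
    return parameters

# Layer sizes of the deeper model from the thread.
he_params = initialize_parameters_he((8, 7, 7, 3, 1))
```

Trying a few different seeds is a cheap way to see whether the 50% result depends on an unlucky starting point.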

I wonder if you would consider implementing your model in TensorFlow, as a comparison. TensorFlow will handle the gradients, backpropagation, and cost calculations automatically. There is a lot less to go wrong.

This might provide some clues as to whether the issue is your model implementation, or maybe some complexity with your dataset.
