Spikes in cost function plot for deep "relu" nn

Hi, I was wondering if someone could help me out with this:

  • I tried building from scratch a deep NN for MNIST
  • the network uses “relu” as an activation function for the hidden layers and softmax for the output layer
  • the training is done in batch (i.e. gradient descent is performed on the entire training set)
  • I use the following for cost computation: self.cost = - np.mean(self.y * np.log(self.output + 0.000000001))
  • I get 96% accuracy on both the test and training sets

When I plot the cost function (cost is plotted at each epoch) I get a “spiky” plot

I wonder: (1) Is this acceptable? (2) I am puzzled as to what could cause these spikes…

It’s great that you are taking the ideas here and really applying them. You always learn something when you do that. There is never any guarantee that convergence will be monotonic with a fixed learning rate. It looks like you end up with accurate solutions, so it seems acceptable in the overall picture.

Prof Ng tells us that the fact that ReLU is not differentiable at z = 0 is not a problem, but I remember once seeing someone do some experiments that seemed to suggest that might introduce some instability. Since you’re in the mode of experimenting and understanding things, it would be interesting to see if you get the same spiky behavior with a different activation function. E.g. try sigmoid or tanh and see if that makes a difference in that respect. Of course the compute cost of your training will be higher, at least on a per iteration basis, but maybe you can get away with fewer iterations. Seems like a worthy experiment to run. Let us know if that has any effect on the spiky behavior.


Thanks! The fact that ReLu is not differentiable at 0 shouldn’t be an issue because my implementation of the derivative takes care of that (as per professor Ng’s suggestion). Sigmoid doesn’t give the same behaviour, but converges much more slowly and with much less accuracy…

Interesting! Thanks for the followup. Well maybe you can get away with a higher learning rate with sigmoid because the behavior is smoother. The various hyperparameters are not necessarily statistically independent, right? Was the curve you got with sigmoid spiky at all? Did you also try Leaky ReLU?

Of course what we are doing here is the most straightforward version of Gradient Descent. Once we switch to using packages (TensorFlow) in Course 2, we’ll be using more sophisticated optimizers (e.g. momentum, RMSprop and Adam) which adapt the effective step size during training for better convergence.

Prof Ng is giving us the understanding of how iterative approximation methods work because that is valuable intuition to have, but there is just too much to cover in the first course to get into the more sophisticated methods. He’ll introduce us to a couple of ways to use adaptive learning rates in Course 2 also.

Yeah, thanks. I finished courses 2 and 3 and I just thought reimplementing everything from scratch might be a good idea to better assimilate the materials. I will try sigmoid with a higher alpha.

This is just a quick reaction as I’m not in front of a computer with Python, but is mean really what you want for cross entropy loss, as opposed to sum? Apologies in advance if I missed something obvious.

Ps: the upside of using packages is you can use a prebuilt loss function. The downside is it is easy to forget details of their implementation (hmmm, is this one computing the average for the sample?)

Normally the cost J is the average of the loss across the samples, but I’m not sure whether that is what’s being computed here. Then again, if you minimize \frac {1}{m} * J, you’ve also minimized J, since a constant factor does not move the minimum. The other approach would be to convert to using TF/Keras. Then you could just use their builtin categorical cross entropy loss in from_logits = True mode (feeding it the raw output-layer Z values rather than the softmax output), and you also don’t have to do the funky thing of adding 0.000000001 to avoid exploding and catching fire when you saturate softmax. They take care of that for you in a more sophisticated way and they also do all the magic adaptive learning rates.

But if the point of the exercise is to gain the most knowledge and intuition about what is going on “under the hood”, then maybe doing it directly in python and numpy is the most educational approach and switching to TF is the equivalent of dropping back 5 yards and punting on third and long. Depends on the goal I guess …
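For anyone staying in numpy, one common way to avoid adding 0.000000001 is to compute the loss from the pre-softmax values with the log-sum-exp trick. This is only a sketch of that idea, not a description of what TF/Keras does internally. The function name and the toy inputs are made up for illustration, and it assumes arrays laid out as (classes, m).

```python
import numpy as np

def softmax_cross_entropy_from_logits(z, y):
    """z: pre-softmax values, shape (classes, m); y: one-hot labels, same shape.

    Hypothetical helper: returns the loss averaged over the m samples.
    """
    # Subtract the per-column max so exp() cannot overflow.
    z_shifted = z - np.max(z, axis=0, keepdims=True)
    # log(softmax(z)) computed directly, so log(0) never occurs.
    log_probs = z_shifted - np.log(np.sum(np.exp(z_shifted), axis=0, keepdims=True))
    return -np.sum(y * log_probs) / z.shape[1]

# Toy inputs: the first column is heavily saturated, the second is uniform.
z = np.array([[1000.0, 0.0],
              [0.0, 0.0],
              [-1000.0, 0.0]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
loss = softmax_cross_entropy_from_logits(z, y)
```

The saturated column contributes zero loss and nothing blows up, which is the case the small added constant was trying to guard against.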

Yup. Looks like the formula for computing cross entropy doesn’t average. But the formula for optimizing log loss does. Been a while since I looked under the covers or hand wrote this formula :frowning_face:

I’ll try to play with the MNIST classifiers myself and see if I can replicate this behavior.

Yeah. I get a different cost but the behaviour of the function is the same (and yes, you are right, should have had the sum).
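To make the mean vs sum point concrete, here is a small sketch. It assumes y and output are shaped (classes, m), and the function name and toy numbers are invented for the example. With that layout np.mean divides by classes * m, while the usual cost divides the sum by m only, so the two differ by a constant factor equal to the number of classes.

```python
import numpy as np

def cross_entropy_cost(y, output, eps=1e-9):
    """Hypothetical helper: cross entropy summed over classes, averaged over samples."""
    m = y.shape[1]
    return -np.sum(y * np.log(output + eps)) / m

# Toy data: 3 classes, 4 samples, one-hot labels in the columns.
y = np.eye(3)[:, [0, 1, 2, 0]]
output = np.full((3, 4), 0.1)
output[y == 1] = 0.8  # each column now sums to 1

cost_sum = cross_entropy_cost(y, output)
cost_mean = -np.mean(y * np.log(output + 1e-9))  # the formula from the first post
n_classes = y.shape[0]
```

Because the factor is constant, the shape of the cost curve is the same either way, which matches what was observed above; only the scale of the cost (and of any gradient derived from it) changes.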

I cannot imagine either of you have time to waste on this but I suspect my implementation is flawed. I can’t make it work with sigmoid and I get suspiciously good results with ReLu (with the spiky cost function). At any rate, this is what I get: nn_from_scratch/Neural Network from Scratch.ipynb at main · bsassoli/nn_from_scratch · GitHub

I imagine this has already occurred to you, but the most obvious thing to check is that changing an activation function does not just involve forward propagation, right? You also have to make sure that the derivative of the new function is handled correctly during back prop …

Yes, that’s it: I just took a quick look and your implementation of the derivative of sigmoid is wrong. You have:

sigmoid(z) * sigmoid(1 - z)

The correct implementation is:

sigmoid(z) * (1 - sigmoid(z))

:man_facepalming: :laughing:
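A cheap way to catch this kind of slip is to compare the analytic derivative against a finite difference. The sketch below is one way to do it in numpy. The test points and step size are arbitrary choices, and the function names are just for illustration, not taken from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Corrected form: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-3.0, 3.0, 7)
h = 1e-5
# Central finite difference as an independent reference.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# The buggy version from the notebook, for comparison.
wrong = sigmoid(z) * sigmoid(1 - z)

max_err_correct = np.max(np.abs(sigmoid_derivative(z) - numeric))
max_err_wrong = np.max(np.abs(wrong - numeric))
```

The corrected version agrees with the finite difference to many decimal places, while the buggy one is visibly off (at z = 0 it gives about 0.37 instead of 0.25).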


Good find @paulinpaloalto

@bsassoli, after fixing that, try experimenting with different learning rates. I have your code running locally, and can produce curves like these…

[Images: three screenshots of cost curves from local runs, taken 2021-10-30]

I haven’t looked at the data management code yet, but ‘suspiciously good’ results can derive from leakage of training data into test. Another suggestion is to shrink the size of your test data at the beginning, until you’re reasonably confident the code is correct. Right now it takes ~25 minutes to complete a training run. Unless it suits your lifestyle to start a run and then go do something else and come back, try a smaller data set or fewer epochs to start.
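As a rough sketch of both suggestions, something like the following picks a small train and test subset from one shuffled index and checks that they share no examples. The function name, subset sizes, and the total of 70000 examples are assumptions for illustration and not taken from the notebook. Note that it only guards against index overlap, not against duplicate images within the data itself.

```python
import numpy as np

def small_disjoint_split(n_examples, n_train, n_test, seed=0):
    """Hypothetical helper: return non-overlapping train/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    train_idx = idx[:n_train]
    test_idx = idx[n_train:n_train + n_test]
    return train_idx, test_idx

train_idx, test_idx = small_disjoint_split(n_examples=70000, n_train=2000, n_test=500)

# Any shared index would mean training data leaking into the test subset.
overlap = np.intersect1d(train_idx, test_idx)
```

Indexing the image and label arrays with train_idx and test_idx then gives a run that finishes quickly enough to iterate on while debugging.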

Thanks both @ai_curious and @paulinpaloalto !
@ai_curious yeah I am using a 10k examples dataset for testing. The ‘suspiciously good’ results are bothering me too. I do get them with ReLu and not with sigmoid though…
I am implementing all your suggestions right now. :slight_smile:

@bsassoli are you still working on this? On my local copy I instrumented your forward prop method as follows:

    def forward(self, activation="relu"):
        self.layers["a0"] = self.X
        for l in range(1, self.L):
            print('about to perform forward prop for layer: ' + str(l))
            self.layers["z" + str(l)] = np.dot(params["w" + str(l)], 
                                               self.layers["a"+str(l-1)]) + params["b"+str(l)]
            print('about to perform activation ' + activation + ' on layer: ' + str(l))
            self.layers["a" + str(l)] = eval(activation)(self.layers["z"+str(l)])
            assert self.layers["a"+str(l)].shape == (self.architecture[l], self.m)

        print('about to perform forward prop for layer: ' + str(self.L-1))
        self.layers["z" + str(self.L-1)] = np.dot(params["w" + str(self.L-1)],
                                                  self.layers["a"+str(self.L-2)]) + params["b"+str(self.L-1)]
        print('about to perform activation softmax on layer: ' + str(self.L-1))
        self.layers["a"+str(self.L-1)] = softmax(self.layers["z"+str(self.L-1)])

which resulted in the following trace:

about to perform forward prop for layer: 1
about to perform activation relu on layer: 1
about to perform forward prop for layer: 2
about to perform activation relu on layer: 2
about to perform forward prop for layer: 3
about to perform activation relu on layer: 3
about to perform forward prop for layer: 3
about to perform activation softmax on layer: 3
Epoch:   0 | Cost: 139474.886

Is that what you intended?

If you run with that instrumentation in place, you’ll also expose another sneaky bug. Namely, the forward function is using the local parameter variable activation in all invocations. Meaning it’s always running relu, and never sigmoid.

Try using this instead:

            print('about to perform activation ' + self.activation + ' on layer: ' + str(l))
            self.layers["a" + str(l)] = eval(self.activation)(self.layers["z"+str(l)])

self.activation instead of just activation.

It is used this way in back propagation (good news) but this means the forward and backward were mismatched when self.activation == sigmoid (not good news). I didn’t quantify the impact, but it might account for the low accuracy.
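One way to make this kind of forward/backward mismatch harder to introduce is to look up the activation and its derivative together from a single table, instead of calling eval on a string in two places. This is just a sketch of that idea. ACTIVATIONS and get_activation are names I made up, not part of the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Uses 0 as the derivative at z == 0.
    return (z > 0).astype(z.dtype)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Each name maps to a (function, derivative) pair so the two cannot drift apart.
ACTIVATIONS = {
    "relu": (relu, relu_derivative),
    "sigmoid": (sigmoid, sigmoid_derivative),
}

def get_activation(name):
    try:
        return ACTIVATIONS[name]
    except KeyError:
        raise ValueError(f"unknown activation: {name!r}") from None

g, g_prime = get_activation("sigmoid")
z = np.array([-1.0, 0.0, 2.0])
a = g(z)          # used in forward prop
da = g_prime(z)   # used in back prop, guaranteed to match g
```

A misspelled activation name then fails loudly with a ValueError rather than silently falling back to something else.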

My Spidey-Sense is tingling over the other parts of the backpropagation function, but I haven’t had time to analyze it. Next time.

Thank you so much. Yes, I had actually seen the bug and fixed it but didn’t push the commit (I will do so shortly). I have tried implementing leaky ReLU and tanh. And I tweaked the code so now it’s working better. But I am still perplexed by the “goodness” of the model (which now overfits… but with ReLU and sigmoid it no longer exhibits the spiky behaviour it did earlier).

I think so, depending on the number of hidden layers. That should be 2 hidden layers and 1 output layer, right?

Maybe review the lecture video Forward Propagation in a Deep Network up to about the 2:30 mark. Here is my attempt to capture the notation from that whiteboard…

a^{[0]} = x
Z^{[1]} = W^{[1]} * a^{[0]} + b^{[1]}
a^{[1]} = relu(Z^{[1]})
Z^{[2]} = W^{[2]} * a^{[1]} + b^{[2]}
a^{[2]} = relu(Z^{[2]})
Z^{[3]} = W^{[3]} * a^{[2]} + b^{[3]}
a^{[3]} = relu(Z^{[3]})
Z^{[4]} = W^{[4]} * a^{[3]} + b^{[4]}
\hat{y} = softmax(Z^{[4]})

assuming relu for the hidden layer activation and softmax for the output layer.

For two hidden layers, that reduces to…
a^{[0]} = x
Z^{[1]} = W^{[1]} * a^{[0]} + b^{[1]}
a^{[1]} = relu(Z^{[1]})
Z^{[2]} = W^{[2]} * a^{[1]} + b^{[2]}
a^{[2]} = relu(Z^{[2]})
Z^{[3]} = W^{[3]} * a^{[2]} + b^{[3]}
\hat{y} = softmax(Z^{[3]})

Using the labels I inserted into your code that would be equivalent to…
forward prop for layer: 1
activation relu on layer: 1

forward prop for layer: 2
activation relu on layer: 2

forward prop for layer: 3
activation softmax on layer: 3

I think in the current forward function there is an extra ‘forward prop for layer: 3’ and an extra ‘activation relu on layer 3’
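Here is a standalone sketch of what the loop bounds would look like without the extra pass. It is not a drop-in patch for the notebook: it is written as a plain function instead of a method, it assumes n_layers counts the input layer (as self.L appears to, judging by the trace), and the architecture and random weights are purely illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def forward(X, params, n_layers, hidden_activation=relu):
    layers = {"a0": X}
    trace = []
    # Hidden layers only: 1 .. n_layers - 2 (the output layer is handled below).
    for l in range(1, n_layers - 1):
        layers["z" + str(l)] = np.dot(params["w" + str(l)],
                                      layers["a" + str(l - 1)]) + params["b" + str(l)]
        layers["a" + str(l)] = hidden_activation(layers["z" + str(l)])
        trace.append((l, hidden_activation.__name__))
    # Output layer: linear step followed by softmax, performed exactly once.
    out = n_layers - 1
    layers["z" + str(out)] = np.dot(params["w" + str(out)],
                                    layers["a" + str(out - 1)]) + params["b" + str(out)]
    layers["a" + str(out)] = softmax(layers["z" + str(out)])
    trace.append((out, "softmax"))
    return layers, trace

# Illustrative network: input of size 4, two hidden layers, 2 output classes.
architecture = [4, 5, 3, 2]
rng = np.random.default_rng(0)
params = {}
for l in range(1, len(architecture)):
    params["w" + str(l)] = 0.1 * rng.standard_normal((architecture[l], architecture[l - 1]))
    params["b" + str(l)] = np.zeros((architecture[l], 1))

X = rng.standard_normal((4, 6))  # 6 samples stored as columns
layers, trace = forward(X, params, n_layers=len(architecture))
```

The trace list plays the role of the print statements: with two hidden layers it records relu on layers 1 and 2 and softmax on layer 3, with no repeated layer.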

Let me know what you think

Great initiative :slight_smile: Doing it from scratch by yourself is the best way to learn, in my opinion. Be sure to compute the derivatives of tanh and leaky relu correctly in your backprop method.
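For reference, a minimal sketch of those two derivatives with a finite-difference check is below. The leak slope alpha=0.01 is just a common default I picked, the helper names are my own, and the test points deliberately avoid z = 0 where leaky ReLU has a kink.

```python
import numpy as np

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    # Slope 1 for positive inputs, alpha otherwise.
    return np.where(z > 0, 1.0, alpha)

def numeric_derivative(f, z, h=1e-5):
    # Central finite difference used as an independent reference.
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-2.0, -0.5, 0.5, 2.0])
tanh_err = np.max(np.abs(tanh_derivative(z) - numeric_derivative(np.tanh, z)))
leaky_err = np.max(np.abs(leaky_relu_derivative(z) - numeric_derivative(leaky_relu, z)))
```

If either error comes out large after a code change, the derivative and the activation no longer match.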
