Handwritten Image Recognition From Scratch


I have been successful with a 2 layer network, but whenever I increase the network size with additional layers, my predictor always predicts ‘1’.

I don’t know whether the mistake is coming from the forward prop or backward prop.

Please see my code for a 3 layer network:

def forward_prop(X_train, W1, b1, W2, b2, W3, b3):
    Z1 = np.dot(W1, X_train) + b1
    A1 = ReLU(Z1)

    Z2 = np.dot(W2, A1) + b2
    A2 = ReLU(Z2)

    Z3 = np.dot(W3, A2) + b3
    A3 = softmax2(Z3)

    return Z1, A1, Z2, A2, Z3, A3

def back_prop(X_train, Y_train, Z1, A1, Z2, A2, W1, W2, W3, A3):
    m = Y_train.size
    one_hot_Y = one_hot(Y_train)

    dZ3 = A3 - one_hot_Y
    dW3 = 1/m * np.dot(dZ3, A2.T)
    db3 = 1/m * np.sum(dZ3, axis=1, keepdims=True)

    dZ2 = np.dot(W3.T, dZ3) * deriv_ReLU(Z2)
    dW2 = 1/m * np.dot(dZ2, A1.T)
    db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)

    dZ1 = np.dot(W2.T, dZ2) * deriv_ReLU(Z1)
    dW1 = 1/m * np.dot(dZ1, X_train.T)
    db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)

    return dW1, db1, dW2, db2, dW3, db3


I do not know how to compute the cost function. If anyone can help, that would be great.

Hi, Matthew.

It’s great that you are doing this experiment! You always learn useful things when you apply what you’ve learned to a new problem.

I actually did a similar exercise using the MNIST dataset, but it was a couple of years back. I’ll try to remember how I approached it.

It turns out that softmax is the multiclass generalization of sigmoid and the math is very similar. To understand more and see the derivation of the cost functions, what better way to start than watching Prof Geoff Hinton’s lecture on that subject! :nerd_face: Please have a look and then we can discuss more if you like.
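To make the sigmoid connection concrete, here's a quick sketch you can run (the function names and the max-subtraction stability trick are just my choices, not anything from the course code). For two classes with logits (z, 0), the softmax probability of the first class works out to exactly sigmoid(z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max before exponentiating avoids overflow
    # without changing the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# softmax over logits (z, 0): first component is e^z / (e^z + 1),
# which is algebraically the same as 1 / (1 + e^-z) = sigmoid(z).
z = 1.7
print(softmax(np.array([z, 0.0]))[0])
print(sigmoid(z))
```

That's the sense in which softmax generalizes sigmoid: sigmoid is just the two-class case with one logit pinned to zero.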

One other general suggestion: using the “hard-coded” style of implementation that they show here in Course 2 is a bit limiting. Anytime you want to try something different, you have to change the code. We already have the fully general implementation from Course 1 Week 4. It’s a little more work to start with that, but it gives some great benefits: if you want to try a different number of layers, you just make a new layers_dims array and you’re good to go. You would just need to add the logic to include softmax as one of the activation choices and deal with the differences in the loss function. Just a thought.

Of course it’s a completely reasonable method to start by getting things to work with the “hard-coded” approach and then decide later whether to go for “full generality” — you’ll have all the component parts you need. And now that I think about it, it’s easier to debug with the hard-coded approach, since you don’t have all the layers of subroutines to deal with.

Before starting to debug the question of why your predictions are always 1, it makes sense to get the cost logic sorted. Then see where that gets you.
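On the cost itself: for softmax outputs the standard choice is the categorical cross-entropy, averaged over the m examples. Here's a minimal sketch, assuming A3 and one_hot_Y both have shape (num_classes, m) as in your code; the function name and the eps clipping constant are my own additions:

```python
import numpy as np

def compute_cost(A3, one_hot_Y, eps=1e-12):
    # Categorical cross-entropy for softmax outputs.
    # A3 and one_hot_Y both have shape (num_classes, m).
    m = one_hot_Y.shape[1]
    # Clip probabilities away from 0 so np.log never sees an exact zero.
    log_probs = np.log(np.clip(A3, eps, 1.0))
    # Only the log-probability of the true class in each column survives
    # the elementwise product with the one-hot labels.
    return -np.sum(one_hot_Y * log_probs) / m
```

This pairs with the dZ3 = A3 - one_hot_Y line in your back_prop: that simple expression is exactly the gradient of this cost through the softmax.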


Thank you very much for the advice. I was finally able to find the problem, which was caused during initialisation: my weights were far too small.

I have another question though:

def softmax(X):
    expo = np.exp(X)
    expo_sum = np.sum(np.exp(X))
    return expo / expo_sum

def softmax2(z):
    s = np.exp(z) / sum(np.exp(z))
    return s

softmax2 outperforms softmax. The only difference is sum vs np.sum. I was under the impression that np.sum was correct, but for some reason sum gives FAR greater accuracy when training.

My first thought was that “sum” must somehow be an alias for np.sum, but that's not quite it: Python's built-in sum() does work on a numpy array. It iterates over the first axis and adds the rows elementwise, which effectively gives you np.sum with axis = 0.

My guess is that the real issue is that you are being careless by not specifying the axis in your np.sum call. Without an axis, np.sum(np.exp(X)) collapses the whole array to a single scalar, which is definitely an error here: for softmax you want a separate sum for each column (each training example).

I suggest you print the intermediate values of np.sum(np.exp(Z)) and sum(np.exp(Z)) and compare them. If the two give different results, the obvious next step is to investigate why they are different.

Python is an interactive language. You don’t have to wonder what something might do. You can try it and see. Watch this:

import numpy as np

np.random.seed(42)  # makes the random values below reproducible
Z = np.random.randn(3, 4)
print(f"Z = {Z}")
Znpsum = np.sum(Z)
print(f"Znpsum = {Znpsum}")
Zsum = sum(Z)
print(f"Zsum = {Zsum}")
Znpsumaxis = np.sum(Z, axis=0, keepdims=True)
print(f"Znpsumaxis = {Znpsumaxis}")

Running that gives this result:

Z = [[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975]]
Znpsum = 3.5514636706048432
Zsum = [-0.20691361  0.17015879  1.76348366  1.82473483]
Znpsumaxis = [[-0.20691361  0.17015879  1.76348366  1.82473483]]

So you just got lucky: the built-in sum iterates over the first axis of the array, which is equivalent to np.sum with axis = 0, and that happens to be exactly what you want here. But I think you’d be better off writing it explicitly and also including the keepdims, just to be sure you get what you want. Notice that in the above output Zsum and Znpsumaxis are not the same: they have the same values, but the first is a 1D array and the second is a 2D array. It’s a bit subtle, but you have to train yourself to pay attention to the square brackets. Those are significant.
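To make that concrete, here's how I'd write the softmax with the axis and keepdims spelled out. The max-subtraction is an extra numerical-stability trick I've added, not something your original code had:

```python
import numpy as np

def softmax(Z):
    # Z has shape (num_classes, m): one column per training example.
    # Subtracting the column-wise max avoids overflow in np.exp
    # without changing the result.
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    expo = np.exp(shifted)
    # axis=0 sums over the classes within each column;
    # keepdims=True keeps the sum 2D so the division broadcasts
    # column by column, exactly as intended.
    return expo / np.sum(expo, axis=0, keepdims=True)
```

With the axis and keepdims written out, there's no guessing about what shape the intermediate sum has: every column of the output sums to 1.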