Getting errors in backpropagation when creating from scratch

I am getting an error in backpropagation at the line dz2 = y - (a2):
ValueError: Length of values (10) does not match length of index (42000)
But when I try the same subtraction with random NumPy arrays of the same shapes,
I don't have any problems.

import pandas as pd
from pandas import DataFrame
import numpy as np

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

y = train_data["label"]
x = train_data.drop("label",axis=1)

train_data = np.array(train_data)

#create a neural network with 3 layers - 1 input, 1 hidden and 1 output
input_layer = 784
layer_1 = 10
output_layer = 10
weights1 = np.random.rand(input_layer,layer_1)
bias1 = np.random.rand(layer_1)
weights2 = np.random.rand(layer_1,output_layer)
bias2 = np.random.rand(output_layer)

def relu(x):
    return np.maximum(x,0)

#forward propagation 

z1 = np.dot(x,weights1) + bias1
a1 = relu(z1)
z2 = np.dot(a1,weights2) + bias2
a2 = relu(z2)

a2 = a2.T
dz2 = y - (a2)
dw2 = (1/m)*dz2*a1

What is the shape of y? The error implies that the shape of y and a2 do not match.


The shape of y is (42000,) and the shape of a2 is (10,42000).
When I tried print((np.random.randn(42000,)-np.random.randn(42000,10).T)) it didn't have any problems.


Can you do print(y.shape) and verify the shape?

Note that there is a difference between (42000, 1) and (42000, ) for the shape. The first one, (42000, 1), has 42000 rows and 1 column. The second one, (42000, ) can be thought of as a 1-dimensional vector of size 42000.
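To make the shape difference concrete, here is a small NumPy-only sketch. The size m = 6 is just a stand-in for the 42000 samples, and the array names are made up for illustration.

```python
import numpy as np

m = 6  # small stand-in for the 42000 samples

y_1d = np.zeros(m)          # shape (m,)   : 1-dimensional vector
y_col = np.zeros((m, 1))    # shape (m, 1) : m rows, 1 column
a2 = np.ones((m, 10))       # shape (m, 10)

# (m, 1) against (m, 10): the single column is stretched across all 10 columns.
print((y_col - a2).shape)

# (m,) against (10, m): the trailing axes match, so y is stretched across the 10 rows.
print((y_1d - a2.T).shape)

# (m,) against (m, 10): the trailing axes are m and 10, which do not match.
mismatch_raised = False
try:
    y_1d - a2
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Note that the cases that do broadcast only show that the shapes are compatible. They do not mean the result is a meaningful gradient.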


With that said, I’m not sure that your backpropagation equation is correct. Which loss are you using? Where did you get the dz2 = y - a2.T equation?


Maybe I’m just missing something here, but it seems to me that the first thing to do here is explain the meaning of your data, both the x and y values. What is the result you want to generate here? Note that the output layer of your network is 10 units with a ReLU activation. So you’ll get 10 non-negative numbers (between 0 and +\infty) as the prediction output of your network. But your y values appear to be only 1 value per sample.

So what is going on here? Is this a regression problem with 10 numeric outputs? Or is it a classification problem with 10 output classes? If the former, then how does that map to the y values that you are starting with? If the latter, then I assume that the y values are just the categorical representation of the 10 output classes.

Of course whether this is a regression problem or a classification problem also is relevant to figuring out what the loss function needs to be. :nerd_face:

But it would be nice to clearly understand all that before we go further here.


print(y.shape) gives (42000,)
and print(a2.shape) gives (42000,10)


Well, currently I am trying to implement this using only NumPy, so for now I am just trying to get the shapes to match for broadcasting. The equation dz2 = y - a2.T is supposed to be dz2 = a2 - y, the one used by Andrew. As for the loss, I will use the cross-entropy log loss, the same as Andrew uses in the course.


I am using the MNIST dataset to classify digits (0-9). Each column of x holds the value of one pixel in the image (28*28 = 784 pixels), and each row of x is one of the 42000 images. I have removed the label column and assigned it to y, which contains the digit corresponding to each row.
I used 10 units in the output layer so that I can classify the 10 digits: the unit with the maximum value gives the predicted digit.

It's a classification problem.

I hope this answers your queries. Anything I should add?


OK yea the equation appears to be incorrect, but I am assuming you changed it since you’re just trying to debug what’s wrong? I think if you’re just trying to figure out why your broadcasting isn’t working, you can try to remove the transpose from a2 (the “.T”) and that might just fix your issue.

@paulinpaloalto’s explanation got to the root of the problem. I think the problem here is that there are multiple types of classification, and you need to understand which one you’re using. Either way, the shape of y and your model output must match, and you shouldn’t rely on broadcasting to make them match.

If your label, y, is only one number (I am assuming from 0 to 9?), the output of your model needs to be one number as well. The loss in this case would be SparseCategoricalCrossentropy, and you can do some research on that. I don’t know exactly the back propagation formulas of that off the top of my head, and you might need to look it up.

On the other hand, you can convert your label, y, to 10 numbers (of 0s and 1s) using something called one-hot encoding. In that case, the output of your model can be 10 numbers as well, and you will likely want to apply a softmax activation function in the end (rather than relu) to convert those numbers to probabilities. The loss would be CategoricalCrossentropy, and the associated back propagation formula may be different as well.
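As a rough illustration of the one-hot conversion in plain NumPy, here is a minimal sketch. The helper name one_hot and the sample labels are hypothetical, and it assumes the labels are integers from 0 to 9.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a 0/1 matrix of shape (m, num_classes)."""
    labels = np.asarray(labels, dtype=int)   # also accepts a pandas Series or a list
    encoded = np.zeros((labels.size, num_classes))
    # Put a 1 in the column given by each row's label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y_example = np.array([3, 0, 9])   # three made-up digit labels
Y = one_hot(y_example)
print(Y.shape)
print(Y[0])
```

With the labels in this form, y has the same (samples, 10) shape as a 10-unit model output, so no broadcasting is needed to compare them.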

If you’re looking to implement back propagation for classification just for fun and get a better understanding, I would recommend using a dataset that just does a binary classification (with just 2 possible outputs, rather than 10 possible outputs), which would likely make the back propagation formulas easier to deal with.


Now that we know that it really is a 10 class classification, then the output of the model that you feed to the loss function always needs to be a 10 neuron softmax output, because that’s how cross entropy loss works. You then have to add a later “predict” step that just does “argmax” on the softmax output to get the categorical class. As you say, you can choose to leave the y values in categorical (one value) form and then use the “sparse” version of the TF loss function. That has the internal logic to do the one hot conversion and just saves you the work of writing that code (even though it’s just one TF function). Or if you prefer, you can “show your work” and manually do the one hot conversion in your code and then use the plain vanilla version of the cross entropy multiclass loss function.
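For a NumPy-only version of the softmax output and the later "predict" step, something along these lines might work. The function names softmax and predict are placeholders, the input values are invented, and the layout assumes one sample per row.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax for z of shape (m, classes)."""
    # Subtracting the row maximum keeps np.exp from overflowing
    # and does not change the result.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict(probs):
    """Pick the class with the highest probability for each sample."""
    return np.argmax(probs, axis=1)

# Two made-up samples with 3 classes each (MNIST would have 10 columns).
z2 = np.array([[1.0, 2.0, 5.0],
               [4.0, 0.0, 0.5]])
a2 = softmax(z2)
print(a2.sum(axis=1))   # each row sums to 1
print(predict(a2))      # index of the largest entry in each row
```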

But since your goal is to do this all directly in numpy, you’ll need to write the softmax function and manually implement the “10 way” version of cross entropy loss and also write the logic to do the “one hot” conversion on your y values. Well, I guess you could write the cross entropy logic with a for loop instead of vectorized and use your y values to index the \hat{y} vector, but that will be a lot more expensive in terms of cpu time. It’s a much better strategy to just write the one hot conversion as a “preprocessing” step. That might end up being a loop, but the point is you only do it once. The processing of the cost function happens a lot more frequently than that. Although another point to consider is that you don’t actually need to compute the J value on every iteration of training: you really only need the gradients of J to implement back propagation and those are just formulas.
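Here is a sketch of what the loss and the output-layer gradients could look like in NumPy, assuming samples are stored in rows and the output layer is a softmax. All names are hypothetical and the numbers are toy values. It also shows that indexing \hat{y} with the categorical labels can be done with array indexing rather than a for loop.

```python
import numpy as np

def cross_entropy(probs, Y_onehot, eps=1e-12):
    """Mean multiclass cross-entropy; probs and Y_onehot are both (m, classes)."""
    m = probs.shape[0]
    return -np.sum(Y_onehot * np.log(probs + eps)) / m

def cross_entropy_from_labels(probs, labels, eps=1e-12):
    """Same loss, picking the true-class probability by index (no loop needed)."""
    m = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(m), labels] + eps))

# Toy softmax outputs for m = 2 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
Y_onehot = np.eye(3)[labels]

loss_a = cross_entropy(probs, Y_onehot)
loss_b = cross_entropy_from_labels(probs, labels)
print(loss_a, loss_b)   # the two versions agree

# For softmax followed by cross-entropy, the gradient with respect to z2
# is simply the prediction minus the one-hot label.
dz2 = probs - Y_onehot                  # shape (m, classes)

# a1 would be the hidden-layer activations, shape (m, hidden); random stand-in here.
a1 = np.random.rand(2, 4)
dw2 = a1.T @ dz2 / probs.shape[0]       # shape (hidden, classes)
db2 = dz2.mean(axis=0)                  # shape (classes,)
```

Note that dw2 uses a matrix product with a1.T rather than an elementwise product, which is what makes its shape match weights2.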


Oh yes, my mistake. Thanks for pointing that out @paulinpaloalto. For SparseCategoricalCrossentropy, the y should be one number, but the model output should be 10 numbers (of probabilities, using softmax too).


When I remove the .T I get ValueError: operands could not be broadcast together with shapes (42000,) (42000,10). But the question I still can't answer is why I get the error with the .T in place, but not when I subtract random arrays of the same shapes. I also don't understand what the error message ValueError: Length of values (10) does not match length of index (42000) means.


Can you tell me where you got your train.csv and test.csv? If so, I can run your code and try to figure out that broadcasting issue.


OK I did some debugging and figured it out.

The reason it didn’t work is that y is a pandas Series, whereas a2 is a NumPy array. If you do y = np.array(y) before the line dz2 = y - (a2), then it won’t throw an error.

With that said, that is not the correct equation and you shouldn’t use it. Just following up to figure out the error behind the broadcasting issue.
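A small reproduction of the Series versus array difference, with m = 6 standing in for 42000 and made-up variable names. My understanding is that NumPy computes the broadcast (10, m) result and pandas then tries to wrap it back into a Series using y's index of length m, which is where the "Length of values (10)" message comes from. The exact wording may vary between pandas versions.

```python
import numpy as np
import pandas as pd

m = 6  # small stand-in for the 42000 samples

y_series = pd.Series(np.zeros(m))   # like y = train_data["label"]
a2_t = np.ones((10, m))             # like a2.T, shape (10, m)

# Series minus a 2-D array: pandas cannot fit the result back into a Series.
pandas_error = None
try:
    y_series - a2_t
except ValueError as err:
    pandas_error = str(err)
print(pandas_error)

# Converting the Series to a plain NumPy array first lets ordinary
# broadcasting apply: (m,) against (10, m) gives (10, m).
dz2 = y_series.to_numpy() - a2_t
print(dz2.shape)
```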


wow thanks a lot
