I am getting an error in the backpropagation step, on the line dz2 = y - (a2):
ValueError: Length of values (10) does not match length of index (42000)
But when I try print((np.random.randn(42000,)-np.random.randn(42000,10).T))
I don't have any problems.

import pandas as pd
from pandas import DataFrame
import numpy as np
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
y = train_data["label"]
x = train_data.drop("label",axis=1)
train_data = np.array(train_data)
#create a neural network with 3 layers in total - 1 input, 1 hidden and 1 output
input_layer = 784
layer_1 = 10
output_layer = 10
np.random.seed(0)
weights1 = np.random.rand(input_layer,layer_1)
bias1 = np.random.rand(layer_1)
weights2 = np.random.rand(layer_1,output_layer)
bias2 = np.random.rand(output_layer)
def relu(x):
    return np.maximum(x,0)
#forward propagation
z1 = np.dot(x,weights1) + bias1
a1 = relu(z1)
z2 = np.dot(z1,weights2) + bias2
a2 = relu(z2)
#backward_propagation
a2 = a2.T
print(a2.shape)
dz2 = y - (a2)
dw2 = (1/m)*dz2*a1

The shape of y is (42000,) and the shape of a2 (after the transpose) is (10,42000).
When I tried print((np.random.randn(42000,)-np.random.randn(42000,10).T)) it didn't have any problems.

Note that there is a difference between (42000, 1) and (42000, ) for the shape. The first one, (42000, 1), has 42000 rows and 1 column. The second one, (42000, ) can be thought of as a 1-dimensional vector of size 42000.
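A small sketch of how those two shapes behave under numpy broadcasting may help here. The names n, y_flat and y_col are made up for illustration, and n is a small stand-in for the 42000 samples:

```python
import numpy as np

n = 6                          # stand-in for the 42000 samples
y_flat = np.zeros(n)           # shape (n,): a 1-dimensional vector
y_col = np.zeros((n, 1))       # shape (n, 1): n rows, 1 column
a2 = np.ones((n, 10))          # shape (n, 10): one row of 10 outputs per sample

# A (n,) vector is matched against the LAST axis, so it fits (10, n)...
diff_t = y_flat - a2.T         # shape (10, n)

# ...while a (n, 1) column is stretched across the 10 columns of (n, 10).
diff_col = y_col - a2          # shape (n, 10)

# (n,) against (n, 10) fails, because the last axes are n and 10.
try:
    y_flat - a2
    flat_ok = True
except ValueError:
    flat_ok = False

print(diff_t.shape, diff_col.shape, flat_ok)
```

So the subtraction with the transpose is legal in plain numpy, which is consistent with the random-array test in the question running without an error.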

Maybe I'm just missing something here, but it seems to me that the first thing to do is explain the meaning of your data, both the x and y values. What is the result you want to generate here? Note that the output layer of your network is 10 units with a ReLU activation. So you'll get 10 non-negative numbers (between 0 and $+\infty$) as the prediction output of your network. But your y values appear to be only 1 value per sample.

So what is going on here? Is this a regression problem with 10 numeric outputs? Or is it a classification problem with 10 output classes? If the former, then how does that map to the y values that you are starting with? If the latter, then I assume that the y values are just the categorical representation of the 10 output classes.

Of course whether this is a regression problem or a classification problem also is relevant to figuring out what the loss function needs to be.

But it would be nice to clearly understand all that before we go further here.

Well, currently I am trying to implement this using only numpy, so for now I am just trying to match the shapes for broadcasting. The equation dz2 = y - a2.T is supposed to be dz2 = a2 - y, the one used by Andrew. As for the loss, I will use the cross-entropy log loss, the same as Andrew in the course.

I am using the MNIST dataset to classify digits (0-9). Each column of x holds the value of one pixel in the image (28*28 = 784 pixels), and each row of x is one of the 42000 images. Here I have removed the label and assigned it to y, which contains the digit corresponding to each row.
I used 10 units in the output layer so that I could classify the 10 digits: the unit with the maximum value would be the corresponding digit.

It's a classification problem.

I hope this answers your queries. Anything I should add?

OK, yeah, the equation appears to be incorrect, but I am assuming you changed it since you're just trying to debug what's wrong? I think if you're just trying to figure out why your broadcasting isn't working, you can try to remove the transpose from a2 (the ".T") and that might just fix your issue.

@paulinpaloalto's explanation got to the root of the problem. I think the problem here is that there are multiple types of classification, and you need to understand which one you're using. Either way, the shape of y and your model output must match, and you shouldn't rely on broadcasting to make them match.

If your label, y, is only one number (I am assuming from 0 to 9?), the output of your model needs to be one number as well. The loss in this case would be SparseCategoricalCrossentropy, and you can do some research on that. I don't know exactly the back propagation formulas of that off the top of my head, and you might need to look it up.

On the other hand, you can convert your label, y, to 10 numbers (of 0s and 1s) using something called one-hot encoding. In that case, the output of your model can be 10 numbers as well, and you will likely want to apply a softmax activation function at the end (rather than relu) to convert those numbers to probabilities. The loss would be CategoricalCrossentropy, and the associated back propagation formula may be different as well.
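As a rough numpy-only sketch of those two pieces: the helper names one_hot and softmax and the tiny y_small array below are made up for illustration, and none of it comes from the original code:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a 0/1 matrix of shape (m, num_classes)."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0   # one 1 per row, at the label's column
    return encoded

def softmax(z):
    """Row-wise softmax for z of shape (m, num_classes)."""
    shifted = z - z.max(axis=1, keepdims=True)      # subtract the row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

y_small = np.array([3, 0, 9])                       # three example digit labels
y_onehot = one_hot(y_small)                         # shape (3, 10)

z2 = np.random.default_rng(0).normal(size=(3, 10))  # made-up pre-activations
probs = softmax(z2)                                 # shape (3, 10), each row sums to 1

print(y_onehot.shape, probs.sum(axis=1))
```

With both arrays shaped (m, 10), the labels and the model output line up element by element, so nothing depends on broadcasting.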

If you're looking to implement back propagation for classification just for fun and to get a better understanding, I would recommend using a dataset that just does a binary classification (with just 2 possible outputs, rather than 10 possible outputs), which would likely make the back propagation formulas easier to deal with.

Now that we know that it really is a 10 class classification, the output of the model that you feed to the loss function always needs to be a 10 neuron softmax output, because that's how cross entropy loss works. You then have to add a later "predict" step that just does "argmax" on the softmax output to get the categorical class. As you say, you can choose to leave the y values in categorical (one value) form and then use the "sparse" version of the TF loss function. That has the internal logic to do the one hot conversion and just saves you the work of writing that code (even though it's just one TF function). Or if you prefer, you can "show your work" and manually do the one hot conversion in your code and then use the plain vanilla version of the cross entropy multiclass loss function.
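The "predict" step can be sketched in numpy like this. The probs array is invented for illustration and uses 3 classes instead of 10 to keep it short:

```python
import numpy as np

# Made-up softmax outputs: 2 samples, 3 classes, one row per sample.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])

# argmax along the class axis gives the index of the most probable class.
pred = np.argmax(probs, axis=1)

print(pred)
```

If the network output is stored with classes along the rows instead, shape (10, m), the same call would use axis=0.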

But since your goal is to do this all directly in numpy, you'll need to write the softmax function and manually implement the "10 way" version of cross entropy loss and also write the logic to do the "one hot" conversion on your y values. Well, I guess you could write the cross entropy logic with a for loop instead of vectorized and use your y values to index the $\hat{y}$ vector, but that will be a lot more expensive in terms of CPU time. It's a much better strategy to just write the one hot conversion as a "preprocessing" step. That might end up being a loop, but the point is you only do it once. The processing of the cost function happens a lot more frequently than that. Another point to consider is that you don't actually need to compute the J value on every iteration of training: you really only need the gradients of J to implement back propagation, and those are just formulas.
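Here is a minimal sketch of the loss in both forms, plus the output-layer gradient, assuming the samples are stored one per row. The function names, the eps guard and the toy numbers (3 classes instead of 10) are my own choices for illustration. Note that the label-indexed version below uses numpy integer indexing rather than a for loop, so it is a third option beside the two described above:

```python
import numpy as np

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Mean cross entropy from softmax outputs and one-hot labels, both (m, classes)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def cross_entropy_from_labels(probs, labels, eps=1e-12):
    """Same loss, but picking the true-class probability with the integer labels."""
    rows = np.arange(labels.size)
    return -np.mean(np.log(probs[rows, labels] + eps))

# Toy data: 2 samples, 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
y_onehot = np.eye(3)[labels]           # one-hot conversion done once, up front

loss_a = cross_entropy(probs, y_onehot)
loss_b = cross_entropy_from_labels(probs, labels)

# For softmax followed by cross entropy, the gradient with respect to z2
# is a2 - y per sample; the 1/m factor is applied later when forming dw2.
dz2 = probs - y_onehot                 # shape (m, classes)

print(loss_a, loss_b, dz2.shape)
```

Both loss functions return the same value, and dz2 has the same shape as the model output, which is the shape match the subtraction in the question was missing.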

Oh yes, my mistake. Thanks for pointing that out @paulinpaloalto. For SparseCategoricalCrossentropy, the y should be one number, but the model output should be 10 numbers (of probabilities, using softmax too).

When I remove the .T I get ValueError: operands could not be broadcast together with shapes (42000,) (42000,10). But the question I still can't answer is why I get the error when I have the .T, but not when I print random arrays with the same shapes. I also don't understand what the error message ValueError: Length of values (10) does not match length of index (42000) means.

The reason it didn't work is that y is a pandas Series, whereas a2 is a numpy array. If you do y = np.array(y) before the line dz2 = y - (a2), then it won't throw an error.
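A small sketch of that conversion, with made-up stand-ins for the real data (4 samples instead of 42000). As far as I understand it, pandas tries to wrap the result of the arithmetic back into a Series on the original 42000-entry index, which is where the "Length of values (10) does not match length of index (42000)" message comes from; treat that explanation as my reading rather than a documented guarantee:

```python
import numpy as np
import pandas as pd

y_series = pd.Series([3, 0, 9, 1])   # stand-in for train_data["label"], shape (4,)
a2_t = np.zeros((10, 4))             # stand-in for a2 after the transpose, shape (10, 4)

# Convert the Series to a plain numpy array first (np.array(y_series) also works).
y_arr = y_series.to_numpy()

# Now this is ordinary numpy broadcasting: (4,) against (10, 4) gives (10, 4).
dz2 = y_arr - a2_t

print(type(y_arr).__name__, dz2.shape)
```

This only gets the shapes to broadcast; it does not make y - a2 the right formula for the gradient.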

With that said, that is not the correct equation and you shouldn't use it. I was just following up to figure out the error behind the broadcasting issue.