I am very confused about matrix dimensions. So confused that last night I dreamt of solving matrices.
In MLS, Andrew used m = number of samples (rows of X) and n = number of features (columns of X). But in DLS, he used n^{0} = number of samples (rows of X) and m = number of features (columns of X).
I have gone through the video on this topic multiple times, and so far I have learned that the vectorized dimensions should be:
W1 = (number of neurons of the current layer, number of rows of X/the input)
b1 = (number of neurons of the current layer, 1)
This gives Z1 and A1 = (number of neurons of the current layer, number of columns of X/the input)
W2 = (number of neurons of the current layer, number of neurons of the previous layer [this is the number of rows of W1])
b2 = (number of neurons of the current layer, 1)
Z2 and A2 = (number of neurons of the current layer, number of columns of A1)
and so on. Example: if X = (20,1), Y = (20,1), and the no. of neurons in the first hidden layer is 7, then W1 should be (7,20) and b1 = (7,1). This gives Z1 and A1 = (7,1).

First confusion: Is my above concept correct or not? The video link and a screenshot from that video are attached.

Second confusion: For binary classification problems, the number of neurons in the last layer should be 1, as we have only 0 or 1 as output. That makes W2 = (1, number of neurons of the previous layer). However, if we have a regression problem, the output can be any number, so should the vectorized AL (y^{hat}) be the same size as the actual output (y) or not? In the above example, Y is a (20,1) matrix and it is a regression problem, so should AL be the same size or not?

Kindly guide me about all this. I will be highly indebted to you.
Thanks in Advance.
Saif.

DLS came first and MLS was updated later, so there is a chance that Andrew's notation conventions changed in MLS. While taking DLS, I'd recommend not connecting it to MLS for that reason; instead, follow the conventions as given in DLS for the purposes of quizzes and assignments.

Additionally, going over the Standard notations for Deep Learning, found here, will help you better understand these concepts.

The shape of any W and any b has nothing to do with the # samples or # columns in X. (Shouldn't this be reasonable? Otherwise we couldn't feed in an arbitrary number of samples for prediction.)

X, A, and Z always share the same # columns. (A sample is always represented in the same column position; that's cool, isn't it?)

Can you work out all the shapes (of W1, b1, A1, Z1, W2, b2, A2, Z2) with the help of the example in my screenshot and my two points above?

For your 2nd question, I hope we can discuss it after we are done with the 1st one.
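To make these two points concrete, here is a minimal NumPy sketch of a two-layer forward pass in the DLS layout (rows = features, columns = samples). The layer sizes here are made up for illustration, not the ones from the screenshot:

```python
import numpy as np

np.random.seed(0)

n_x = 5          # number of input features (rows of X in the DLS convention)
n_1, n_2 = 4, 1  # neurons in the hidden layer and the output layer

# Point 1: parameter shapes depend only on the layer sizes, never on m
W1 = np.random.randn(n_1, n_x)   # (4, 5)
b1 = np.zeros((n_1, 1))          # (4, 1)
W2 = np.random.randn(n_2, n_1)   # (1, 4)
b2 = np.zeros((n_2, 1))          # (1, 1)

def forward(X):
    Z1 = W1 @ X + b1                # (n_1, m); b1 broadcasts across columns
    A1 = np.tanh(Z1)                # (n_1, m)
    Z2 = W2 @ A1 + b2               # (n_2, m)
    A2 = 1.0 / (1.0 + np.exp(-Z2))  # sigmoid, (n_2, m)
    return Z1, A1, Z2, A2

# Point 2: X, Z, A always share the same number of columns (m),
# and the very same parameters work for any m
for m in (1, 5, 20):
    X = np.random.randn(n_x, m)
    Z1, A1, Z2, A2 = forward(X)
    assert Z1.shape == A1.shape == (n_1, m)
    assert Z2.shape == A2.shape == (n_2, m)
```

Notice that the loop feeds in 1, 5, and 20 samples without ever touching W or b; that is exactly why their shapes cannot depend on the sample count.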

This seems reasonable, but in your screenshot (as well as mine), the shape of W does depend on the # samples or # rows in X. In your screenshot, the shape of X or a^{[0]} is (3,1) and W^{[1]} is (4,3). 4 is the number of neurons in the hidden layer, but 3 is the number of rows (samples) of X. Prof. Andrew has also written the size of X or a^{[0]} beneath it as (3,1): 3 rows and 1 column. Right? The same is true for W^{[2]}, which is (1,4); 4 is, again, the number of rows of the input (a^{[1]}). Even if we transpose W, it still depends on the number of rows of the input.

Confusion. Confusion. Confusion. Does the size of W depend on the rows (samples) or the columns (features) of the input? It seems reasonable for it to depend on the features (columns), but both screenshots (yours and mine) contradict that.

Regarding Z and A, I understood them, except for the last A (the output) in the regression problem.

Kindly correct me if I got that sentence wrong. Do you mean that the number of rows of the final output A^{[L]} should be the same as the number of rows of X or a^{[0]} (the initial input)?

Thank you, Raymond, for your time. I highly appreciate it.
So, the size of W1 depends on the number of rows of X, right?
And just one last clarification. Do "number of samples" and "number of examples" (m) represent the same thing (the number of rows)? And number of features = number of columns? Right?
After that, we can discuss my second confusion if your schedule allows it.

I am confused here. For binary classification, it is 1; but for regression, should it be 1, or equal to something else? The output in regression is not just 0 or 1 but many different numbers, so I think, in vectorized form, it should be equal to # samples = # examples = # data points = # any other name = m = # columns (in DLS). Am I right?

Oh, sorry for the vague notation. Here, it means the output of the last layer.

Yes, in DLS that is the way it works. But I think the fundamental thing that is confusing you is the orientation of X. As Raymond has described, each column of X is one “sample” input vector. It is a column vector with n_x elements, which is the number of input features. So that is the number of rows of X: the number of rows is the number of features, not the number of samples. It is the number of columns that is the number of samples, m, right? And as Raymond shows above, that is true for all the Z^{[l]} and A^{[l]} values at all layers: they are n^{[l]} x m.

In a regression problem, the output is still a single number, it’s just that it’s a continuous number and not a “yes/no” answer. Technically we could represent the output of a binary classifier with a single bit, but we end up using a float value in both the classification and regression cases. But the point is that n^{[L]} is one in either case.
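As a concrete check, here is a minimal NumPy sketch of Saif's earlier example (20 samples, 1 feature, 7 hidden neurons) rewritten in the DLS orientation, so the 20 samples become 20 columns. The tanh hidden activation and random values are just placeholders:

```python
import numpy as np

np.random.seed(1)

m = 20                    # number of samples / examples (columns in DLS)
n_x, n_1, n_L = 1, 7, 1   # 1 input feature, 7 hidden neurons, 1 output unit

X = np.random.randn(n_x, m)   # (1, 20)
Y = np.random.randn(n_L, m)   # (1, 20): continuous targets (regression)

W1 = np.random.randn(n_1, n_x)  # (7, 1): depends on features, not on m
b1 = np.zeros((n_1, 1))         # (7, 1)
W2 = np.random.randn(n_L, n_1)  # (1, 7)
b2 = np.zeros((n_L, 1))         # (1, 1)

A1 = np.tanh(W1 @ X + b1)  # (7, 20)
AL = W2 @ A1 + b2          # (1, 20): linear output activation for regression

# y_hat already matches Y: n^[L] = 1 row, m = 20 columns
assert AL.shape == Y.shape == (1, 20)
```

So for both the binary classifier and this regression network, the output layer has a single unit (one row); the 20 comes from the number of columns, exactly as in every other layer.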

Aha, now I get it. W depends on the number of features of the input. It is up to us whether to put the number of features in the rows or in the columns. In DLS, it is in the rows; in MLS, it is in the columns.
Got it, sir. Thanks a million. You have also clarified my confusion about the regression problem's output. Currently, I am facing some errors, but I will try to solve them myself.