Gradient Descent on m Examples - Neural Networks Basics | Coursera

Vimbi_Viswan · January 16, 2022, 8:30am

In the algorithm, the first step is to find z(i) = w(transpose) X + b.
What is the difference between the X denoted here and x1 and x2 used later in the algorithm?

Is it that x1 and x2 are two feature vectors? Then what is the X used to compute z(i)?

Rashmi · January 16, 2022, 11:57am

Hi @Vimbi_Viswan, in the given video, Prof Ng explains how the ML model works: X is the given input, and x1, x2 and so on up to xn are the individual input features that, along with the parameters (w and b), are used to compute the output z. The prediction that previous lectures denote as ŷ is then obtained by applying the sigmoid function to z.

You can get a better idea through this link: Coursera | Online Courses & Credentials From Top Educators. Join for Free | Coursera

Thanks.

paulinpaloalto · January 16, 2022, 5:48pm

The key is to understand the notation that Prof Ng uses. When he uses lowercase x with no subscripts, he means one input “sample” vector. So x is a column vector with dimensions n_x × 1, where n_x is the number of input features in each sample. Then the components of the x vector are written as x_1, x_2, …, x_{n_x}.
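To make the single-sample notation concrete, here is a minimal NumPy sketch. The values of n_x, x, w and b are made up purely for illustration; only the shapes follow the convention described above.

```python
import numpy as np

n_x = 3                                 # number of features (illustrative value)
x = np.array([[0.5], [-1.0], [2.0]])    # one sample: column vector, shape (n_x, 1)
w = np.array([[0.1], [0.2], [0.3]])     # weight vector, same shape as x
b = 0.5                                 # scalar bias

# x_1, x_2, x_3 are simply the entries of the column vector x
x_1, x_2, x_3 = x[:, 0]

# z = w^T x + b: (1, n_x) @ (n_x, 1) -> (1, 1)
z = w.T @ x + b
print(x.shape, z.shape)
```

Here z comes out as a 1 × 1 array, i.e. a single number for a single sample.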

Then when he wants to talk about multiple input samples at once, he uses capital X to denote a matrix where each column is one sample vector. There are m samples, so the dimensions of X are n_x × m. The advantage of using multiple samples at once in the matrix X is that we can take advantage of the “vectorization” instructions of the CPU to make the computations as efficient as possible. So the way to write the linear activation calculation for Logistic Regression is this:

Z = w^T \cdot X + b

Where w is the weight vector and is also an n_x × 1 column vector. Note that when we get to Week 3 and “real” neural networks, the weight coefficients will become a matrix and Prof Ng chooses to orient that matrix such that the transpose is no longer necessary and the “linear activation” formula becomes this for the general layer l:

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}

For the first hidden layer, the notational convention is A^{[0]} = X, so the formula for the first layer really ends up being this:

Z^{[1]} = W^{[1]} \cdot X + b^{[1]}
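As a rough illustration of why stacking the samples as columns of X pays off, the sketch below computes Z = w^T X + b both one sample at a time and as a single matrix product, then checks that the two agree. The sizes, the random values and the layer width n_1 are assumptions for the example, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5                        # illustrative sizes: 3 features, 5 samples

X = rng.standard_normal((n_x, m))    # each column is one sample x^(i)
w = rng.standard_normal((n_x, 1))    # weight column vector
b = 0.1                              # scalar bias, broadcast across all m columns

# Loop version: z^(i) = w^T x^(i) + b for each column of X
Z_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, i:i + 1]              # slicing keeps the (n_x, 1) column shape
    Z_loop[0, i] = (w.T @ x_i + b).item()

# Vectorized version: (1, n_x) @ (n_x, m) -> (1, m)
Z = w.T @ X + b
assert np.allclose(Z, Z_loop)

# First hidden layer with n_1 units (hypothetical size): no transpose on W1
n_1 = 4
W1 = rng.standard_normal((n_1, n_x))
b1 = np.zeros((n_1, 1))              # one bias per unit, broadcast across columns
Z1 = W1 @ X + b1                     # (n_1, n_x) @ (n_x, m) -> (n_1, m)
print(Z.shape, Z1.shape)
```

The loop and the matrix product give the same numbers; the vectorized form just hands the whole computation to NumPy in one call.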

Vimbi_Viswan · January 17, 2022, 6:28am

Dear Sir

Thanks indeed for the timely response. It is much clearer now.

Regards

Vimbi_Viswan · January 17, 2022, 6:27am

Thanks indeed for your timely response

frankpisu · January 18, 2022, 10:44am

Just wanted to get some clarification on the notation used by Prof. Ng in the video “Computing a Neural Network’s Output”. Here we are in the context of neural networks, and I think he should not have used the transpose operation, since each row in the weight matrix W refers to the weights between one unit in the l-th layer (indicated by the specific row) and each input (or each unit in the previous layer).

Concretely, w_1^{[1]} should be a 1×3 vector containing the weights between unit 1 of layer 1 and the three inputs. x is the 3×1 feature vector; the dimensions match and we do not need the transpose.

It seems like there is a mixture of contexts. This notation would be correct when talking about logistic regression (and I get that Prof. Ng is exploiting the fact that each unit in the layer is a logreg).

Am I wrong?

Rashmi · January 18, 2022, 12:45pm

Hi @frankpisu, please have a look at what @paulinpaloalto sir has explained in his comment above; it should give you the clarification you need. Thanks!

paulinpaloalto · January 18, 2022, 3:59pm

Hi, Francesco.

Yes, the notation looks a little odd at first glance, but (as you say) he is starting from exactly how he writes the Logistic Regression expressions. In that case, he uses his standard convention that vectors are column vectors, which is why the transpose is required for the individual weight vectors in that case. So he starts by treating each node in the layer as a single LR instance. But then at 4:00 into the lecture he shows that now that the w_i^{[l]T} vectors are row vectors, he can stack them as rows to get the W^{[l]} matrix and the transpose is no longer required on the whole matrix. So he’s just being super precise about preserving the way he writes the LR expressions and then preserves the “row-wise” orientation of those transposed weight vectors to get the simpler way to write the W^{[l]} matrices for full Neural Nets.
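The stacking argument can be checked numerically. In this sketch, three inputs and four units are assumed sizes, and the names w_cols and z_per_unit are hypothetical. Each unit has its own column weight vector w_i, and the rows of W are the transposes w_i^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_1 = 3, 4                                   # assumed sizes for illustration

x = rng.standard_normal((n_x, 1))                 # one input sample, column vector
w_cols = [rng.standard_normal((n_x, 1)) for _ in range(n_1)]   # one w_i per unit
b = rng.standard_normal((n_1, 1))                 # one bias per unit

# Logistic-regression style, unit by unit: z_i = w_i^T x + b_i
z_per_unit = np.vstack([w_i.T @ x + b[i] for i, w_i in enumerate(w_cols)])

# Stack the row vectors w_i^T to form W with shape (n_1, n_x)
W = np.vstack([w_i.T for w_i in w_cols])

# Matrix form: no transpose needed on W
z = W @ x + b
assert np.allclose(z, z_per_unit)
print(W.shape, z.shape)
```

Both routes produce the same n_1 × 1 vector, which is the point of keeping the transposed weight vectors as the rows of W.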
