Week 2 Community Contributions: Share Your Notes


This is not an official DeepLearning.AI or Coursera post, but I thought it might be helpful for the community to share their notes and questions.

The topic for this post is MLS Week 2, which covers multiple linear regression, vectors and matrices, and how to implement gradient descent with multiple input features.

Do you have any notes or major takeaways you would like to share? What about questions you still have regarding these topics?

Have I made any errors here?

Hi Jesse, if you don’t mind, I am going to be a little strict about the symbols…

  1. Your matrix: The lecture uses one-based indexing (from i = 1 to i = m) for m samples. In the labs, we mostly see zero-based indexing (i = 0 to i = m-1) for m samples. Also, we usually use n to denote the number of features. So, in zero-based indexing, an m \times n dataset would be represented by a matrix where the last element is x^{(m-1)}_{n-1}.

  2. Your top blue box: that x should be a vector.

  3. For the 2 vectors on the left & in their description: row vectors are horizontal and have the shape (1, something).

  4. For the 3rd equation in “vectorized”: by default, when you are referring to a single sample, \vec{x} is a row vector, and by default, \vec{w} is also a row vector, so in the 3rd equation, the transpose T is not needed.

  5. Bottom right corner, the subscript w and the x inside the brackets are vectors for both equations.

  6. To be precise, since you mentioned that you are taking the 0-th sample as an example, all x in the bottom 5 equations could carry the superscript (0).
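A quick NumPy sketch of the zero-based indexing convention above (the values in X are made up purely for illustration):

```python
import numpy as np

# An m x n training matrix in zero-based indexing (values are arbitrary)
m, n = 3, 2
X = np.arange(m * n).reshape(m, n)

first = X[0, 0]          # x^(0)_0: first sample, first feature
last = X[m - 1, n - 1]   # x^(m-1)_(n-1): last sample, last feature
print(X.shape)           # the size is still m x n, i.e. (3, 2)
```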

Thanks for the feedback! I made some edits:

My remaining question is about your point #4.

… so in the 3rd equation, the transpose T is not needed.

If we do vector multiplication without the dot product, shouldn’t we still have to do \vec{w}^{T}\times\vec{x}?

1 Like

Hey Jesse!

I meant that instead of your original \vec{w}^T\cdot\vec{x}+b, it should be \vec{w}\cdot\vec{x}+b to represent the dot product of \vec{w} and \vec{x}, because both of them are row vectors. Our discussion is based on the assumption that \vec{w} is a row vector.

If we did \vec{w}^T\vec{x}, the row vector \vec{w} would first be transposed into a column vector, and multiplying a column vector by a row vector results in a matrix, because the shape of a column vector is (n, 1) and that of a row vector is (1, n). So \vec{w}^T\vec{x} is actually a matrix multiplication that gives us a matrix of shape (n, n).

However, if we think the other way around and transpose not \vec{w} but \vec{x}, it’s different. For \vec{w}\vec{x}^T, we are doing a matrix multiplication of shapes (1, n) and (n, 1), which results in a matrix of shape (1, 1), or effectively a scalar that is the same as the result of the dot product between \vec{w} and \vec{x}.

You might choose between \vec{w}\cdot\vec{x} and \vec{w}\vec{x}^T, but I think our courses use the former when talking about only one sample. The former is a scalar, whereas the latter is a matrix containing one scalar.

For example, if \vec{w} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} and \vec{x} = \begin{bmatrix} 4 & 5 &6 \end{bmatrix},

\vec{w}\cdot\vec{x} = 32
\vec{w}\vec{x}^T = \begin{bmatrix} 32 \end{bmatrix}
\vec{w}^T\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \times \begin{bmatrix} 4 & 5 &6 \end{bmatrix} = \begin{bmatrix} 4 & 5 & 6 \\ 8 & 10 & 12 \\ 12 & 15 & 18 \end{bmatrix}
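Here is a small NumPy check of the three products above, using explicit (1, n) row vectors:

```python
import numpy as np

# Row vectors with explicit shape (1, 3), as in the example above
w = np.array([[1, 2, 3]])
x = np.array([[4, 5, 6]])

dot = np.dot(w.flatten(), x.flatten())  # scalar dot product: 32
wxT = w @ x.T                           # (1, 1) matrix: [[32]]
wTx = w.T @ x                           # (3, 3) matrix
```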


Please bear with me about the symbols again. I just want to make sure they are consistent throughout your notes:

  1. ending subscripts are n-1: for the 1st row of the matrix, and the 2 vectors on the left
  2. starting subscripts are 0: for the 2 column vectors in bottom right
  3. the size of the matrix is m \times n: for the label beneath the matrix, e.g. if m=3, the indices are 0, 1, 2 but the size is still 3. So in your grey box under the matrix, I would say the lengths are m and n instead.


Thanks Raymond,

I made those edits. I decided to move the formulas to a different note which I will add to this thread later. Thank you for all the help. I was more lost on indexing than I originally realized.


Very nice! We all took some time to get used to those things; after all, they were invented by somebody else.


Feature scaling

1 Like

Gradient descent

1 Like

I made some minor edits, adding some context to the different methods of scaling.

Corrected per @rmwkwok:

Hey Jesse, a mean normalized feature falls between -1 and 1, instead of 0 and 1, and it will have a mean of 0 too.
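As a small sketch of that correction (the sample values here are made up), mean normalization divides the mean-centered feature by its range, which keeps the result between -1 and 1 with a mean of 0:

```python
import numpy as np

# Mean normalization: (x - mean) / (max - min)
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0])  # made-up feature values
x_norm = (x - x.mean()) / (x.max() - x.min())

# x_norm is [-0.5, -0.25, 0.0, 0.25, 0.5]: within [-1, 1], mean 0
```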

Feature engineering

1 Like

One of my big realizations in Week 2 is that I can get wrapped up in equations and confuse myself about how the linear regression model behaves throughout the overall algorithm.

For example, the linear regression model (with one variable) is,

f_{w,b}(x) = wx + b
The model as it relates to predicting values from the training set is shown as,

f_{w,b}(x^{(i)}) = wx^{(i)} + b = \hat{y}^{(i)}
Realizing that i is synonymous with "something in the training set at index i" was big for me. This opened my mind to the concept of "something outside of the training set." Assuming that something can be either an x or a y, we can conclude that any x (feature input) outside the training set is a candidate for prediction input. A corresponding y value in this context is really a \hat{y} (model prediction).

For example, say our training set has the samples x = \begin{bmatrix}1\\ 2\\ 3\end{bmatrix}, the values for i (zero-indexed) are (0, 1, 2), and the corresponding values of x are x^{(0)}=1, x^{(1)}=2, x^{(2)}=3. So, looking at the available values of x in the training set, x=1.5 is an outsider and,

f_{w,b}(1.5) = w(1.5) + b = \hat{y}
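In code, the single-variable model and an "outsider" prediction might look like this (w and b here are made-up parameters, not trained values):

```python
# Hypothetical learned parameters (assumptions for illustration)
w, b = 2.0, 0.5

def f(x):
    """Single-variable linear regression model: f_{w,b}(x) = w*x + b."""
    return w * x + b

x_train = [1, 2, 3]   # x^(0), x^(1), x^(2) from the training set
y_hat = f(1.5)        # 1.5 is an outsider input; f returns a prediction
```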
This idea of outsider values carries over well to multiple linear regression, where the model is,

f_{\vec{w},b}(\vec{x}) = \vec{w}\cdot\vec{x} + b
The model as it relates to predicting values from the training set is shown as,

f_{\vec{w},b}(\vec{x}^{(i)}) = \vec{w}\cdot\vec{x}^{(i)} + b = \hat{y}^{(i)}
Now our example includes X as a matrix \begin{bmatrix}1 & 7\\ 2 & 8\\ 3 & 9\end{bmatrix} where values for i are still (0, 1, 2), but every reference to X^{(i)} is a row vector \vec{x}^{(i)}=\begin{bmatrix}x^{(i)}_{0} & x^{(i)}_{1}\end{bmatrix}. So, X^{(2)} = \vec{x}^{(2)} = \begin{bmatrix}3 & 9\end{bmatrix}. Therefore, f_{\vec{w},b}(\vec{x}^{(2)}) in the model is,

f_{\vec{w},b}(\vec{x}^{(2)})=\vec{w}\cdot\begin{bmatrix}3 & 9\end{bmatrix}+b

The above will give us the prediction \hat{y}^{(2)}, which aims to estimate the training set value y^{(2)}. What would outsider values be here? Well, any values for x_{0} and x_{1} not found in the training set.
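A vectorized NumPy sketch of the prediction above (the weights w and bias b are assumed values for illustration):

```python
import numpy as np

# Training inputs from the example; w and b are made-up parameters
X = np.array([[1, 7],
              [2, 8],
              [3, 9]])
w = np.array([0.5, 1.0])
b = 2.0

y_hat = X @ w + b   # one prediction per row: f(x^(i)) = w . x^(i) + b
```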


Another way to find hints for feature engineering is: for each feature x, make an x-y scatter plot and see whether the relationship is linear. If so, use the feature as is; otherwise it is curved, so try plotting y against x^2 instead, see if that is linear, and repeat with higher orders until it is.

The above is a very rough idea of how to find the right order. Obviously it works only when the relationship is representable by a power of x; if it is really a product of more than one feature, like your "area", then this method will fail. Although you can extend the method to multiple variables multiplied together, the number of trials increases exponentially, and it may not be worth the time.

So, to make good use of this approach, it’s not about brute-forcing through all the possibilities; hopefully your experience will accumulate throughout this kind of “exercise” and guide you to find the right combination in fewer trails.
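A rough sketch of that trial procedure, using the correlation coefficient as a stand-in for eyeballing a scatter plot (the data here is synthetic, with a quadratic relationship built in):

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = 3 * x**2 + 5          # suppose the true relationship is quadratic

# Try successive powers of x and see which is most linear with y
for p in (1, 2, 3):
    r = np.corrcoef(x**p, y)[0, 1]
    print(f"power {p}: correlation with y = {r:.4f}")
```

The power whose correlation is closest to 1 is the most linear candidate; here p = 2 wins, matching the relationship we built in.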

As you said before, we are all training our own model in our brain to do/learn ML better.

Nice observation about “insider” & “outsider” samples. We have this concept in ML too. When we train a model, we want to validate whether our choice of model is good. To do so, we split our dataset into the training set and the CV set. We train the model on the training set, which is the insider samples; then we validate the model on the CV set, which is the outsider samples.

After we have validated that our choice of model is OK, we might want to evaluate our completed model, and for that we will need a test set. One way to get a test set is to split the whole dataset into a training dataset and a test set, and then get our training set and CV set from the training dataset. So in this case, we are effectively splitting our whole dataset into 3 subsets: the training set is the only insider set visible to the training process; the CV set is the first outsider set, for validating the choice of model; and the test set is the second outsider set, for evaluation.
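The 3-way split described above can be sketched like this (the 60/20/20 proportions are an assumption, not something prescribed by the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
idx = rng.permutation(n)   # shuffle sample indices before splitting

train_idx = idx[:60]    # insider set: visible to the training process
cv_idx    = idx[60:80]  # first outsider set: validate choice of model
test_idx  = idx[80:]    # second outsider set: final evaluation
```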

What do you mean by “trail” here?

Oh, I meant “trials”. By a trial, I mean checking whether a given power of x, e.g. x^2, is linear with the label y.

1 Like