Share your summary of MLS Course 1

Hi all,

I thought it would be a good exercise for me to summarize what I have learned in MLS Course 1. You might find this helpful too, so I encourage you to share your own summary. Maybe we can give each other feedback.

I will post this as the first post in the topic and then begin working on my summary.


it looks good so far, @jesse

thanks for summarising @jesse
I would add a bit on improving the model:
underfit (high bias) → train with larger network (more nodes or layers), or try different algorithms, or train longer, etc…
overfit (high variance) → add regularizations, or more data can help


Well, it looks like I cannot keep editing my first post since the edit window times out, so I will delete and repost as needed.

In MLS Course 1 I learned about two kinds of supervised machine learning problems, (1) regression and (2) classification. Here, supervised means that the learning algorithm is given labeled examples to train on.

Linear Regression

univariate linear regression means only one input feature x
General Notation — f_{w,b}(x)=wx+b
Training Set Prediction Notation — f_{w,b}(x^{(i)})=wx^{(i)}+b
multiple linear regression means multiple input features \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} where n is the number of features
General Notation — f_{w,b}(x_1, x_2, ..., x_n)=w_1x_1+w_2x_2+...+w_nx_n+b
Vectorized Notation — f_{\vec{w},b}(\vec{x})=\vec{w} \cdot \vec{x} + b
Training Set Prediction Notation — f_{\vec{w},b}(\vec{x^{(i)}})=\vec{w} \cdot \vec{x^{(i)}} + b
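As a quick sketch of the vectorized notation in NumPy (function and variable names here are my own, not from the course):

```python
import numpy as np

def predict(x, w, b):
    """Vectorized linear model: f_{w,b}(x) = w . x + b."""
    return np.dot(w, x) + b

# Example with n = 3 features
w = np.array([2.0, -1.0, 0.5])
b = 4.0
x = np.array([1.0, 2.0, 3.0])
print(predict(x, w, b))  # 2*1 + (-1)*2 + 0.5*3 + 4 = 5.5
```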

The goal here is to fit a line or curve to a dataset (i.e., model) that can be used to predict “reasonable” values for new values of x (in the context of univariate) or \vec{x} (in the context of multiple).

With this goal in mind, the question is, “How do we accomplish this goal?”

  • Define what it means to have a better or worse prediction — cost function
  • Define a goal for getting better predictions — minimize the cost function
  • Determine how we will minimize the cost function — gradient descent

Cost Function

Note: From here on out I will only reference multiple linear notations.

In the context of our linear regression problem, we use the squared error cost function which looks like this:

J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2
This can be summarized as:

The cost of our model given specific values for the parameters w and b

It is important to note that this cost is in the context of all outcomes of the model on the training set.
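A minimal NumPy sketch of computing this cost over a whole training set (my own variable names; X holds one training example per row):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost J(w, b), averaged over all m training examples."""
    m = X.shape[0]
    errors = X @ w + b - y          # prediction error for every example at once
    return np.sum(errors ** 2) / (2 * m)

# Tiny training set: 2 examples, 2 features, parameters that fit it exactly
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([5.0, 11.0])
print(compute_cost(X, y, np.array([1.0, 2.0]), 0.0))  # perfect fit -> 0.0
```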

Minimizing Cost

Minimizing the cost of our model means minimizing our cost function J(\vec{w},b). In other words, trying to get our model to do the very best it can at predicting “acceptable” values for new values of \vec{x}.

Gradient descent

Gradient descent is a method of iterating over values for the parameters of the model until a specific goal is satisfied. That goal is convergence. Convergence here means “at a local minimum cost” where the parameters no longer change much with each additional step.

Gradient descent works well with convex functions, which have a single global minimum. The squared error cost function produces this bowl shape.

The parts of the gradient descent equation that stick out to me are:

  • \alpha — the learning rate (changing this gives bigger or smaller steps)
  • the gradient — the partial derivative of the cost function with respect to a parameter (e.g., w_j), which gives the slope in that parameter’s direction
  • it is ideal to graph the cost function with each step of gradient descent to ensure the cost is going down at a reasonable rate

Once the goal of gradient descent is satisfied (i.e., convergence) then that model with those specific parameters is considered ready to produce “acceptable” predictions.
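A sketch of the full loop, assuming the squared error cost from earlier (the learning rate and iteration count here are arbitrary choices for a toy example):

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Batch gradient descent for linear regression with squared error cost."""
    m = X.shape[0]
    for _ in range(num_iters):
        err = X @ w + b - y          # error for each training example
        dj_dw = (X.T @ err) / m      # partial derivatives w.r.t. each w_j
        dj_db = np.sum(err) / m      # partial derivative w.r.t. b
        w = w - alpha * dj_dw        # simultaneous parameter update
        b = b - alpha * dj_db
    return w, b

# Recover y = 2x + 1 from four points
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, b = gradient_descent(X, y, np.zeros(1), 0.0, alpha=0.1, num_iters=2000)
print(w, b)  # converges toward w ≈ [2.], b ≈ 1.
```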

Improving Gradient Descent

Feature Scaling

When your features \vec{x} are on very different scales (e.g., x^{(i)}_1=1000 while x^{(i)}_2=.01) gradient descent can take a long time to converge. Scaling these features can help overcome this problem. The three common ways to do this are:

  1. scaling → divide a given x_j by its corresponding max value, normalizing all \vec{x} between 0 and 1
  2. mean normalization → \frac{x_j - \mu_j}{x_{j \text{ range}}}, normalizing values between -1 and 1 with a mean of 0
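A sketch of these rescalings in NumPy, plus the z-score variant mentioned later in the thread (each function works column by column; names are my own):

```python
import numpy as np

def max_scale(X):
    """Divide each feature column by its max -> values in [0, 1] for positive features."""
    return X / X.max(axis=0)

def mean_normalize(X):
    """(x_j - mu_j) / range_j -> values roughly in [-1, 1] with mean 0."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_score_normalize(X):
    """(x_j - mu_j) / sigma_j -> each column has mean 0, standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Features on wildly different scales, as in the example above
X = np.array([[1000.0, 0.01],
              [2000.0, 0.03],
              [3000.0, 0.02]])
print(mean_normalize(X))  # both columns now lie within [-0.5, 0.5]
```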

Improving The Model

So you have a model, you have a cost function, you minimize the cost function using gradient descent, and your predictions are not that great… Now what?


Underfitting is when a model has high bias


Overfitting is when a model has high variance

3 ways to improve

  1. More training data
  2. Different features or better engineered features can help (e.g., removing features)
  3. Regularization which reduces the size / impact (i.e., weight) of parameters

Regularization

Regularization is the process of reducing the impact of lesser input features on predictions without removing any of them from your learning algorithm. It is applied at the level of the cost function, which in turn has an impact on the calculation of your gradient during gradient descent.

Note: You do not need to apply regularization on the parameter b.

\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}
where if,
\lambda is close to 0, the regularization term has almost no effect, so an originally overfit model stays overfit,
or if,
\lambda is a very large number, the parameters \vec{w} are pushed toward 0 and the model will underfit

Regularization (w/ Squared Error Cost)

J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2+\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}

Regularization (w/ Squared Error Cost in Gradient Descent)

\frac{\partial J(\vec{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j
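A sketch of this regularized gradient in NumPy (my own names; note that b gets no \lambda term, matching the note above):

```python
import numpy as np

def compute_gradient_reg(X, y, w, b, lam):
    """Gradient of the regularized squared error cost; b is not regularized."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lam / m) * w   # extra (lambda/m) * w_j term
    dj_db = np.sum(err) / m                   # unchanged: no regularization on b
    return dj_dw, dj_db

# With a perfect fit, the only gradient left is the regularization pull on w
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([5.0, 11.0])
dj_dw, dj_db = compute_gradient_reg(X, y, np.array([1.0, 2.0]), 0.0, lam=1.0)
print(dj_dw, dj_db)  # [0.5 1. ] and 0.0
```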

Logistic Regression (Binary Classification)

This is a valid approach when your y is binary, 0 or 1. This methodology uses the sigmoid function which produces values between 0 and 1.

Sigmoid function → g(z)=\frac{1}{1+e^{-z}}

We use this function by applying it to our existing linear model, that is, with z = \vec{w} \cdot \vec{x} + b. So logistic regression looks like,

Logistic regression → f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}

The above equation produces values between 0 and 1 which can be thought of as:

The probability y will be 1

The output is not actually binary though, at least not until we apply something called the decision boundary, which is a heuristic threshold (a custom rule). The most common rule is: for values of f_{\vec{w},b}(\vec{x}) \geq 0.5, predict 1, else predict 0.

We happen to know that f_{\vec{w},b}(\vec{x}) \geq 0.5 when z \geq 0 (i.e., \vec{w} \cdot \vec{x} + b \geq 0). This is better understood in a picture which won’t be drawn here.
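A sketch of the sigmoid plus the 0.5 decision boundary in NumPy (function names and the threshold parameter are my own):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    """Logistic regression prediction: probability, then the decision boundary."""
    prob = sigmoid(np.dot(w, x) + b)   # probability that y = 1
    return 1 if prob >= threshold else 0

# g(z) >= 0.5 exactly when z = w.x + b >= 0
w = np.array([1.0, 1.0])
b = -3.0
print(predict_class(np.array([2.0, 2.0]), w, b))  # z = 1  -> predicts 1
print(predict_class(np.array([1.0, 1.0]), w, b))  # z = -1 -> predicts 0
```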

Cost Function

Something important to consider here is that we need a cost function that captures the unique characteristics of our problem, where y is only 0 or 1, AND produces a convex cost surface. Unfortunately, the squared error cost function does not work here (it creates many local minima). This introduces a new concept, the loss function, which will be combined with a cost function similar to (but not the same as) the squared error cost function.

The new cost function looks like this,

J(\vec{w},b) = \frac{1}{m} \sum\limits_{i = 1}^{m} \left[ loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) \right]

Loss Function

The loss function is applied for each iteration i of the training set and looks like this,

loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right)

While this looks overwhelming, think about our specific cases for y, that is, when it is equal to 0 or 1. This function penalizes the model heavily when it is further from the truth (e.g., y^{(i)}=1 and the model predicts 0).
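Plugging in the two cases for y makes the penalty easy to see (a sketch; here f stands for the model's predicted probability):

```python
import numpy as np

def logistic_loss(f, y):
    """Loss for one training example; f is the predicted probability that y = 1."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

# For y = 1: a confident, correct prediction costs little;
# a confident, wrong one costs a lot
print(logistic_loss(0.99, 1))  # ~0.01
print(logistic_loss(0.01, 1))  # ~4.61
```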

Convenient Reality

Other concepts from linear regression carry over, like feature scaling and regularization, and gradient descent is even written the same way for the parameters w_j. However, while gradient descent looks the same, remember that behind the scenes the cost function is different and f_{\vec{w},b}(\vec{x}^{(i)}) is actually the sigmoid function.

Major Takeaways

  • Seeing i typically means we are talking about iterating through the training set. So in the context of prediction, i=2 means we are talking about the model’s prediction given the values of x at training set sample 2

I will come back and finish this.


Hi @jesse,


I think it would be better to say: in an originally overfit model, setting lambda close to zero won’t help defeat overfitting and the model stays overfit, and with a lambda that is too large, the model would become underfit.


Well said.

I enjoy reading this summary. Thank you for sharing it.


Hi I am new to machine learning.
“is when a model has high bias”
Can you explain the above line?

Hi @jesse,


This comment is wrong. No summation is missing. I have removed it from my above reply.


It looks like you left out the 3rd method of Feature Scaling - z-score normalization, where each x_j is rescaled as \frac{x_j - \mu_j}{\sigma_j}.

Thanks for sharing your summary, @jesse ! Very helpful.

I am the same as you – I made a summary and was also thinking of sharing it here with others in case they’d find it helpful, and then I saw your post!

The summary touches on key topics from the course. Outside of serving as reminders, they will also help me never lose sight of the fundamentals. You can find my summary here.

The notes are nice and quite practically oriented. Thank you for sharing them with us, @christiney! :raised_hands:



Thanks, @rmwkwok ! Excited to make progress with the second course and keep learning!