In MLS Course 1 I learned about two kinds of supervised machine learning problems: (1) regression (here, linear regression) and (2) classification. Here, supervised means that the learning algorithm is provided with labels (the correct answers) during training.
Linear Regression
univariate linear regression means only one input feature x
General Notation — f_{w,b}(x)=wx+b
Training Set Prediction Notation — f_{w,b}(x^{(i)})=wx^{(i)}+b
multiple linear regression means multiple input features \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} where n is the number of features
General Notation — f_{\vec{w},b}(x_1, x_2, ..., x_n)=w_1x_1+w_2x_2+...+w_nx_n+b
Vectorized Notation — f_{\vec{w},b}(\vec{x})=\vec{w} \cdot \vec{x} + b
Training Set Prediction Notation — f_{\vec{w},b}(\vec{x}^{(i)})=\vec{w} \cdot \vec{x}^{(i)} + b
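To make the vectorized notation concrete, here is a minimal NumPy sketch (the function and variable names are my own, purely for illustration):

```python
import numpy as np

def predict(x, w, b):
    """f_wb(x) = w . x + b for a single example x with n features."""
    return np.dot(w, x) + b

w = np.array([0.5, -1.2, 3.0])   # hypothetical learned weights
b = 4.0                          # hypothetical learned bias
x = np.array([2.0, 1.0, 0.5])    # one new example with n = 3 features
print(predict(x, w, b))          # 0.5*2 - 1.2*1 + 3.0*0.5 + 4.0 = 5.3
```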
The goal here is to fit a line or curve to a dataset (i.e., build a model) that can be used to predict “reasonable” values for new values of x (in the univariate case) or \vec{x} (in the multiple case).
With this goal in mind, the question is, “How do we accomplish this goal?”
- Define what it means to have a better or worse prediction — cost function
- Define a goal for getting better predictions — minimize the cost function
- Determine how we will minimize the cost function — gradient descent
Cost Function
Note: From here on out I will only use multiple linear regression notation.
In the context of our linear regression problem, we use the squared error cost function which looks like this:
J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2
This can be summarized as:
The cost of our model given specific values for the parameters w and b
It is important to note that this cost is computed over the model's predictions for the entire training set, not for any single example.
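As a concrete illustration, the squared error cost above translates to a few lines of NumPy. This is a sketch under my own naming assumptions (compute_cost, X, y are not course code):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost J(w, b) over all m training examples.

    X is the (m, n) matrix of inputs, y the (m,) vector of targets.
    """
    m = X.shape[0]
    errors = (X @ w + b) - y          # f_wb(x^(i)) - y^(i) for every i
    return np.sum(errors ** 2) / (2 * m)
```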
Minimizing Cost
Minimizing the cost of our model means minimizing our cost function J(\vec{w},b). In other words, trying to get our model to do the very best it can at predicting “acceptable” values for new values of \vec{x}.
Gradient Descent
Gradient descent is a method of iterating over values for the parameters of the model until a specific goal is satisfied. That goal is convergence. Convergence here means “at a local minimum cost” where the parameters no longer change much with each additional step.
Gradient descent works well with convex (bowl-shaped) functions, which have a single global minimum. The squared error cost function for linear regression produces this shape.
The update rule is w_j = w_j - \alpha \frac{\partial J(\vec{w},b)}{\partial w_j} (and likewise for b), applied simultaneously to all parameters. The parts of the gradient descent equation that stick out to me are:
- \alpha — the learning rate (changing this gives bigger and smaller steps)
- the gradient — the partial derivative \frac{\partial J}{\partial w_j} of the cost function with respect to a parameter (e.g., w_j), i.e., the slope of the cost at the current parameter values
- it is ideal to graph the cost function with each step of gradient descent to ensure the cost is going down and at a reasonable rate
Once the goal of gradient descent is satisfied (i.e., convergence) then that model with those specific parameters is considered ready to produce “acceptable” predictions.
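Here is one way the full loop could look as a NumPy sketch under the same assumptions as the earlier snippets (batch gradient descent with the squared error cost; all names are mine). It records the cost at each step so you can plot it, per the note above:

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Batch gradient descent for linear regression with squared error cost."""
    m = X.shape[0]
    cost_history = []                       # track J each step to plot later
    for _ in range(num_iters):
        errors = (X @ w + b) - y            # (m,) prediction errors
        dj_dw = (X.T @ errors) / m          # partial of J w.r.t. each w_j
        dj_db = np.sum(errors) / m          # partial of J w.r.t. b
        w = w - alpha * dj_dw               # simultaneous parameter update
        b = b - alpha * dj_db
        cost_history.append(np.sum(errors ** 2) / (2 * m))
    return w, b, cost_history
```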
Improving Gradient Descent
Feature Scaling
When your features \vec{x} are on very different scales (e.g., x^{(i)}_1=1000 while x^{(i)}_2=0.01) gradient descent can take a long time to converge. Scaling these features can help overcome this problem. The three common ways to do this (one is sketched in code after this list) are:
- scaling → divide a given x_j by its corresponding max value, normalizing all values of x_j to between 0 and 1
- mean normalization → \frac{x_j - \mu_j}{x_{j \text{ range}}} normalizing values to roughly between -1 and 1 with a mean of 0
- z-score normalization → \frac{x_j - \mu_j}{\sigma_j} where \sigma_j is the standard deviation of feature j, normalizing values to have a mean of 0 and a standard deviation of 1
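As an example, z-score normalization might look like this in NumPy (a sketch; names are mine). You would save mu and sigma so that new inputs can be scaled the same way before prediction:

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column of X) to mean 0 and unit variance."""
    mu = np.mean(X, axis=0)       # per-feature mean
    sigma = np.std(X, axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma
```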
Improving The Model
So you have a model, you have a cost function, you minimize the cost function using gradient descent, and your predictions are not that great… Now what?
- underfit → when a model has high bias (it misses the underlying pattern even on the training set)
- overfit → when a model has high variance (it fits the training set too closely and generalizes poorly to new data)
3 Ways To Improve
- More training data
- Different features or better engineered features can help (e.g., removing features)
- Regularization, which reduces the size / impact of the parameters (i.e., the weights)
Regularization
The process of reducing the impact of less important input features on predictions without removing any of them from your learning algorithm. This is applied at the level of the cost function, which in turn has an impact on the calculation of your gradient during gradient descent.
Note: You do not need to apply regularization to the parameter b.
\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}
where if,
\lambda is close to 0, the regularization term has almost no effect and the model can still overfit
or if,
\lambda is a very large number, the parameters \vec{w} are driven toward 0 (basically eliminating them) and the model will underfit
Regularization (w/ Squared Error Cost)
J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2+\frac{\lambda}{2m}\sum\limits_{j = 1}^{n}w_j^{2}
Regularization (w/ Squared Error Cost in Gradient Descent)
\frac{\partial J(\vec{w},b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j
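Putting the two regularized formulas together, a NumPy sketch might look like the following (names are mine; note that only the w gradient gets the lambda term, matching the note about b above):

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_):
    """Squared error cost plus the L2 penalty on w (b is not regularized)."""
    m = X.shape[0]
    errors = (X @ w + b) - y
    return np.sum(errors ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def compute_gradient_reg(X, y, w, b, lambda_):
    """Gradients of the regularized cost; only dj_dw gains the lambda term."""
    m = X.shape[0]
    errors = (X @ w + b) - y
    dj_dw = (X.T @ errors) / m + (lambda_ / m) * w
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db
```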
Logistic Regression (Binary Classification)
This is a valid approach when your y is binary, 0 or 1. This methodology uses the sigmoid function, which produces values between 0 and 1.
Sigmoid function → g(z)=\frac{1}{1+e^{-z}}
We use this function by applying it to our existing linear model, taking z=\vec{w} \cdot \vec{x} + b (the output of the linear model). So logistic regression looks like,
Logistic regression → f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}
The above equation produces values between 0 and 1 which can be thought of as:
The probability that y will be 1, given the input \vec{x}
The output itself is not really binary though, at least not until we apply something called the decision boundary, which is a heuristic threshold (a custom rule). The most common rule is: for values of f_{\vec{w},b}(\vec{x}) \geq 0.5, predict 1; otherwise predict 0.
We happen to know that f_{\vec{w},b}(\vec{x}) \geq 0.5 exactly when z \geq 0 (i.e., \vec{w} \cdot \vec{x} + b \geq 0), because the sigmoid crosses 0.5 at z=0. This is better understood in a picture, which won’t be drawn here.
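In code, the sigmoid and the 0.5 decision boundary together might look like this (a hypothetical sketch; names are mine):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Apply the decision boundary: predict 1 when P(y=1) >= threshold."""
    probabilities = sigmoid(X @ w + b)    # f_wb(x^(i)) for every example
    return (probabilities >= threshold).astype(int)
```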
Cost Function
Something important to consider here is that we need a cost function that captures the unique characteristics of our model predicting a binary target AND produces a convex surface to descend. Unfortunately, the squared error cost function does not work here (combined with the sigmoid, it creates many local minima). This introduces a new concept, the loss function, which is combined with a cost function similar to (but not the same as) the squared error cost function.
The new cost function looks like this,
J(\vec{w},b) = \frac{1}{m} \sum\limits_{i = 1}^{m} \left[ loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) \right]
Loss Function
The loss function is applied to each example i of the training set and looks like this,
loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\vec{w},b}\left( \vec{x}^{(i)} \right) \right)
While this looks overwhelming, think about our specific cases for y. When y^{(i)}=1 the second term drops out, leaving -\log(f); when y^{(i)}=0 the first term drops out, leaving -\log(1-f). Either way, this function penalizes the model heavily when it is further from the truth (e.g., y^{(i)}=1 and the model predicts a value near 0, where -\log(f) grows without bound).
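A direct translation of the loss into NumPy, averaged into the cost from above (a sketch; in practice you would guard against log(0), which I've done here with a small epsilon of my own choosing):

```python
import numpy as np

def logistic_cost(X, y, w, b, eps=1e-15):
    """Average logistic loss over the m training examples."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid predictions
    f = np.clip(f, eps, 1 - eps)                 # avoid log(0) blowing up
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return np.mean(loss)
```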
Convenient Reality
Other concepts from linear regression carry over, like feature scaling and regularization, and even the gradient descent update is written the same way for the parameters w_j. However, remember that the cost function behind the scenes is different and f_{\vec{w},b}(\vec{x}^{(i)}) is now the sigmoid function.
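To illustrate that carry-over, compare this gradient sketch with the linear regression one earlier: the form is identical, and only the line computing f changes (names are mine):

```python
import numpy as np

def logistic_gradient(X, y, w, b):
    """Same shape as the linear regression gradient; only f_wb differs."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid, not the raw linear model
    errors = f - y
    dj_dw = (X.T @ errors) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db
```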
Major Takeaways
- Seeing i typically means we are talking about iterating through the training set. So in the context of prediction, i=2 means we are talking about the model’s prediction given the values of \vec{x} at training set example 2
I will come back and finish this.