In the video above, at 1 min 52 secs, Prof. Ng explains that…
“So when you minimize this function, you are going to end up with W3 close to 0 and W4 close to zero.”
However, up to this point in the lesson I cannot see, even intuitively, how he arrives at that conclusion.
It's only when I compute the partial derivative of the cost function and include it in the expression for the weight-parameter updates that I can see how w_3 and w_4 both get close to zero. Unfortunately, Prof. Ng doesn't show this step so far.
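For reference, here is the calculation I mean, written with the cost from the lecture (the squared-error term plus the two penalty terms 1000 w_3^2 and 1000 w_4^2). This is my own working, not something shown in the video, and I write x_3^{(i)} for the feature that multiplies w_3:

```latex
% Partial derivative of the penalized cost with respect to w_3:
\frac{\partial J}{\partial w_3}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)\,x_3^{(i)}
    + 2000\,w_3

% Substituting into the gradient-descent update w_3 := w_3 - \alpha * dJ/dw_3:
w_3 := (1 - 2000\,\alpha)\,w_3
       - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)\,x_3^{(i)}
```

The factor (1 - 2000\alpha) multiplies w_3 on every iteration, so w_3 (and likewise w_4) is driven towards zero unless the data term pushes back strongly.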
I think it might have been useful if Prof. Ng had said something like… “In the upcoming videos you will see how this happens.”
Perhaps a “Reading Item” could be added so that future students who cannot see how he arrives at this conclusion here know that it becomes clear in later lessons?
Without computing the partial derivatives, it might not be immediately clear how the optimization process explicitly reduces w_3 and w_4. In the video you mentioned, Prof. Ng is relying on an intuitive understanding of how regularization works rather than deriving it mathematically at that moment. When you minimize the cost function J(\vec{w}, b), i.e.

J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + 1000\,w_3^2 + 1000\,w_4^2,
the algorithm will prioritize keeping w_3 and w_4 small: the coefficient 1000 in the regularization term is very large, so the penalty for large w_3 and w_4 is severe, even if keeping them small means sacrificing some accuracy on the training data. The next video, Regularized linear regression, provides a more detailed explanation.
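If it helps, the effect is easy to reproduce numerically. Below is a small sketch with my own toy data (not the course's example): the target depends mainly on the first two features, and I add the penalty 1000 w_3^2 + 1000 w_4^2 to the squared-error cost exactly as in the video. Gradient descent then drives w_3 and w_4 towards zero:

```python
import numpy as np

# Toy data (my own construction): 50 examples, 4 features, target depends
# mainly on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2] + 0.1 * X[:, 3]

m = len(y)
w, b = np.ones(4), 0.0
alpha = 0.0005                      # small, because the penalty gradient is large

for _ in range(20000):
    err = X @ w + b - y             # f(x) - y for every example
    grad_w = X.T @ err / m          # gradient of the data-fit term
    grad_w[2] += 2000 * w[2]        # d/dw3 of the penalty 1000 * w3^2
    grad_w[3] += 2000 * w[3]        # d/dw4 of the penalty 1000 * w4^2
    w -= alpha * grad_w
    b -= alpha * err.mean()

print(np.round(w, 4))               # w[2] and w[3] end up very close to zero
```

The first two weights converge to roughly their true values, while the two penalized weights are held near zero despite the target having a small genuine dependence on them.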
In machine learning, L_2 regularization is often referred to as weight decay because it explicitly reduces the magnitude of the weights during the optimization process.
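To see where the name comes from: with the course's regularization term \frac{\lambda}{2m}\sum_j w_j^2, each gradient-descent step multiplies every weight by a constant factor slightly below one. A sketch, writing J_0 for the unregularized part of the cost (my notation, not the course's):

```latex
w_j := w_j - \alpha\left(\frac{\partial J_0}{\partial w_j} + \frac{\lambda}{m}\,w_j\right)
     = \left(1 - \frac{\alpha\lambda}{m}\right) w_j - \alpha\,\frac{\partial J_0}{\partial w_j}
```

The multiplicative factor (1 - \alpha\lambda/m) is the “decay”: it shrinks w_j a little on every iteration, independently of the data-fit gradient.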
You may also find the following thread helpful.
In Machine Learning, “penalty” typically refers to a regularization term added to the loss function to discourage overfitting by controlling model complexity.
Also, without including the full expression for the weight updates, your expression does not minimize the cost function J(\vec{w}, b).
My expression is the objective function from the lecture you are referring to.
I cannot find any reference to the term 1000 \cdot u in Prof. Ng’s video lesson.
Also I am unfamiliar with “dot-operator” being applied to a number argument and a variable argument, only between vector arguments like \vec a \cdot \vec b.
What do you mean by an “objective function”? What is that?
And what do you mean by “…controlling model complexity…”?
Also I am unfamiliar with “dot-operator” being applied to a number argument and a variable argument, only between vector arguments like \vec{a} \cdot \vec{b}.
I am not really using a “dot-operator” here; in this context “\cdot” just denotes a scalar-to-scalar product, because both 1000 and w_i are scalars.
What do you mean by an “objective function”? What is that?
It is J(\vec{w}, b)
And what do you mean by “…controlling model complexity…”?
When we talk about controlling model complexity, we mean preventing a machine learning model from becoming too simple (underfitting) or too complex (overfitting).
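As a concrete illustration with made-up numbers (not from the lecture): a degree-5 polynomial has enough capacity to pass through six data points exactly, while a straight line cannot follow their curvature. That capacity gap is the sense in which the polynomial is the more “complex” model:

```python
import numpy as np

# Six hand-picked points following a roughly quadratic trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.3, 24.9])

sse = {}
for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    sse[degree] = float(np.sum((y - np.polyval(coeffs, x)) ** 2))

# The degree-5 model interpolates all six points (training error ~ 0);
# the straight line is left with a large residual.
print(sse)
```

Zero training error on its own is not a virtue here: the degree-5 curve is also fitting whatever noise is in the six points, which is exactly the overfitting that regularization is meant to control.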
I have watched and listened to the video again at 1 min 52 secs, but I still cannot see or hear the term 1000 \cdot u.
I think it would be clearer to avoid the dot-operator, which suggests a vector operation, and just write 1000u to mean simple multiplication of two numbers. Otherwise it can be confusing and ambiguous.
I think it is best to avoid using words and terms that Prof. Ng doesn’t use in his lessons - like “…objective function…” and “…controlling model complexity…” - as it can only lead to confusion and ambiguity.
@ai_is_cool
The term 1000 \cdot w_i follows standard notation where the dot represents scalar multiplication, which is commonly used in mathematical and programming contexts. However, I understand that notation preferences can vary. If you find 1000 w_i clearer, that’s totally reasonable. Objective function is the most general term for any function that you optimize during training. Model complexity refers to the capacity of a model to fit data, i.e. a “simple” model such as the linear y = \theta_0 + \theta_1 x, or a more “complex” model such as the polynomial y = \theta_0 + \theta_1 x + \dots + \theta_5 x^5.
Prof. Ng used this terminology in CS229. Using different wording can sometimes help clarify ideas, especially for those who may have encountered these terms elsewhere. But I understand the importance of sticking to familiar terminology for consistency. Thanks for sharing your perspective!
Thanks for that, but I think it is important that only the words and nomenclature appearing in the course are used, to avoid ambiguity and lack of clarity.