In the video above, at 1 min 52 secs, Prof. Ng explains that…
“So when you minimize this function, you are going to end up with W3 close to 0 and W4 close to zero.”
However, up to this point in the lesson I cannot see, even intuitively, how he arrives at that conclusion.
It's only when I compute the partial derivative of the cost function and include it in the expression for the weight-parameter updates that I can see how w_3 and w_4 both get close to zero. Unfortunately, Prof. Ng doesn't show this step so far.
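For reference, here is the calculation I mean, written with the cost from the lecture (the squared-error term plus the two penalty terms 1000 w_3^2 and 1000 w_4^2). This is my own working, not something shown in the video, and I write x_3^{(i)} for the feature that multiplies w_3:

```latex
% Partial derivative of the penalized cost with respect to w_3:
\frac{\partial J}{\partial w_3}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)\,x_3^{(i)}
    + 2000\,w_3

% Substituting into the gradient-descent update w_3 := w_3 - \alpha * dJ/dw_3:
w_3 := (1 - 2000\,\alpha)\,w_3
       - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)\,x_3^{(i)}
```

The factor (1 - 2000\alpha) multiplies w_3 on every iteration, so w_3 (and likewise w_4) is driven towards zero unless the data term pushes back strongly.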
I think it might have been useful if Prof. Ng had said something like… “In the upcoming videos you will see how this happens.”
Perhaps a “Reading Item” could be added so that future students who cannot see how he arrives at this conclusion here know that it becomes clear in later lessons?
Without computing the partial derivatives, it might not be immediately clear how the optimization process explicitly reduces w_3 and w_4. In the video you mentioned, Prof. Ng is relying on an intuitive understanding of how regularization works rather than deriving it mathematically at that moment. When you minimize the cost function J(\vec{w}, b), i.e.

J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + 1000\,w_3^2 + 1000\,w_4^2,
the algorithm will prioritize keeping w_3 and w_4 small: the coefficient 1000 in the regularization term is very large, so the penalty for large w_3 and w_4 is severe, even if keeping them small means sacrificing some accuracy on the training data. The next video, Regularized linear regression, provides a more detailed explanation.
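If it helps, the effect is easy to reproduce numerically. Below is a small sketch with my own toy data (not the course's example): the target depends mainly on the first two features, and I add the penalty 1000 w_3^2 + 1000 w_4^2 to the squared-error cost exactly as in the video. Gradient descent then drives w_3 and w_4 towards zero:

```python
import numpy as np

# Toy data (my own construction): 50 examples, 4 features, target depends
# mainly on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2] + 0.1 * X[:, 3]

m = len(y)
w, b = np.ones(4), 0.0
alpha = 0.0005                      # small, because the penalty gradient is large

for _ in range(20000):
    err = X @ w + b - y             # f(x) - y for every example
    grad_w = X.T @ err / m          # gradient of the data-fit term
    grad_w[2] += 2000 * w[2]        # d/dw3 of the penalty 1000 * w3^2
    grad_w[3] += 2000 * w[3]        # d/dw4 of the penalty 1000 * w4^2
    w -= alpha * grad_w
    b -= alpha * err.mean()

print(np.round(w, 4))               # w[2] and w[3] end up very close to zero
```

The first two weights converge to roughly their true values, while the two penalized weights are held near zero despite the target having a small genuine dependence on them.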
In machine learning, L_2 regularization is often referred to as weight decay because it explicitly reduces the magnitude of the weights during the optimization process.
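To see where the name comes from: with the course's regularization term \frac{\lambda}{2m}\sum_j w_j^2, each gradient-descent step multiplies every weight by a constant factor slightly below one. A sketch, writing J_0 for the unregularized part of the cost (my notation, not the course's):

```latex
w_j := w_j - \alpha\left(\frac{\partial J_0}{\partial w_j} + \frac{\lambda}{m}\,w_j\right)
     = \left(1 - \frac{\alpha\lambda}{m}\right) w_j - \alpha\,\frac{\partial J_0}{\partial w_j}
```

The multiplicative factor (1 - \alpha\lambda/m) is the “decay”: it shrinks w_j a little on every iteration, independently of the data-fit gradient.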
You may also find the following thread helpful.
In Machine Learning, “penalty” typically refers to a regularization term added to the loss function to discourage overfitting by controlling model complexity.
Also, without including the full expression for the weight updates, your expression does not minimize the cost function J(\vec{w}, b).
My expression is the objective function from the lecture you are referring to.
I cannot find any reference to the term 1000 \cdot u in Prof. Ng’s video lesson.
Also I am unfamiliar with “dot-operator” being applied to a number argument and a variable argument, only between vector arguments like \vec a \cdot \vec b.
What do you mean by an “objective function”? What is that?
And what do you mean by “…controlling model complexity…”?
Also I am unfamiliar with “dot-operator” being applied to a number argument and a variable argument, only between vector arguments like \vec{a} \cdot \vec{b}.
I am not really using a “dot-operator” here; in this context “\cdot” just denotes a scalar-to-scalar product, because both 1000 and w_i are scalars.
What do you mean by an “objective function”? What is that?
It is J(\vec{w}, b)
And what do you mean by “…controlling model complexity…”?
When we talk about controlling model complexity, we mean preventing a machine learning model from becoming too simple (underfitting) or too complex (overfitting).
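As a concrete illustration with made-up numbers (not from the lecture): a degree-5 polynomial has enough capacity to pass through six data points exactly, while a straight line cannot follow their curvature. That capacity gap is the sense in which the polynomial is the more “complex” model:

```python
import numpy as np

# Six hand-picked points following a roughly quadratic trend.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.3, 24.9])

sse = {}
for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    sse[degree] = float(np.sum((y - np.polyval(coeffs, x)) ** 2))

# The degree-5 model interpolates all six points (training error ~ 0);
# the straight line is left with a large residual.
print(sse)
```

Zero training error on its own is not a virtue here: the degree-5 curve is also fitting whatever noise is in the six points, which is exactly the overfitting that regularization is meant to control.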
I have watched and listened to the video again at 1 min 52 secs, but I still cannot see or hear the term 1000 \cdot u.
I think it would be clearer to avoid the dot-operator, which suggests a vector operation, and just write 1000u to mean simple multiplication of two numbers. Otherwise it can be confusing and ambiguous.
I think it is best to avoid using words and terms that Prof. Ng doesn’t use in his lessons - like “…objective function…” and “…controlling model complexity…” - as it can only lead to confusion and ambiguity.
@ai_is_cool
The term 1000 \cdot w_i follows standard notation where the dot represents scalar multiplication, which is commonly used in mathematical and programming contexts. However, I understand that notation preferences can vary. If you find 1000 w_i clearer, that’s totally reasonable. Objective function is the most general term for any function that you optimize during training. Model complexity refers to the capacity of a model to fit data, i.e. a “simple” model such as the linear y = \theta_0 + \theta_1 x, or a more “complex” model such as the polynomial y = \theta_0 + \theta_1 x + \dots + \theta_5 x^5.
Prof. Ng used this terminology in CS229. Using different wording can sometimes help clarify ideas, especially for those who may have encountered these terms elsewhere. But I understand the importance of sticking to familiar terminology for consistency. Thanks for sharing your perspective!
Thanks for that, but I think it is important that only the words and nomenclature appearing in the course are used, to avoid ambiguity and lack of clarity.