Week 3 lab 09: Vectorized implementation of cost function and regularization term

I’m trying to implement the cost function using vectorized notation by converting the cost function equation to vectorized form: J(\vec w,b)=\frac{1}{2m}||f_{\vec w,b}(X)-\vec y||^2_2+\frac{\lambda}{2m}||\vec w||^2_2=\frac{1}{2m}(f_{\vec w,b}(X)-\vec y)^T(f_{\vec w,b}(X)-\vec y)+\frac{\lambda}{2m}\vec w^T\vec w. Is this the best implementation and notation for the cost function?

In many other notations, an L_1 regularization term is added instead. What is the difference between L_1 and L_2, and how does it affect the solution?

Hey @eslam_shaheen,

Defining the “best” implementation is, in my opinion, subjective. If you asked me this question, I would ask you to define “best”: do you mean the most concise implementation, the most interpretable one, the fastest one, and so on? So I don’t think there is such a thing as the “best” implementation, at least as far as learning is concerned.

What you have implemented looks good to me, and it should pass all the test cases; if it doesn’t, there is probably some small issue with the implementation. It is also possible, though rare, that a correct vectorized implementation still fails the auto-grader because of small numerical differences between the various functions used in the vectorized implementation. I hope this answers your first question.
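For concreteness, here is a minimal NumPy sketch of that vectorized cost, assuming a linear model f_{\vec w,b}(X) = X\vec w + b; the function name and signature are just for illustration, not the exact ones the lab requires:

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_=1.0):
    """Vectorized regularized cost:
    J(w, b) = (1/(2m)) * ||X @ w + b - y||^2 + (lambda/(2m)) * ||w||^2
    """
    m = X.shape[0]
    err = X @ w + b - y                   # f_{w,b}(X) - y, shape (m,)
    cost = (err @ err) / (2 * m)          # (1/(2m)) * squared error term
    reg = (lambda_ / (2 * m)) * (w @ w)   # L2 regularization term (added, not subtracted)
    return cost + reg
```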

As for the second question, L_1 is another form of regularization, just like L_2: instead of minimising the squares of the weights (as we do in L_2), we minimise the absolute values of the weights. L_1 tends to encourage sparser weights, i.e. it drives more of them to exactly zero. I don’t recall Prof Andrew mentioning L_1 regularization anywhere in the specialization, but if you do find it, please let me know for future reference. Also, you can take a look at the resource mentioned below (an excellent blog post) for more in-depth knowledge about L_1 regularization.

Fighting Overfitting With L1 or L2 Regularization: Which One Is Better?
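To make the contrast concrete, here is a small sketch with made-up numbers comparing the two penalty terms; the \lambda/m versus \lambda/(2m) scaling is just one common convention:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])
lambda_, m = 1.0, 100

l2_penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)   # penalises large weights quadratically
l1_penalty = (lambda_ / m) * np.sum(np.abs(w))      # penalises every nonzero weight linearly

# The L1 gradient is a constant +/- lambda/m for any nonzero weight, so small
# weights keep getting pushed all the way to zero (sparsity). The L2 gradient
# shrinks as the weight shrinks, so weights get small but rarely hit exactly zero.
```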

I hope this answers your second question.

Regards,
Elemento


Hi @eslam_shaheen ,

Here is a demo from sklearn. If you look at the row for C=0.01, you can see that L1 has more white cells than L2, which shows that L1 is able to suppress more weights to exactly zero. Note that C works in the opposite direction: the smaller it is, the stronger the regularization becomes.
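If you want to reproduce the effect yourself, here is a rough sketch along the same lines (the digits dataset and the specific C values are assumptions on my part), counting how many coefficients each penalty drives to zero:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel intensities to [0, 1]

for C in (1.0, 0.1, 0.01):
    for penalty in ("l1", "l2"):
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear").fit(X, y)
        zero_frac = np.mean(clf.coef_ == 0)   # fraction of weights that are exactly zero
        print(f"C={C:<5} {penalty}: fraction of zero weights = {zero_frac:.2f}")
```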

Raymond


Hey @eslam_shaheen,
C is the inverse of the regularization parameter \lambda, and hence, the smaller the value of C, the larger the value of \lambda. You will find that in sklearn’s implementation of Logistic Regression, C is used instead of \lambda. This is done to keep a uniform convention across sklearn’s implementations of algorithms such as Logistic Regression, SVM, etc. If this doesn’t resonate with you much, feel free to leave it; just remember that C is the inverse of the regularization parameter \lambda whenever you use sklearn’s implementation of Logistic Regression or any other algorithm that supports regularization.
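As a quick illustration of that relationship (ignoring differences in how the loss is averaged over the training samples), something like the following maps the course’s \lambda onto sklearn’s C:

```python
from sklearn.linear_model import LogisticRegression

# Course notation: larger lambda -> stronger regularization.
# sklearn notation: C = 1 / lambda, so smaller C -> stronger regularization.
lambda_ = 10.0
strong_reg = LogisticRegression(C=1.0 / lambda_)  # C = 0.1, heavily regularized
weak_reg = LogisticRegression(C=100.0)            # corresponds to lambda = 0.01, light regularization
```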

Regards,
Elemento
