Title of the topic: [Visualizing the cost function | Coursera]
Context: So, I was going through this topic, and the discussion was around w, b, and the cost function. What I have seen so far is that we compute the cost for each candidate f_w,b(x) and select the w and b that give the lowest cost.
My question: Is there a way to get the right values for w and b directly, like a formula or something? Because what I am seeing so far looks more like trial and error. We assume some values for w and b based on intuition, calculate the cost for them, and then, after finding the cost for multiple pairs, select the most effective one. Isn’t that time-consuming?
P.S. Apologies in advance if I come across as ignorant. I am totally new to this. Thank you.
That’s a great question! Instead of relying on a manual trial-and-error process, we typically use optimization algorithms, such as gradient descent, to efficiently identify the optimal w and b that minimize the cost function J(w, b). This method helps avoid the time-consuming and impractical guessing of values, especially for large datasets or complex functions.
The advantage is that gradient descent uses a systematic approach to “slide down” the 3D bowl shape of the cost function to the lowest point, rather than randomly selecting points. Thus, instead of trial and error, this algorithmic approach is both systematic and efficient in finding the best w and b.
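To make that concrete, here is a minimal sketch of gradient descent for single-feature linear regression, assuming the usual squared-error cost J(w, b) = (1/2m) Σ (w·x⁽ⁱ⁾ + b − y⁽ⁱ⁾)². The learning rate, iteration count, and toy data below are made up for illustration, not part of the course material:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=2000):
    """Minimize J(w, b) = (1/2m) * sum((w*x + b - y)^2) for one feature."""
    w, b = 0.0, 0.0                   # start anywhere; zero is fine here
    for _ in range(num_iters):
        err = w * x + b - y           # prediction error for every example
        dj_dw = (err * x).mean()      # partial derivative of J w.r.t. w
        dj_db = err.mean()            # partial derivative of J w.r.t. b
        w -= alpha * dj_dw            # step "downhill" on the cost surface
        b -= alpha * dj_db
    return w, b

# toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
w, b = gradient_descent(x, y)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each iteration nudges w and b in the direction that decreases the cost, which is exactly the “sliding down the bowl” picture from the lectures.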
Thank you very much for your detailed response. I am going to start gradient descent tomorrow. I still have some skepticism, but for now my curiosity has been satisfied. I might post on the same thread again if I still have some uncertainties once I finish this topic.
But thank you very much.
While your question has been answered, you might be interested to know that a closed-form solution does exist provided certain criteria are met. It allows us to compute the optimal weights and biases exactly instead of iterating toward them. In the case of linear regression, the optimal values for the weights and bias can be computed using the Normal Equation. However, it only works if X^T X is invertible, which generally holds when X has full column rank, i.e. the features are linearly independent and there are no more features than data points. The Normal Equation is described at Linear least squares - Wikipedia if you are interested in learning more.
That being said, as the number of features increases, the Normal Equation becomes increasingly impractical, as the computational cost scales with the cube of the number of features. For that reason, I wouldn’t consider it a practical alternative to iterative methods for most problems.
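For completeness, here is a minimal sketch of the Normal Equation, θ = (XᵀX)⁻¹ Xᵀy, applied to the same kind of single-feature linear regression; the toy data and variable names are just for illustration:

```python
import numpy as np

# toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# design matrix with a column of ones so the bias b is learned as theta[0]
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^{-1} X^T y
# solving the linear system is preferred over explicitly inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta
print(f"w = {w:.3f}, b = {b:.3f}")
```

This gives the exact minimizer in one step, but as noted above, the cost of solving that system grows quickly as the number of features increases.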
@danieljhand @TMosh thank you for the extra bits of information. Honestly, some of what you said went over my head since I am a newcomer to all this, but I really appreciate the new insights.
Thank you very much.
I feel you. I am also new to this topic and had the same confusion (if I may put it that way) about the seemingly random choice of w and b at the beginning xD. The thing is, you just have to start with something; you cannot optimize from nothing. It will make much more sense later when alpha, the learning rate, is introduced, because then you will be happy to see the w and b values automatically optimized for whatever alpha value you define.
For simple regression (linear or logistic), initial values of zero for w and b work perfectly fine. Any initial value may be chosen, and zero is as good a value as any.
Random initialization is only necessary for neural networks.
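To make the initialization point concrete, here is a small illustrative sketch (the layer sizes are hypothetical, not from the course): zeros are fine for regression, while a neural-network layer uses small random weights so its units don’t all compute the same thing:

```python
import numpy as np

# Linear or logistic regression: starting from zero works fine.
w, b = 0.0, 0.0

# Neural-network layer (hypothetical sizes): small random weights break the
# symmetry between units; biases can still start at zero.
n_in, n_out = 4, 3
W = np.random.randn(n_out, n_in) * 0.01
b_vec = np.zeros(n_out)
```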