For gradient descent to work, we need to find a local minimum. So, as explained in the lectures, the loop keeps running until the updated w and b reach a local minimum. What does a local minimum mean here, and what will the final values of w and b be when it is reached? How are those values of w and b considered the minimum?
Also,
in the lab code for gradient descent, in the gradient descent function, how was it decided that 10,000 iterations would give us the optimal w and b values?
As covered in the Course 1 Week 1 videos, we are searching for the values of w and b that correspond to the minimum value of the cost function J. We do not know in advance which values of w and b those are, so we use a learning algorithm to help us find them.
The value of 10,000 iterations was chosen arbitrarily. There is no guarantee that this setting always gives the optimal values of w and b. For smaller datasets with fewer features, we can safely say that this number is enough to get us close to the minimal cost. But on bigger, more complex datasets, we might have to go even higher if the cost has not yet converged to its minimal value.
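Here is a minimal sketch (toy data and simplified code, not the lab's implementation) showing that num_iters is just a knob we pick: with 100 iterations the estimates of w and b are still rough, while 10,000 iterations lands much closer to the true values.

```python
# A minimal sketch of single-feature linear regression gradient descent,
# to show that num_iters is simply a value we choose up front.
import numpy as np

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=10000):
    m = len(x)
    for _ in range(num_iters):
        err = (w * x + b) - y          # prediction error for each example
        dj_dw = (err @ x) / m          # partial derivative of cost w.r.t. w
        dj_db = err.sum() / m          # partial derivative of cost w.r.t. b
        w -= alpha * dj_dw             # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

# Toy data generated from y = 2x + 1; more iterations -> closer to w=2, b=1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
print(gradient_descent(x, y, num_iters=100))    # still a rough estimate
print(gradient_descent(x, y, num_iters=10000))  # much closer to (2, 1)
```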
When we computed the cost manually over some range of w and b values, we could figure out the optimal w and b by inspection. But with gradient descent, how can we say that 10,000 iterations give the optimal w and b while 100 iterations do not? How can I tell that the w and b I get after 100 iterations are not optimal?
Also, in multivariable linear regression, we are using a vector of w and x values. Can a straight line be formed with multiple w's (multiple slopes)? I don't think so, so how do these multiple w's fit into linear regression as a straight line?
If the cost is still decreasing at a reasonable rate at 100 iterations, that's an indication that we are still not close to the minimum. Basically, 3 values keep changing at every iteration: the cost, w, and b. It is then up to us to decide at what \Delta (incremental change) of these values to call it off and exit the learning algorithm. In later videos, we will visit more efficient techniques to help us decide when to stop the training.
If X has multiple features, then it is said to be n-dimensional, where n is the number of features. The model is still linear in n dimensions. The multiple w's will be the coefficients for each of the features.
If X is represented by n features, X = [x_1, x_2, ..., x_n], then

w_1x_1 + w_2x_2 + \dots + w_nx_n + b

will be the equation of the model.
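As an illustration (the variable names and numbers here are just made up), the whole expression collapses to a single dot product in code:

```python
# A minimal sketch of the multi-feature prediction w_1*x_1 + ... + w_n*x_n + b.
import numpy as np

w = np.array([0.5, -1.2, 3.0])   # one weight per feature (n = 3)
b = 4.0                          # single bias term
x = np.array([2.0, 0.5, 1.0])    # one example with 3 features

y_hat = np.dot(w, x) + b         # same as w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b
print(y_hat)                     # 0.5*2.0 + (-1.2)*0.5 + 3.0*1.0 + 4.0 = 7.4
```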
So, you mean to say: after 100 iterations, with those values of w and b, I need to check the cost, then do one more iteration and check the cost again with the new w and b. At that point, if the cost is still decreasing, I need to continue with more iterations to get the optimal w and b?
Yeah, I do understand the concept of multiple features of x, but what I want to understand is in terms of the equation of a line. Can a line be formed with multiple slopes and x's? I haven't seen an equation of a line like the one below:
y = m_1x_1 + m_2x_2 + b
Could you help me understand ?
In addition to what @shanup said, there is a difference between a local minimum of the cost and the global minimum of the cost, as in this screenshot:
Machine learning algorithms such as gradient descent may get stuck in local minima while training a model. Gradient descent usually finds a local minimum rather than the global minimum, because it only follows the local slope and has no view of the whole cost surface. Current techniques for finding global minima either require extremely high iteration counts or a large number of random restarts to perform well. Global optimization can also be quite difficult when high loss barriers exist between local minima. This can also happen because the data come from different distributions or are on different scales and weren't normalized. My advice here is to normalize your data and initialize the weights with small (normalized) random values. In the next course you will also learn about a very powerful model, the neural network, and about an optimized training method with better convergence behaviour, called Adam, which can reach a low cost more easily.
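As a small, illustrative sketch of the normalization advice above (the feature values are made up, not from the course labs), z-score normalization rescales every feature to a similar range, which helps gradient descent converge instead of zig-zagging:

```python
# Z-score normalization: rescale each feature to mean 0 and standard deviation 1.
import numpy as np

X = np.array([[2104.0, 5.0],   # feature 1 is in the thousands, feature 2 in single digits
              [1416.0, 3.0],
              [852.0,  2.0]])

mu = X.mean(axis=0)            # per-feature mean
sigma = X.std(axis=0)          # per-feature standard deviation
X_norm = (X - mu) / sigma      # every feature now lives on a comparable scale

print(X_norm)
```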
Also, the weights w and b are not specific numbers that must be the same in every model. They are parameters that gradient descent (or any other training algorithm) tries to tune, by repeatedly updating w and b, so that the predicted values (obtained by multiplying the weights with the training examples, w_1x_1 + w_2x_2 + \dots + b) get close to the real values we have.
For getting optimal w and b, YES. The reason is simple: cost still reducible = not yet optimal. You can start with 100000000000 iterations, and add one or more stopping criteria to stop the iterations. For example, if you are happy to stop when the cost isn’t improved by more than 0.000001, then that is a criterion you can easily implement with code.
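Here is a rough sketch of that idea (toy single-feature data, not the lab code): give the loop a huge budget, but break out as soon as the cost improves by less than a chosen tolerance.

```python
# Gradient descent with a simple stopping criterion: exit once the cost
# improvement per iteration falls below a tolerance.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m = len(x)
w, b, alpha, tol = 0.0, 0.0, 0.01, 1e-6

def cost(w, b):
    return ((w * x + b - y) ** 2).sum() / (2 * m)

prev = cost(w, b)
for i in range(100_000_000):      # large budget; the break below ends it far sooner
    err = w * x + b - y
    w -= alpha * (err @ x) / m
    b -= alpha * err.sum() / m
    c = cost(w, b)
    if prev - c < tol:            # improvement smaller than tolerance -> stop
        print(f"stopped at iteration {i}, cost {c:.6f}, w {w:.4f}, b {b:.4f}")
        break
    prev = c
```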
I would like to bring one more concept in this discussion.
Optimal w and b \ne best performing w and b.
Optimal w and b can be when the training cost is minimal.
Best-performing w and b is when the cross-validation (cv) cost (or any other metric of your choice) is minimal.
Therefore, to add a stopping criterion, think about which of the 2 you want, then calculate the cost (or metric) on the right dataset (training or cv), and then give a reasonable threshold so that the iterations will stop when the threshold is met.
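A small sketch of this distinction, on made-up data (the train/cv split and all numbers here are purely illustrative): the loop below minimizes the training cost, but separately remembers the w and b that gave the lowest cost on a held-out (cv) split.

```python
# Track training cost for "optimal" w,b and cv cost for "best-performing" w,b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 30)   # noisy line
x_tr, y_tr = x[:20], y[:20]                  # training split
x_cv, y_cv = x[20:], y[20:]                  # cross-validation split

def cost(w, b, xs, ys):
    return ((w * xs + b - ys) ** 2).sum() / (2 * len(xs))

w, b, alpha = 0.0, 0.0, 0.005
best_cv, best_wb = np.inf, (w, b)
for _ in range(20000):
    err = w * x_tr + b - y_tr
    w -= alpha * (err @ x_tr) / len(x_tr)
    b -= alpha * err.sum() / len(x_tr)
    cv = cost(w, b, x_cv, y_cv)
    if cv < best_cv:                         # remember the best-performing w,b so far
        best_cv, best_wb = cv, (w, b)

print("final training cost:", cost(w, b, x_tr, y_tr))
print("best cv cost:", best_cv, "at w,b =", best_wb)
```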
Let’s be clear about the names first.
When you have ONLY ONE feature, i.e. y=w_1x_1 + b. This is called a line.
When you have two features, i.e. y=w_1x_1 + w_2x_2 + b. This is called a plane.
When you have three or more features, i.e. y=w_1x_1 + w_2x_2 + w_3x_3 + .... + b. This is called a hyperplane.
Typically you’ll make a plot of the cost for each iteration, and from this plot you can see if the cost is still decreasing. Deciding the learning rate and the number of iterations is a process of experimentation.
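For example (toy data and matplotlib, not the lab's plotting code), you could record the cost at every iteration and plot it afterwards; a curve that is still falling at the right edge suggests more iterations, or a different learning rate, are needed.

```python
# Record the cost at every iteration of gradient descent and plot it.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m = len(x)
w, b, alpha = 0.0, 0.0, 0.01

history = []
for _ in range(2000):
    err = w * x + b - y
    history.append((err ** 2).sum() / (2 * m))   # cost before this update
    w -= alpha * (err @ x) / m
    b -= alpha * err.sum() / m

plt.plot(history)
plt.xlabel("iteration")
plt.ylabel("cost J(w, b)")
plt.title("Cost vs. iteration")
plt.show()
```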
We’re using the linear combination of the features and weights. This does not mean the shape of the graph of the examples is a straight line. The concepts are rather different.
Thanks, I'm a bit clearer on the first point now.
2) Regarding my second query, what exactly are w and b called in the case of planes, or in general when dealing with cost functions and those equations?
a) Also, how does multi-variable linear regression amount to fitting a straight line?
b) Do you have any graphs or resources showing a plot of the cost function against the multiple w's and b?
There is no such thing. You can only fit a straight line for a one-feature problem. You can fit a plane for a two-feature problem. You can fit a hyperplane for a problem with three or more features.
Let’s take a step back, and think about this, can you draw a 4D object on a paper? What about a 30D object on a paper? You may google on how people try to visualize N-D data.
‘w’ is the weight vector. It doesn’t matter how large the vector is, it’s still a weight vector.
‘b’ is the bias value. It’s always a scalar.
For a simple example of a 2D curve that can’t be fit very well by a straight line, consider the simple parabola.
In algebra you’d use this notation: y = ax^2 + bx + c
In machine learning, you’d implement it as:
y = (w_1 * x**2) + (w_0 * x) + b
w_1 and w_0 are the two elements of the 'w' weight vector.
x^2 and x are the two features. You’re given ‘x’ with the training set. You compute x^2 and include that as another feature. So there are two features, that’s why ‘w’ is a 2-element vector.
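A short sketch of this idea on made-up data is below. Note that, just to keep it brief, the weights are found with np.linalg.lstsq instead of gradient descent, but the model has exactly the linear-in-w form shown above.

```python
# Build x and x**2 as two features: the model stays linear in the weights w
# even though the curve it fits is a parabola, not a straight line.
import numpy as np

x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5               # data generated from a parabola

X = np.column_stack([x, x**2])               # two engineered features: x and x^2
A = np.column_stack([X, np.ones_like(x)])    # add a column of 1s for the bias b
w0, w1, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(w1, w0, b)                             # recovers approximately 2.0, -1.0, 0.5
y_hat = w1 * x**2 + w0 * x + b               # same form as y = (w_1 * x**2) + (w_0 * x) + b
```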
“Let’s take a step back, and think about this, can you draw a 4D object on a paper? What about a 30D object on a paper? You may google on how people try to visualize N-D data.”
Yeah, that’s exactly my point. So there is no fitting of a straight line in multi-variable regression the way there is in the single-variable case. It's just that we figure out the w's and b using gradient descent for the model with multiple features, and then we compute the final predicted value using those multiple features and the optimal w's and b?
Yes. We have to do the prediction step mathematically using the optimal w’s and b. Visualization is infeasible for general high dimensional cases.
It seems to me that “fitting a line” has a special place in your mind when it comes to model fitting or model training. However, a geometric line is only the case when we have ONE single feature (or variable). For example, y=w_1x_1+b is a straight line; y=w_1x_1^2 + w_2x_1+b is a second-order, single-variable problem, which is a curved line; y=w_1x_1 + w_2x_2 +b is a plane.
If you have come across any reading that said it is always a line no matter how many features there are, then I am afraid you might need to read it more carefully, or you may want to read some other articles.
I have said that w is called the weight and b is called the bias, because these are the names we use in this specialization. Course 1 introduces linear regression and logistic regression because they are the most basic forms of many machine learning problems, and they are also good examples for getting introduced to the concept of gradient descent, which will be part of the core of deep learning. Very soon, in Course 2, we will come across multi-layer neural networks with many, many weights. At that point, those weights will be even harder to interpret geometrically, and we can only simply call them weights and biases.
Lastly, there is a term that I didn't want to bring up when you asked me for the names of the w's and b in a multiple regression problem. That term is “normal”. It is quite a geometric interpretation of the w's. You may take a look at page 2, bullet point 1 of this PDF for some text about it, and for terms you can use to further your Google search. However, in this course we will not interpret the w's and b as a normal, nor as slopes, when it comes to a many-feature problem. Geometric interpretation of a many-feature model will also become less and less common as you go deeper and deeper into the world of neural networks.
@rmwkwok, @TMosh, thank you for patiently helping me understand these concepts clearly, even with my sometimes silly doubts. I hope to learn more during this course and from you mentors.