Questions about week 1 content

Hello, I’ve got a couple of questions from the week 1 lectures and would appreciate it if you could help me with them.

  1. I know from previous statistics courses that linear regression can be solved with other methods, such as maximum likelihood estimation. Here we learned the gradient descent algorithm to find the best fit. Now I am wondering how I should approach problems in general. How do I know which method is most suitable for my data?
  2. I don’t quite understand what is meant by “convergence” in the context of the gradient descent algorithm. How is “convergence” formulated mathematically? In the final lab we used 10,000 iterations to make sure the model parameters converge. I was thinking of using a while loop to repeat the algorithm until “w” and “b” converge, but I am not sure what the condition should be.
  3. In one of the lectures it was mentioned that one issue with the gradient descent algorithm is that we might end up at a local minimum instead of the global minimum, depending on our initialization (for cost functions other than the squared error). But I did not understand what the solution to this issue is.

Thank you very much for your time!

Gradient descent has a computational advantage when the data set is large or has many features, because each update only needs the gradients, which are cheap to compute and scale well. Closed-form statistical methods require computing lots of means and sums of squares, and with many features that amounts to solving a linear system, which gets expensive.
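
To make the "cheap gradients" point concrete, here is a minimal sketch (not the exact lab code) of the gradient computation for the squared-error cost with multiple features - each update is just a couple of vectorized passes over the data:

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Gradients of the squared-error cost for linear regression.

    X: (m, n) feature matrix, y: (m,) targets, w: (n,) weights, b: scalar bias.
    """
    m = X.shape[0]
    errors = X @ w + b - y        # predictions minus targets, shape (m,)
    dj_dw = (X.T @ errors) / m    # partial derivatives w.r.t. each weight
    dj_db = errors.sum() / m      # partial derivative w.r.t. the bias
    return dj_dw, dj_db
```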

Convergence means the algorithm has effectively reached the minimum: further iterations no longer change w and b (or reduce the cost) by more than a negligible amount. A common mathematical formulation is to stop when the decrease in cost between iterations falls below a small tolerance, or when the updates to w and b become negligibly small. The method used in Week 1 (a fixed number of iterations) is presented for simplicity - there are better methods discussed later in the course.
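
On the while-loop idea from question 2, here is a rough sketch of that stopping condition (the learning rate `alpha`, the tolerance `tol`, and the safety cap on iterations are illustrative values, not the lab's):

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha=0.01, tol=1e-7, max_iters=100_000):
    """Run gradient descent until the cost stops improving by more than `tol`."""
    m = X.shape[0]

    def cost(w, b):
        return np.mean((X @ w + b - y) ** 2) / 2

    prev_cost = cost(w, b)
    for _ in range(max_iters):              # safety cap instead of a bare while loop
        errors = X @ w + b - y
        w = w - alpha * (X.T @ errors) / m  # gradient step for the weights
        b = b - alpha * errors.sum() / m    # gradient step for the bias
        new_cost = cost(w, b)
        if abs(prev_cost - new_cost) < tol: # convergence: cost barely changes any more
            break
        prev_cost = new_cost
    return w, b
```

The iteration cap is there because, with a learning rate that is too large, the cost may never settle, and a bare while loop would then run forever.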

If a cost function is convex, then there are no local minima, so whatever minimum gradient descent reaches is the global one. That’s the case for the squared-error cost function. If you have a non-convex cost function (such as for a neural network), then you can run gradient descent multiple times, using different random initial weight values, and keep the solution that gives the lowest cost.
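
As a rough illustration of the multiple-restarts idea (linear regression itself doesn’t need it, since its cost is convex; this just reuses the `gradient_descent` sketch above on toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # toy data purely for illustration
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

best_w, best_b, best_cost = None, None, np.inf
for restart in range(10):                   # number of restarts is arbitrary here
    w0 = rng.normal(size=X.shape[1])        # different random initial weights each run
    b0 = rng.normal()
    w, b = gradient_descent(X, y, w0, b0)   # the routine sketched above
    c = np.mean((X @ w + b - y) ** 2) / 2
    if c < best_cost:                       # keep the run that reaches the lowest cost
        best_w, best_b, best_cost = w, b, c
```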
