In week 1, Andrew explains the GD algorithm and we implement it in the lab. I’m having trouble consolidating Andrew’s theoretical explanation, and the implementation of the algorithm in practice:

Andrew’s description of GD is that the algorithm starts in an arbitrary location, and then proceeds towards the location that minimises the cost.

The way the algorithm works however, is that it runs over all m training examples and updates the parameters w and b.

My question is:
In Andrew’s description, the advancement towards the minimum, is on a path. It doesn’t have to be a straight path, but it is a path. The algorithm however, runs all over the space, from one training example to another. The training examples are not sorted, and so the algorithm computes and updates parameters in one location, then jumps to another training example which is not necessarily alongside the previous one. How can the theoretical explanation, and the actual implementation be reconciled?

Hi,
It requires the math to understand the theory. That somehow professor Ng avoid to make things simpler. As he often says ‘dont worry about the math’. Here is the [link](Gradient descent (article) | Khan Academy you can look at to understand about the algorithm.

In addition, you get a chance to implemet that algorithm from scratch in the programming exercises.

What you sent in the link is the same explanation Andrew gives. It does not explain why going over all unsorted training examples will produce an advancement in one direction.

This really comes down to the concept of continuity. When we think of a smooth (C^1) surface or curve, we think of “traveling” along the curve until we reach the global minimum (remember we’re working with convex functions here). But with examples, think of this as picking random points to travel to instead of a structured path. The beauty of the gradient descent is that with enough picked points you can still reach the global minimum. Indeed, the idea of continuity is that ANY path you pick must take you to the same location when minimizing. In 1D (a line) you can only travel left and right (left-side and right-side limits) and those two paths must agree. In n-dimensional configurations, you must be able to take ANY path. Hope this helps!

Gradient Descent is a medium to find the lowest loss a model can have. This is generally known as global minimum.

How do we find that?

We subtract the minimum change in weight from the old weight. The other parameters like the learning rate also helps in cutting down the loss. Lower learning rate makes the weight change lower.

We pick random values and move down the slope. It continues till we reach the minima or the lowest value of the loss.

The losses can be calculated with the help of MSE (for regression) and cross entropy (for classification)