Gradient Descent Algorithm Computation

Mor_Zahavi · July 12, 2023, 9:07am

In week 1, Andrew explains the GD algorithm and we implement it in the lab. I’m having trouble consolidating Andrew’s theoretical explanation, and the implementation of the algorithm in practice:

Andrew’s description of GD is that the algorithm starts in an arbitrary location, and then proceeds towards the location that minimises the cost.

The way the algorithm works however, is that it runs over all m training examples and updates the parameters w and b.

My question is:
In Andrew’s description, the advancement towards the minimum, is on a path. It doesn’t have to be a straight path, but it is a path. The algorithm however, runs all over the space, from one training example to another. The training examples are not sorted, and so the algorithm computes and updates parameters in one location, then jumps to another training example which is not necessarily alongside the previous one. How can the theoretical explanation, and the actual implementation be reconciled?

Phuc_Kien_Bui · July 12, 2023, 9:28am

Hi,
It requires the math to understand the theory. That somehow professor Ng avoid to make things simpler. As he often says ‘dont worry about the math’. Here is the [link](Gradient descent (article) | Khan Academy you can look at to understand about the algorithm.

In addition, you get a chance to implemet that algorithm from scratch in the programming exercises.

Mor_Zahavi · July 12, 2023, 10:54am

What you sent in the link is the same explanation Andrew gives. It does not explain why going over all unsorted training examples will produce an advancement in one direction.

saifkhanengr · July 12, 2023, 11:12am

The path you are referring to is a convex shape. Maybe reading my this article will help you to understand it better.

Matthew_McCoy · July 12, 2023, 2:26pm

This really comes down to the concept of continuity. When we think of a smooth (C^1) surface or curve, we think of “traveling” along the curve until we reach the global minimum (remember we’re working with convex functions here). But with examples, think of this as picking random points to travel to instead of a structured path. The beauty of the gradient descent is that with enough picked points you can still reach the global minimum. Indeed, the idea of continuity is that ANY path you pick must take you to the same location when minimizing. In 1D (a line) you can only travel left and right (left-side and right-side limits) and those two paths must agree. In n-dimensional configurations, you must be able to take ANY path. Hope this helps!

Rashmi · July 12, 2023, 2:54pm

Gradient Descent is a medium to find the lowest loss a model can have. This is generally known as global minimum.

How do we find that?

We subtract the minimum change in weight from the old weight. The other parameters like the learning rate also helps in cutting down the loss. Lower learning rate makes the weight change lower.

We pick random values and move down the slope. It continues till we reach the minima or the lowest value of the loss.

The losses can be calculated with the help of MSE (for regression) and cross entropy (for classification)

Topic		Replies	Views
Gradient Descent - Pls explain how it knows which direction from top of the hill Supervised ML: Regression and Classification week-module-1	12	569	October 21, 2023
I need to know why we choose this Supervised ML: Regression and Classification week-module-1	5	384	December 2, 2023
Gradient Descent Doubt AI Discussions	7	116	July 11, 2022
Why we need GD algorithm if we can still optimise parameters of the model? Calculus for Machine Learning and Data Science week-module-2	7	371	August 24, 2023
Convergence of Gradient Descent Supervised ML: Regression and Classification week-module-2	1	509	July 10, 2022

Gradient Descent Algorithm Computation

Related topics