Supervised Machine Learning Optional lab: Gradient descent Question

Hello, while doing the Gradient Descent lab, I found something I can't understand about how the function gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function) is implemented.

If I understood it well, the w_final and b_final returned are the w and b calculated at iteration number 10000, that is, the latest iteration. I think that's not correct, because it is not guaranteed that the last iteration returns the minimum cost value among all the iterations (am I reading the code wrongly?). I think we should save w_final and b_final according to the minimum cost calculated, and return those after iteration number 10000.

Could someone please explain this to me? Please find the function below.

import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):

    # Arrays to store cost J and parameters w, b at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = w_in

    for i in range(num_iters):
        # Calculate the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update the parameters using equation (3) above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J and parameters at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # Print the cost at ~10 evenly spaced intervals (every iteration if num_iters < 10)
        if i % math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history   # return final w, b and the J, p histories for graphing

Hello @SAOUTARRIH_Marouane

w_final and b_final are, as you said, the last iteration’s w and b respectively.

As you also said, it is not guaranteed to be the w and b associated with the minimum cost. However, due to the nature of gradient descent, and given that a reasonable learning rate is used, the cost (with respect to the training set) should only decrease over iterations, so it is very likely that w_final and b_final correspond to the minimum cost measured on the training set.
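That said, if you would like to try your idea, it only takes a small change. Here is a minimal sketch (gradient_descent_best is my own name, not from the lab) that remembers the (w, b) with the lowest cost seen so far:

def gradient_descent_best(x, y, w_in, b_in, alpha, num_iters,
                          cost_function, gradient_function):
    # Hypothetical variant that also tracks the lowest-cost (w, b) seen
    w, b = w_in, b_in
    best_w, best_b = w, b
    best_cost = cost_function(x, y, w, b)

    for _ in range(num_iters):
        dj_dw, dj_db = gradient_function(x, y, w, b)
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        cost = cost_function(x, y, w, b)
        if cost < best_cost:        # remember the best parameters so far
            best_cost = cost
            best_w, best_b = w, b

    return best_w, best_b, best_cost

With a well-chosen learning rate, you should find that the returned values match the final iteration's w and b anyway.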

Cheers,
Raymond

@SAOUTARRIH_Marouane,

Your idea becomes more interesting if we include another set of data called the “dev set”.

The idea of a “dev set” will be covered in Course 2, but in short, it is a separate dataset used not for training a model but for evaluating it. When training, besides computing the cost with the training set, we can also compute another cost with the dev set.

While the training set’s cost tends to only decrease over iterations, the dev set’s cost usually stops decreasing at some point in the training process and then begins to rise. The U-turn of the dev set’s cost is a good signal that that iteration’s (not necessarily the final iteration’s) w and b are the best. In that case, we would want to keep the w and b from the iteration of that U-turn instead of the final iteration’s.
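To make that concrete, here is a rough sketch that records both costs at every iteration so the U-turn can be located; x_dev and y_dev are a hypothetical dev set, and the other names reuse the lab's (none of this is in the lab itself):

# Sketch: record the training cost and a hypothetical dev set's cost
# (x_dev, y_dev) at every iteration, so the U-turn can be located.
w, b = w_in, b_in
train_history, dev_history = [], []
for i in range(num_iters):
    dj_dw, dj_db = gradient_function(x, y, w, b)
    b = b - alpha * dj_db
    w = w - alpha * dj_dw
    train_history.append(cost_function(x, y, w, b))
    dev_history.append(cost_function(x_dev, y_dev, w, b))

# the best iteration is where the dev cost bottoms out (the U-turn)
best_iter = dev_history.index(min(dev_history))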

However, our lab doesn’t use a dev set, so the cost is always computed on the training set; consequently, we can expect that cost to keep decreasing, and returning the final iteration’s parameters is sufficient.

Cheers,
Raymond

Thank you for the explanation, but I would like to suggest adding some clarification to the lab in order to prevent confusion among learners.

  • It would be helpful to explain how the number of iterations was determined. Additionally, it should be explained that although this algorithm does not guarantee 100% optimal values for w and b, it provides the closest possible values; once the optimum is approached, the algorithm may keep fluctuating around the same values if the step size is small enough. (If it’s not appropriate to delve into this at the moment, we can mention that the current approach should be followed and that more detailed explanations will be provided in the upcoming weeks.)
  • Regarding the alpha value (learning rate), it would also be beneficial to explain how this value was selected. (If it’s not the right time to address this, we can mention that the current value should be used for now, and further explanations will be provided in future weeks.)

You are welcome, @SAOUTARRIH_Marouane. :wink:

Just to add some information: we discussed the learning rate in a lecture before that lab, covering the effects of a too-large and a too-small learning rate, and I believe the lab has also demonstrated what kind of problem we run into with a too-large learning rate.

As for the exact value, with that lecture in mind, the choice could be any value small enough not to diverge but not so small that training becomes too slow. I am not sure whether that lab was tuned very carefully to use the best learning rate, but it is also something we can try ourselves, even though the lab doesn't ask for it.
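For example, one quick experiment (not asked for by the lab) would be to sweep a few values and compare the final cost, assuming the lab's x_train, y_train, compute_cost and compute_gradient are in scope:

# Hypothetical experiment: sweep a few learning rates and compare the
# final cost, reusing the thread's gradient_descent function. A too-large
# alpha should show the divergence problem from the lecture.
for alpha in [1.0e-3, 1.0e-2, 1.0e-1]:
    w, b, J_hist, _ = gradient_descent(x_train, y_train, 0, 0, alpha,
                                       1000, compute_cost, compute_gradient)
    print(f"alpha = {alpha:.0e}: final cost {J_hist[-1]:0.2e}, w = {w:0.3e}, b = {b:0.3e}")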

Cheers,
Raymond

@rmwkwok
Hello sir, I have an additional question regarding this lab: why was the number 10000 chosen as the number of iterations to calculate w and b, rather than repeating until convergence, in other words, until the partial derivatives reach 0, which would be at the local minimum?

That’s a good question, @Mahmoud_Mokhiamar. I have just glanced through the slides again, and although I am not 100% sure, I think we simply didn’t cover how to determine convergence, so in the labs we were only demonstrating gradient descent, not guaranteeing convergence.

However, we can briefly talk about convergence here.

A very straightforward approach is to check the change in the cost value computed on the training set, and declare convergence if the change is smaller than a certain threshold. However, there is no simple way to determine the threshold, because it depends on the problem: for example, if it is a regression, the scale of the labels matters, and so does how tolerant we are of the size of the errors.
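As a sketch, such a check could be wired into the lab-style loop like this (the threshold value is arbitrary, and the variable names reuse the lab's):

# Sketch: stop once the improvement in cost falls below a threshold.
# The threshold value (1e-7 here) is arbitrary and problem-dependent,
# which is exactly the difficulty described above.
threshold = 1e-7
w, b = w_in, b_in
prev_cost = cost_function(x, y, w, b)
for i in range(num_iters):
    dj_dw, dj_db = gradient_function(x, y, w, b)
    b = b - alpha * dj_db
    w = w - alpha * dj_dw
    cost = cost_function(x, y, w, b)
    if abs(prev_cost - cost) < threshold:   # tiny change: declare convergence
        print(f"Converged at iteration {i}, cost {cost:0.2e}")
        break
    prev_cost = cost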

Therefore, although the above way is straightforward, it is not the most practical approach.

Another approach is to compute the cost value on a separate cv set. The concept of a “cv set” will be taught in Course 2, but to give you a quick idea, a “cv set” is just a separate dataset that you keep away from gradient descent; instead, each time we complete gradient descent over the whole training set (one such pass is called an epoch), we compute the cost value with respect to the cv set.

Since gradient descent is not based on the cv set, the cv set provides an independent evaluation of the model’s performance at every epoch. Once that performance stops improving (or begins to deteriorate), we may stop the training.

In contrast to the first approach, where setting the threshold is difficult, the signal of the second approach is very clear. Very often we will see the cv set cost decrease at first, then the decrease slows down, and at some point the cv set cost starts to increase; that is the exact moment we want to freeze our trainable parameters at. Even so, it is common practice to keep training for a few more epochs to see whether it really won’t improve any more and, if indeed not, to go back to the state where we wanted to freeze the trainable parameters.
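If it helps, here is a rough epoch-level sketch of that logic; train_one_epoch, cv_cost, params and max_epochs are all hypothetical names, not from the course:

# Sketch of "patience": keep training a few more epochs after the cv
# cost stops improving, then restore the best parameters seen.
# params is assumed to hold initialized model parameters (e.g. w, b).
import copy

patience = 5                        # extra epochs to wait before giving up
best_cv_cost = float("inf")
best_params = None
epochs_since_best = 0

for epoch in range(max_epochs):
    params = train_one_epoch(params)        # hypothetical: one pass of gradient descent
    cost = cv_cost(params)                  # hypothetical: cost on the cv set

    if cost < best_cv_cost:
        best_cv_cost = cost
        best_params = copy.deepcopy(params) # freeze the best state so far
        epochs_since_best = 0
    else:
        epochs_since_best += 1
        if epochs_since_best >= patience:
            break                           # it really won't improve any more

params = best_params                        # go back to the best state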

The second approach is called “early stopping” and is a very commonly used approach for algorithmically determining when to stop training. If you would like to search for more articles, those are the keywords to begin with.

Cheers,
Raymond