Confused over number of iterations and number of weight updates

What is the difference between the number of iterations and the number of weight updates in the Optional Lab : Linear Regression using SciKit-Learn?

The output from running the code under the heading “Create and fit the regression model” is shown below;

“number of iterations completed: 124, number of weight updates: 12277.0”

But I don’t understand the difference here.
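For reference, the model-fitting part of the lab looks roughly like this (a sketch from memory; the exact variable names in the lab may differ):

from sklearn.linear_model import SGDRegressor

# X_norm, y_train: the normalized features and targets prepared earlier in the lab (assumed names)
sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_norm, y_train)

# n_iter_ counts passes over the training set; t_ is the counter reported as "weight updates"
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")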

A complete pass of training over the entire dataset, such that each example has been seen once, is called an epoch. One epoch represents N/batch_size training iterations, where N is the total number of examples.

Whereas an iteration is a single update of a model’s weights during training. An iteration consists of computing the gradients of the parameters with respect to the loss on a single batch of data.

The weights are updated by computing the gradient at each iteration.
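As a concrete example, using numbers from this lab (where stochastic gradient descent effectively uses a batch size of 1), a minimal sketch:

N = 99          # total number of training examples (this lab's dataset size)
batch_size = 1  # stochastic gradient descent updates on one example at a time
iterations_per_epoch = N // batch_size
print(iterations_per_epoch)  # 99 weight updates per epoch (per full pass over the data)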


It’s still not very clear to me.

In the Machine Learning Specialization by Andrew Ng, there are “m” training input rows, not N. And there are “n” parameter weights.

Please explain how the very first weight update is computed in the learning algorithm.

I will be happy to explain, but can you point me to the video or page you are referring to, so that my response doesn’t confuse you?

Indexing an array with X[:m] selects the first m elements of X (along the first dimension if X is multi-dimensional). For each value of m in the for loop, it’s saying “let’s pretend we only had m data points; what would our training and validation accuracies be?”. You should see that, for small m, the model would overfit, so the training accuracy would be close to perfect and the validation accuracy would be very low. As m increases, the training accuracy will decrease, but the validation accuracy will increase. The exact shapes of the curves are useful for diagnosing underfitting/overfitting.

The indices for each dimension are separated by a comma, not a colon. Colons separate the start and end index, if both are used.

For example
X[start_row:end_row, start_column:end_column]. X[:m] is all rows up to the mth, all columns
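A quick NumPy sketch of that indexing (the array contents here are made up purely for illustration):

import numpy as np

X = np.arange(12).reshape(6, 2)  # 6 rows (examples), 2 columns (features)
m = 3
print(X[:m])        # first 3 rows, all columns
print(X[:m, 1])     # first 3 rows, second column only
print(X[1:4, 0:2])  # rows 1..3, columns 0..1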

Remember, when I mentioned N, I said it is the total number of examples, and the weights are the parameters.

Looks like your question was answered by @Matrixx’s post. Deleted my answer to avoid confusion.

I have received replies, but none of them have answered my question.

m is the number of training examples, n is the number of parameter weights.

Have you got an answer to my question please?

I’ll take a swing at answering your question.
Here is the name of the lab:
[image: lab title, “Optional Lab: Linear Regression using Scikit-Learn”]

The model has five parameters to optimize: four weights and a bias value. That totals five parameters, but this does not figure into the calculation you asked for.

The model uses the SGDRegressor from scikit-learn.

SGD is “stochastic gradient descent”. This means that the optimizer processes each individual example and updates the weights, and continues cycling through all of the examples until it converges.
Each pass through the whole dataset is one iteration.

The regressor was configured to use at most 1000 iterations, but only used 119.

If you add a line of code and inspect the number of examples in the dataset, you see there are 99 examples.
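For example, something like this (assuming the training features are stored in a variable such as X_norm; the name in your notebook may differ):

print(X_norm.shape)   # e.g. (99, 4): 99 examples, 4 features
print(len(X_norm))    # 99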

Doing some math, 99 examples * 119 iterations = 11,781 updates of the weights.

I do not know where the extra weight update comes from.

:QED

Hello @ai_is_cool,

print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")

SGDRegressor performs one weight update (with gradient descent) per sample. An iteration is one pass of the whole training set.

If we have m samples, then each iteration will have m updates. sgdr.n_iter_ tells us the number of times the whole training set was used, which was 124. sgdr.t_ tells us, as explained in sklearn’s documentation which is also quoted below, the result of this formula: “(n_iter_ * n_samples + 1)”, which is 124 × m + 1 = 124 × 99 + 1 = 12277.
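You can verify that formula directly against the fitted model (using the same assumed variable names as in the earlier sketch):

m = X_norm.shape[0]                       # number of training samples, 99 here
print(sgdr.n_iter_ * m + 1 == sgdr.t_)    # True: 124 * 99 + 1 == 12277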

[image: excerpt from the sklearn SGDRegressor documentation describing the t_ attribute]

I have quickly read this and this code and found that the variable t_ always starts from 1 instead of 0, so I think the meaning of it should be “number of weight updates (124 × 99) plus one”, and here I am specifically referring to the weight updates carried out with gradient descent. In fact, if we look at these lines, we can clearly see that t there was not meant to be the number of weight updates; it can only be recovered by subtracting one from t_.

I suggest you follow the sklearn code through once, and you should come across the pieces of code linked above. I think it will be nice to base the discussion on the code, and we can share our findings if there is any disagreement!

Cheers,
Raymond


Thanks Raymond for taking the time to reply to my question; however, it is still not clear to me what the difference is between an iteration and a weight update.

One weight update for the first weight w_0 is the following operation in pseudo code:

w_0 = w_0 - alpha * gradient_function(w, X, b, Y)

Agreed?

Yes, if you are considering that gradient function as the gradient of the cost function, then that is correct.

Although usually the weights are initially assigned zero.

The usual equations would be

w_0 = 0

w_1 = w_0 - alpha * gradient_function (dJ/dw)

Assigning all the weights an initial value of zero is called zero initialization. For a neural network, this kind of initialization is highly ineffective, as the neurons learn the same features during each iteration.
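In code, that zero initialization followed by a first update might look like this (the gradient values are made up just to show the mechanics):

import numpy as np

n = 4                                      # number of weights, as in this lab's four features
w = np.zeros(n)                            # zero initialization
alpha = 0.01
dj_dw = np.array([0.5, -1.2, 0.3, 0.9])    # pretend gradient of the cost w.r.t. the weights
w = w - alpha * dj_dw                      # first weight update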

So if the following is one weight parameter update for weight parameter w_0:

w_0 = w_0 - alpha * gradient_function(w, X, b, Y)

then the above operation is a single iteration of a for-loop which executes that operation a specified number of times, or “iterations”.

Agreed?

yes @ai_is_cool

So, logically, one weight parameter update is the same as one iteration of the for-loop?

Agreed?

It depends on what is inside your for-loop.

If you’re doing batch gradient descent, then you accumulate the gradients for each feature over the whole training set, then update all of the weights at the end. So that’s one update per batch.

If you’re doing stochastic gradient descent, then you compute a weight update every time you compute the gradients for each example.

You may be thinking about this in a little more detail than is really necessary.
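Here is a minimal sketch of that difference, with a hand-written compute_gradient helper for a plain linear-regression cost (this helper and the toy data are my own illustration, not the lab’s code):

import numpy as np

def compute_gradient(w, b, X, y):
    # Gradients of the mean squared error cost for the linear model f(x) = X @ w + b
    err = X @ w + b - y
    dj_dw = X.T @ err / len(y)
    dj_db = err.mean()
    return dj_dw, dj_db

# Toy data, just for illustration
rng = np.random.default_rng(0)
X = rng.random((99, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 3.0
w, b, alpha = np.zeros(4), 0.0, 0.01

# Batch gradient descent: ONE weight update per pass over the whole dataset
for iteration in range(100):
    dj_dw, dj_db = compute_gradient(w, b, X, y)
    w = w - alpha * dj_dw
    b = b - alpha * dj_db

# Stochastic gradient descent: one weight update per EXAMPLE, so 99 updates per pass
for epoch in range(100):
    for i in range(X.shape[0]):
        dj_dw, dj_db = compute_gradient(w, b, X[i:i+1], y[i:i+1])
        w = w - alpha * dj_dw
        b = b - alpha * dj_db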

Yes, in this case it holds true, but for algorithms where the model is trained per batch_size the answer will differ,

as the iterations will be the steps within an epoch, and one epoch will be one complete pass through the whole model training dataset.

So inside the for-loop there are other computations as follows:

w_1 = w_1 - alpha * gradient_function(w, X, b, Y)

w_2 = w_2 - alpha * gradient_function(w, X, b, Y)

w_3 = w_3 - alpha * gradient_function(w, X, b, Y)

b = b - alpha * gradient_function_for_b(w, X, b, Y)

w = (w_0, w_1, w_2, w_3)

Agreed?
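In vectorized form, those per-weight lines can collapse into a single update of the whole weight vector; a rough sketch (reusing the hypothetical compute_gradient helper from the earlier sketch):

# Inside the for-loop: all weights and the bias are updated together, once per pass
dj_dw, dj_db = compute_gradient(w, b, X, y)   # dj_dw has one entry per weight
w = w - alpha * dj_dw                         # updates w_0, w_1, w_2, w_3 in one step
b = b - alpha * dj_db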


“Update” just means that a new value is stored in some variable. It’s not really a “term of art” in the machine learning world.

I don’t think this is really a critical concept at this point. There are lots of lectures left in this course, perhaps the topic will become clearer for you later.

I feel that maybe your learning is getting bogged down over a minor issue regarding terminology.