An epoch is one complete pass over the entire dataset, such that each example has been seen once. One epoch represents N / batch_size training iterations, where N is the total number of examples.
An iteration, on the other hand, is a single update of a model's weights during training. An iteration consists of computing the gradients of the parameters with respect to the loss on a single batch of data.
The weights are updated by computing the gradient at each iteration.
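For example, here is a minimal sketch of that bookkeeping in Python (the numbers here are made up, not from any particular lab):

```python
# Made-up example values, not from any particular lab.
N = 1000          # total number of training examples
batch_size = 50   # examples processed per iteration

iterations_per_epoch = N // batch_size  # one weight update per batch
print(iterations_per_epoch)             # 20 iterations make up one epoch
```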
I will be happy to answer your questions, but can you point me to the video or page you are referring to, so that my response doesn't confuse you?
Indexing an array with X[:m] selects the first m elements of X (along the first dimension if X is multi-dimensional). For each value of m in the for loop, it's saying "let's pretend we only had m data points; what would our training and validation accuracies be?". You should see that, for small m, the model will overfit, so the training accuracy will be close to perfect and the validation accuracy will be very low. As m increases, the training accuracy will decrease, but the validation accuracy will increase. The exact shapes of the curves are useful for diagnosing underfitting/overfitting.
The indices for each dimension are separated by a comma, not a colon. Colons separate the start and end index, if both are used.
For example:
X[start_row:end_row, start_column:end_column]. So X[:m] selects the first m rows (up to, but not including, row m) and all columns.
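Putting the slicing and the loop together, here is a rough sketch of what the notebook is doing (the data, model, and split are my own stand-ins, not the course's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data, just to illustrate the X[:m] idea.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, y_train = X[:150], y[:150]   # X[:150] = first 150 rows, all columns
X_val, y_val = X[150:], y[150:]

for m in range(20, 151, 10):          # "pretend we only had m data points"
    model = LogisticRegression().fit(X_train[:m], y_train[:m])
    train_acc = model.score(X_train[:m], y_train[:m])
    val_acc = model.score(X_val, y_val)
    print(f"m={m:3d}  train={train_acc:.2f}  val={val_acc:.2f}")
```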
Remember when I mentioned N? I said it is the total number of examples, not the number of parameter weights.
I’ll take a swing at answering your question.
Here is the name of the lab:
The model has five parameters to optimize: four weights and a bias value. But this total does not figure into the calculation you asked about.
The model uses the SGDRegressor from scikit-learn.
SGD is "stochastic gradient descent". This means that the optimizer processes each individual example and updates the weights, and continues cycling through all of the examples until it converges.
Each pass through the whole dataset is one iteration.
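In rough pseudocode, that loop looks something like this sketch (my own simplified version for a linear model with squared error, not scikit-learn's actual implementation):

```python
import numpy as np

# A simplified per-example update loop for a linear model y ~ X @ w + b.
def sgd_fit(X, y, alpha=0.01, max_epochs=1000, tol=1e-6):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    prev_loss = np.inf
    for epoch in range(max_epochs):       # each full pass = one "iteration"
        for i in range(n):                # one weight update per example
            err = X[i] @ w + b - y[i]
            w -= alpha * err * X[i]       # per-example gradient of squared error
            b -= alpha * err
        loss = np.mean((X @ w + b - y) ** 2)
        if prev_loss - loss < tol:        # crude stand-in for a convergence test
            return w, b, epoch + 1
        prev_loss = loss
    return w, b, max_epochs
```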
The regressor was configured to use at most 1000 iterations, but only used 119.
If you add a line of code and inspect the number of examples in the dataset, you'll see there are 99 examples.
Doing some math, 99 examples * 119 iterations = 11,781 updates of the weights.
I do not know where the extra weight update comes from.
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")
SGDRegressor performs one weight update (with gradient descent) per sample. An iteration is one pass through the whole training set.
If we have m samples, then each iteration will involve m updates. sgdr.n_iter_ tells us the number of times the whole training set was used, which was 124. sgdr.t_ tells us, as explained in sklearn's documentation quoted below, the result of the formula "(n_iter_ * n_samples + 1)", which is 124 \times m + 1 = 124 \times 99 + 1 = 12277.
I have quickly read this and this code and found that the variable t_ always starts from 1 instead of 0, so I think its meaning should be "number of weight updates (124 \times 99) plus one", where I am specifically referring to the weight updates carried out with gradient descent. In fact, if we look at these lines, we can clearly see that t here was not meant to be the number of weight updates; it can only be recovered by subtracting one from t_.
I suggest you follow through the sklearn code once; you should come across the pieces of code linked above. I think it will be nice to base the discussion on the code, and we can share our findings if there is any disagreement!
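If you want to check the formula without digging through the source, a quick experiment along these lines should reproduce it (the toy data here is my own, not the lab's):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Made-up toy data: 99 examples, 4 features, echoing the lab's shapes.
rng = np.random.default_rng(0)
X = rng.normal(size=(99, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 4.0

sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X, y)

# t_ starts counting from 1, so it should equal n_iter_ * n_samples + 1.
print(f"n_iter_ = {sgdr.n_iter_}, t_ = {sgdr.t_}, "
      f"n_iter_ * n_samples + 1 = {sgdr.n_iter_ * X.shape[0] + 1}")
```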
Thanks Raymond for taking the time to reply to my question; however, it is still not clear to me what the difference is between an iteration and a weight update.
One weight update for the first weight w_0 is the following operation, in pseudocode:
w_0 = w_0 - alpha * (dJ/dw_0)
Yes, if you consider that gradient function to be the gradient of the cost function, then that is correct.
Although, usually the weight is initially assigned zero.
The usual equations would be:
w_0 = 0
w_1 = w_0 - alpha * (dJ/dw)
Assigning zero to all the weights as the initial value is called zero initialization. This kind of initialization is highly ineffective for neural networks, as the neurons learn the same features during each iteration.
If you’re doing batch gradient descent, then you accumulate the gradients for each feature over the whole training set, then update all of the weights at the end. So that’s one update per batch.
If you're doing stochastic gradient descent, then you apply a weight update each time you compute the gradients for a single example.
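To make the contrast concrete, here is a sketch of one pass under each scheme (my own simplified code for a linear model with squared error, not from the course):

```python
import numpy as np

# One pass of batch gradient descent: gradients are accumulated over
# the whole training set, then the weights are updated once at the end.
def batch_gd_epoch(X, y, w, b, alpha):
    err = X @ w + b - y
    w = w - alpha * (X.T @ err) / len(y)
    b = b - alpha * err.mean()
    return w, b

# One pass of stochastic gradient descent: one weight update per example.
def sgd_epoch(X, y, w, b, alpha):
    for i in range(len(y)):
        err = X[i] @ w + b - y[i]
        w = w - alpha * err * X[i]
        b = b - alpha * err
    return w, b
```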
You may be thinking about this in a little more detail than is really necessary.
"Update" just means that a new value is stored in some variable. It's not really a "term of art" in the machine learning world.
I don’t think this is really a critical concept at this point. There are lots of lectures left in this course, perhaps the topic will become clearer for you later.
I feel that maybe your learning is getting bogged down over a minor issue of terminology.