Dear AI experts:

I passed the assignment. But I have a question about the gradient descent part of the assignment code.

In the assignment, the code (pasted below without answers) seems to perform one gradient descent step and update W1 and W2 for each batch of data x, y. As new batches of data come in, the process is repeated, driven by the for loop: `for x, y in get_batches(data, word2Ind, V, C, batch_size):`.

Suppose that, across the entire training data, the beginning is very different from the end in content, format, syntax, etc. For example, the beginning is all novels and the end is all poems. Would W1 and W2 then drift along with the data as it changes, without truly converging? My previous understanding of training/convergence, if I recall correctly, is that you do a round of gradient descent over the entire dataset, then repeat it again and again until you reach a true global minimum. Am I misunderstanding the process? Can you elaborate?

```
for x, y in get_batches(data, word2Ind, V, C, batch_size):
    ### START CODE HERE (Replace instances of 'None' with your own code) ###
    # get z and h
    z, h =
    # get yhat
    yhat =
    # get cost
    cost =
    if (iters + 1) % 10 == 0:
        print(f"iters: {iters + 1} cost: {cost:.6f}")
    # get gradients
    grad_W1, grad_W2, grad_b1, grad_b2 =
    # update weights and biases
    W1 =
    W2 =
    b1 =
    b2 =
    ### END CODE HERE ###
    iters += 1
    if iters == num_iters:
        break
    if iters % 100 == 0:
        alpha *= 0.66
```
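
To make my question concrete, here is a minimal sketch of the two schemes I am contrasting. It is not the assignment's model: it is a toy one-parameter regression that I made up, where the first half of the data follows slope 1.0 (standing in for "novels") and the second half follows slope 3.0 (standing in for "poems"). All names here (`make_data`, `grad`, `full_batch_gd`, `minibatch_sgd`) are hypothetical, and the learning-rate decay from the assignment loop is left out.

```python
import random


def make_data(n_per_half=200, seed=0):
    """Toy dataset: y = 1.0 * x for the first half, y = 3.0 * x for the second."""
    rng = random.Random(seed)
    data = []
    for i in range(2 * n_per_half):
        x = rng.uniform(-1.0, 1.0)
        slope = 1.0 if i < n_per_half else 3.0
        data.append((x, slope * x))
    return data


def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y) ** 2 over the batch, w.r.t. w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def full_batch_gd(data, alpha=1.0, steps=200):
    """My earlier mental model: every update uses the entire dataset."""
    w = 0.0
    for _ in range(steps):
        w -= alpha * grad(w, data)
    return w


def minibatch_sgd(data, alpha=1.0, batch_size=20, epochs=1, shuffle=False, seed=0):
    """What the assignment loop appears to do: one update per batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's ordering is untouched
    w = 0.0
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(data)  # mix "novels" and "poems" within each pass
        for start in range(0, len(data), batch_size):
            w -= alpha * grad(w, data[start:start + batch_size])
    return w


if __name__ == "__main__":
    data = make_data()
    print("full batch:          ", full_batch_gd(data))
    print("mini-batch, in order:", minibatch_sgd(data, shuffle=False))
    print("mini-batch, shuffled:", minibatch_sgd(data, epochs=5, shuffle=True))
```

On this toy problem I would expect the in-order mini-batch run to end close to the slope of whichever segment it saw last, while the full-batch run settles on a compromise between the two slopes. What I am unsure about is whether the same drift can happen in the assignment, or whether something like shuffling or multiple passes over the data prevents it.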