FYI, if you’re going to post code examples on the forum, please remember to enclose them in the “preformatted text” tag. That will preserve the indentation and make the code more readable.
Just a reminder for clarity - you should not post your code for the assignments, but posting examples (or error messages) is fine.
Yes, it is an interesting point that they don’t really say much about in the assignment. I guess there is so much else going on that is new that they don’t make a point of this. But it is paired with this function that we can see them use to “smooth” the loss:
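Here is a sketch of what that helper looks like, from memory, so treat the exact name and wrapper as my recollection rather than the definitive notebook code; the key line is the EWA update with a 0.001 weight on the newest loss value:

```python
def smooth(loss, cur_loss):
    # EWA update: keep 99.9% of the running loss, mix in 0.1% of the new value
    return loss * 0.999 + cur_loss * 0.001
```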
So what that says is that they are doing an EWA (exponentially weighted average) of the loss. They just make the brief comment that this smoothing helps convergence and don’t explain any more than that, so I guess we just have to take their word for it. Well, this is an experimental science, so you could initialize the loss to 0, remove the smooth call, and see what happens. And to complete the experiment, try smoothing with the 0 initialization as well and see which of the three strategies gives you the best convergence (see the toy sketch below for what the reported curves look like at startup).
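Just to visualize the startup behavior, here is a self-contained toy sketch. The caveats: the loss values are synthetic (the real ones would come from the training loop in the notebook), and smooth is my recollection of the helper, so this only illustrates how the reported curve behaves under the three variants, not the actual training dynamics:

```python
import numpy as np

# Synthetic, slowly decaying "loss" stream with noise, standing in for the
# per-iteration loss the notebook's training loop would produce
rng = np.random.default_rng(0)
noisy_loss = np.linspace(23.0, 10.0, 2000) + rng.normal(0.0, 2.0, 2000)

def smooth(loss, cur_loss):
    # Same EWA update as the helper above
    return loss * 0.999 + cur_loss * 0.001

def reported_curve(initial_loss, use_smoothing):
    # Return the sequence of values that would get printed/plotted
    loss, history = initial_loss, []
    for cur in noisy_loss:
        loss = smooth(loss, cur) if use_smoothing else cur
        history.append(loss)
    return history

curve_a = reported_curve(23.07, True)   # initialize with the formula, then smooth
curve_b = reported_curve(0.0, False)    # initialize to 0, drop the smooth call
curve_c = reported_curve(0.0, True)     # initialize to 0, keep the smoothing

# After 100 iterations, (a) sits near the true loss, (b) is the raw noisy
# loss, and (c) is still climbing up from 0 because 0.001 adapts so slowly.
print(curve_a[100], curve_b[100], curve_c[100])
```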
You can see the values of vocab_size and dino_names among the default arguments in the definition of model, so it would be easy to compute the initial loss by hand, but I just added a print and ran that cell:
initial loss 23.070858062030304
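That number is consistent with the initial loss being log(vocab_size) per character times dino_names characters, i.e. the cross-entropy you would get from a uniform prediction over the vocabulary. Here is a quick check, where vocab_size = 27 and dino_names = 7 are the default values I recall from the model declaration (confirm them in your notebook):

```python
import numpy as np

vocab_size, dino_names = 27, 7   # the defaults I recall; check your notebook
print(-np.log(1.0 / vocab_size) * dino_names)   # -> 23.070858062030304
```

So the initialization looks like a sensible “expected loss of a completely untrained model” kind of value, which gives the EWA a realistic starting point.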
Remember when Prof Ng explained EWAs back in DLS Course 2, he mentioned that you could either start the sequence at 0 and just let it gradually stabilize, or do bias correction to help compensate for the startup, or initialize it to a sensible value, but I don’t have anything in my notes about typical initialization strategies. Notice that the weight on the new loss value is pretty small (0.001, which is 1 − β for β = 0.999 in the notation from the lectures), so not doing initialization would mean it takes quite a while for the average to stabilize to a realistic value. How they came up with that specific initialization formula I don’t know. We’d have to go back and review the lectures on EWAs in DLS Course 2.
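For reference, my recollection of the update and bias correction formulas from those lectures, with β as the weight on the previous value (so β = 0.999 here):

$$
v_t = \beta\, v_{t-1} + (1 - \beta)\,\theta_t, \qquad v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t}
$$

The rule of thumb from the lectures is that this averages over roughly the last 1/(1 − β) samples, i.e. about 1000 iterations here, which is why starting from 0 without bias correction or initialization would look unrealistically low for quite a while.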