This is just my interpretation, which is probably worth exactly what you paid for it, but I’d say the point is not that the gradient needs to be “propagated” from t = 10 back to t = 2; it’s that gradients get generated by the errors at every time step, right? And then we apply them (as you say) to the one shared set of weights. Of course, as we discussed very recently on that other thread, the manner in which we actually apply the gradients is arguably a bit sloppy. But it seems to work. “Close enough for jazz”, apparently …
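Here’s a minimal sketch of what I mean, assuming PyTorch (the cell sizes, the readout layer, and the per-step targets are all made up for illustration): every timestep contributes a loss, and a single backward pass leaves the accumulated gradient from all of those steps sitting on the one shared weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: one shared RNN cell unrolled over T timesteps (hypothetical sizes).
rnn_cell = nn.RNNCell(input_size=3, hidden_size=5)
readout = nn.Linear(5, 1)
loss_fn = nn.MSELoss()

T = 10
xs = torch.randn(T, 1, 3)        # an input at every timestep
targets = torch.randn(T, 1, 1)   # a target (hence an error) at every timestep

h = torch.zeros(1, 5)
total_loss = 0.0
for t in range(T):
    h = rnn_cell(xs[t], h)                          # same cell, same weights, every step
    total_loss = total_loss + loss_fn(readout(h), targets[t])

total_loss.backward()

# The single shared weight matrix has gradient contributions from the
# errors at all T timesteps, not from one distant step "propagated back".
print(rnn_cell.weight_hh.grad.shape)   # torch.Size([5, 5])
```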
Coordinating state between two distant timesteps is exactly what the LSTM is specifically designed to facilitate. And of course the weights for the various LSTM “gates” are included in what we are updating.
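To make that last point concrete (again assuming PyTorch, and a made-up 3-in / 5-hidden layer): the gate weights aren’t separate from the trainable parameters, they *are* the trainable parameters, stacked four gates deep in each weight tensor, so any optimizer step over `lstm.parameters()` updates all of the gates at once.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=5)

# weight_ih_l0 / weight_hh_l0 stack the input, forget, cell, and output
# gate weights (4 * hidden_size rows each), so a single optimizer step
# over lstm.parameters() updates every gate's weights together.
for name, p in lstm.named_parameters():
    print(name, tuple(p.shape))
# weight_ih_l0 (20, 3)
# weight_hh_l0 (20, 5)
# bias_ih_l0 (20,)
# bias_hh_l0 (20,)
```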