The reason is the way RNNs work: the same cell with the same coefficients is used at every timestep. Each timestep contributes to the output and thus produces its own gradients during backpropagation, and the way you take that into account is by summing the gradients across the timesteps. Of course you're also averaging them over the training samples in the batch, although with Stochastic Gradient Descent that batch may be small or even a single sample. There is some subtlety here: because of the Chain Rule, the gradient flowing through a given timestep already includes the contributions from all the later timesteps. Should we apply the updates only one timestep at a time and then recompute? That would add another loop to the whole process and be very inefficient, so instead we just sum them all within a given iteration. Gradient Descent is a statistical approximation method in any case, and it evidently works well enough this way.
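To make the "sum the gradients at each timestep" point concrete, here is a minimal sketch (purely illustrative, with made-up names like `Wh` and a toy loss of `0.5 * ||h_T||^2`) of backpropagation through time for a tiny tanh RNN. Notice that the shared weight matrix gets a `+=` at every timestep of the backward loop:

```python
import numpy as np

# Hypothetical toy setup: T timesteps, hidden size H, one shared
# recurrent weight matrix Wh reused at every step (the defining
# property of an RNN cell).
rng = np.random.default_rng(0)
T, H = 5, 4
Wh = rng.normal(scale=0.1, size=(H, H))
xs = rng.normal(size=(T, H))     # pretend inputs already projected to size H

# Forward pass: h_t = tanh(Wh @ h_{t-1} + x_t)
hs = [np.zeros(H)]
for t in range(T):
    hs.append(np.tanh(Wh @ hs[-1] + xs[t]))

# Backward pass for the toy loss L = 0.5 * ||h_T||^2, so dL/dh_T = h_T.
dWh = np.zeros_like(Wh)
dh = hs[-1].copy()
for t in reversed(range(T)):
    dz = dh * (1.0 - hs[t + 1] ** 2)   # backprop through tanh
    dWh += np.outer(dz, hs[t])         # key line: ADD this timestep's gradient
    dh = Wh.T @ dz                     # chain rule: pass gradient to step t-1
```

Because `dh` is carried backward through the loop, the term added at step `t` already folds in the effect of all later timesteps via the Chain Rule, exactly as described above; the single accumulated `dWh` is then used for one ordinary gradient-descent update.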

Here’s another thread from a while back that discusses this same point in some detail. Please start with the linked post and read forward through the thread.