This thread might also be interesting for you, since it traces the path from RNNs to the transformer architecture and refers to highly relevant and popular papers:
Best regards, Christian