The Transformer paper ("Attention Is All You Need") says the encoder is composed of a stack of N = 6 identical layers, and the decoder is likewise composed of a stack of N = 6 identical layers.
My question: taking N = 2 as an example, how are the two encoder layers connected to each other?
In the encoder's case, for example: does the output of the 1st layer feed into the multi-head attention of the 2nd layer, and does the output of the 2nd layer (i.e. the output of its feed-forward network) then supply the keys and values to the first layer of the decoder? Or are the stacked layers arranged in parallel?
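To make the question concrete, here is a minimal PyTorch sketch of the sequential wiring I have in mind; the hyperparameters, tensor shapes, and variable names are placeholders of my own, not taken from the paper:

```python
import torch
import torch.nn as nn

d_model, nhead, N = 512, 8, 2  # placeholder sizes; the paper uses N = 6
encoder_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True) for _ in range(N)
)

x = torch.randn(1, 10, d_model)  # (batch, src_len, d_model)
for layer in encoder_layers:
    x = layer(x)  # layer 1's output becomes the input to layer 2's multi-head attention
memory = x  # final encoder output -- is this what the decoder attends to?
```

Is this sequential chaining what the paper means by "a stack of N identical layers"?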
Similarly, for the decoder: are the N layers arranged sequentially, so that the output of the 1st layer goes into the (masked) multi-head attention of the 2nd layer?
Also, regarding the decoder's 2nd multi-head attention block (the encoder-decoder attention): from which encoder layer does it receive its keys and values? For instance, does the 1st decoder layer receive its keys and values from the 1st encoder layer, or from the last encoder layer? And likewise, from which encoder layer does the 2nd decoder layer's 2nd multi-head attention block receive its keys and values?
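Here is the interpretation I want to confirm, again as a rough PyTorch sketch (names and shapes are mine, not the paper's): every decoder layer's encoder-decoder attention reads its keys and values from the same tensor, namely the last encoder layer's output, rather than decoder layer i being paired with encoder layer i:

```python
import torch
import torch.nn as nn

d_model, nhead, N = 512, 8, 2  # placeholder sizes
decoder_layers = nn.ModuleList(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True) for _ in range(N)
)

memory = torch.randn(1, 10, d_model)  # stand-in for the final encoder output
y = torch.randn(1, 7, d_model)        # decoder-side input (batch, tgt_len, d_model)
for layer in decoder_layers:
    # Assumption I want to verify: every decoder layer takes its keys/values
    # from the SAME `memory` (the last encoder layer's output).
    y = layer(y, memory)
```

Is this the correct wiring, or is there some per-layer pairing between encoder and decoder layers?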