In the encoder block, the residual connection passes the input X directly to the Add & Norm layer, skipping the multi-head attention layer. If so, why is the output of the multi-head attention layer also passed as input to the Add & Norm layer?
As its name suggests, Add & Norm first adds the input X to the multi-head attention output and then applies layer normalization to the sum, i.e. it computes LayerNorm(X + MultiHeadAttention(X)). The residual connection does not replace the attention output; both tensors are needed, because the sublayer learns a residual on top of X and the identity path keeps gradients flowing through deep stacks. Layer normalization works similarly to batch normalization, but it recenters and rescales across the feature dimension of each token rather than across the batch.
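As a concrete illustration, here is a minimal sketch of an Add & Norm sublayer in PyTorch. The class name `AddNorm`, the dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not taken from any particular implementation:

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer_out)."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        # Both the original input x and the sublayer output are required:
        # the "Add" step sums them before normalization is applied.
        return self.norm(x + self.dropout(sublayer_out))

# Usage inside one encoder sublayer
d_model, num_heads = 512, 8
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
add_norm = AddNorm(d_model)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
attn_out, _ = attn(x, x, x)       # self-attention over x
y = add_norm(x, attn_out)         # LayerNorm(x + attention(x))
print(y.shape)                    # torch.Size([2, 10, 512])
```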
Batch normalization is usually less effective than layer normalization in natural language processing tasks, because the inputs are often variable-length sequences: batch statistics would then mix real tokens with padding across examples, whereas layer normalization depends only on each token's own feature vector.
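To make the difference in normalization axes concrete, here is a small sketch (the tensor shapes are illustrative assumptions). Layer norm computes its statistics over the feature dimension of each token independently, while batch norm's statistics are computed across the batch and sequence positions per feature:

```python
import torch

x = torch.randn(4, 7, 512)   # (batch, seq_len, d_model); sequences assumed padded to length 7

# Layer norm: statistics over the feature dimension, computed per token.
ln_mean = x.mean(dim=-1, keepdim=True)                  # shape (4, 7, 1)
ln_std = x.std(dim=-1, keepdim=True, unbiased=False)
x_ln = (x - ln_mean) / (ln_std + 1e-5)

# Batch norm: statistics over batch and sequence positions, computed per feature.
bn_mean = x.mean(dim=(0, 1), keepdim=True)              # shape (1, 1, 512)
bn_std = x.std(dim=(0, 1), keepdim=True, unbiased=False)
x_bn = (x - bn_mean) / (bn_std + 1e-5)

# Layer-norm statistics ignore other examples entirely; batch-norm statistics
# mix examples (and padding positions) of possibly different lengths.
print(x_ln.shape, x_bn.shape)
```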