Road to improving DALLE-like text-to-image generation

Hi, everyone!

Now that DALLE-2 has been presented, I would like to share a great article evaluating different transformer (Vaswani et al.) architecture variants for text-to-image generation with DALLE-mini.

Some of the key insights of this work are:

  • In Pre-LN architectures, the model does not converge unless a final Layer Normalization (Ba, Kiros & Hinton) is added at the end of the decoder.
  • Using bias in dense layers is not recommended: it adds about 15% to per-step training time and hurts convergence.
  • GLU variants are consistently beneficial, even though they require extra memory for the same number of parameters.
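To make the three points above concrete, here is a minimal NumPy sketch (my own illustration, not the article's code) of a Pre-LN feed-forward block that uses a GLU variant (GEGLU), drops all bias terms, and applies a final Layer Normalization after the residual stream. All function and weight names (`geglu_ffn`, `w_gate`, etc.) are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer Normalization (Ba, Kiros & Hinton): normalize over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_value, w_out):
    # GLU variant (GEGLU): elementwise product of a gated branch and a value
    # branch. Note there is no bias anywhere, and the two input projections
    # mean extra activation memory for the same parameter budget.
    return (gelu(x @ w_gate) * (x @ w_value)) @ w_out

def pre_ln_ffn_block(x, w_gate, w_value, w_out):
    # Pre-LN residual block: LayerNorm is applied *before* the sublayer,
    # so the residual stream itself is never normalized inside the block.
    return x + geglu_ffn(layer_norm(x), w_gate, w_value, w_out)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
w_gate = rng.normal(0, 0.02, (d_model, d_ff))
w_value = rng.normal(0, 0.02, (d_model, d_ff))
w_out = rng.normal(0, 0.02, (d_ff, d_model))

x = rng.normal(size=(2, 4, d_model))  # (batch, seq, features)
h = pre_ln_ffn_block(x, w_gate, w_value, w_out)
# The final LayerNorm after the last decoder block: the article reports that
# without it, Pre-LN models fail to converge.
out = layer_norm(h)
print(out.shape)
```

The final `layer_norm` call is the crux of the first bullet: in a Pre-LN stack the residual stream reaches the output head unnormalized, so this one extra normalization at the decoder's end is what keeps training stable.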

I hope you find it useful. Please share your thoughts!