How do today's real, big models work?

Super great course! Thanks a lot.

I'm trying to build a modern diffusion model, and I have some questions:
1/ What problems can bigger images (500×500) cause? (Slower training, of course, and what else?)

2/ What network do 2023 models (DALL·E, Midjourney…) use instead of a U-Net? I guess transformers. What type? Where in the architecture would you insert the context and time embeddings?
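To make question 2 concrete, here is a toy sketch (pure NumPy, made-up dimensions, single head, no training) of what I imagine: the timestep embedding is added to the image tokens, and the caption is injected through a cross-attention step where queries come from image tokens and keys/values from caption tokens. This is just my guess at the mechanism, not a claim about what DALL·E or Midjourney actually do.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16       # model width (made-up)
n_img = 8    # image tokens (patches)
n_txt = 4    # caption tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_time_embedding(t, dim):
    # standard sinusoidal embedding of the diffusion timestep t
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

# made-up inputs: image-patch tokens and caption-token embeddings
img_tokens = rng.normal(size=(n_img, d))
txt_tokens = rng.normal(size=(n_txt, d))

# 1) time conditioning: add the timestep embedding to every image token
t_emb = sinusoidal_time_embedding(t=37, dim=d)
h = img_tokens + t_emb

# 2) text conditioning: cross-attention; queries from image tokens,
#    keys/values from the caption tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = h @ Wq, txt_tokens @ Wk, txt_tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_img, n_txt) attention weights
h = h + attn @ V                      # residual connection

print(h.shape)  # (8, 16)
```

So my real question is whether this placement (time added to tokens, text via cross-attention inside each block) is roughly what production models do.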

3/ For contextual embeddings with captions like “a potato with big wheels and painted in red”, you need to embed the caption into one vector. I guess the embedding used is a pretrained, static one. Which one is it? And how do you embed a full sentence or text (more difficult than embedding a single word)?
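To show what I mean by "one vector for a full sentence", the simplest toy version I can think of is averaging static per-word vectors (a word2vec/GloVe-style table, here faked with random vectors). I assume real systems instead use a trained text encoder (e.g. a CLIP-style transformer), which is part of what I'm asking.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # embedding width (made-up)

# stand-in for a pretrained static word-embedding table
vocab = ["a", "potato", "with", "big", "wheels", "and", "painted", "in", "red"]
table = {w: rng.normal(size=dim) for w in vocab}

def embed_sentence(sentence):
    # tokenize naively, look up each word, mean-pool into ONE vector
    vecs = [table[w] for w in sentence.lower().split() if w in table]
    return np.mean(vecs, axis=0)

v = embed_sentence("a potato with big wheels and painted in red")
print(v.shape)  # (8,)
```

Is this mean-pooling idea at all close to how captions are condensed, or is the whole sequence of token embeddings kept and attended over instead?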