Preprocessing for LLM

How should I preprocess a dataset for LLM pretraining (from scratch)?
Let’s say the task is text generation or summarization (some seq2seq architecture): how should I preprocess the text? How should stop words be handled in these cases? Are there any special considerations?

Thanks in advance.

I’d advise you to take the NLP with Classification and Vector Spaces course. If you go on to complete the specialization, you’ll be well grounded in all the basics.