How should I preprocess a dataset for LLM pretraining (from scratch)?
Let’s say the task is text generation or summarization (with some seq2seq architecture). How should I preprocess the text? How should stop words be handled in these cases? Are there any special considerations?
Thanks in advance.