Preprocessing for LLM

Yaniel_de_Sousa_Barb · November 26, 2023, 8:49am

How should I preprocess the dataset for LLM pretraining (from scratch)?
Let’s say the task is text generation or summarization (some seq2seq architecture): how should I preprocess the text? How should stop words be handled in these cases? Are there any special considerations?

Thanks in advance.

lukmanaj · November 26, 2023, 9:40am

I’d advise you to take the NLP with Classification and Vector Spaces course. If you go on to complete the specialization, you’ll be well grounded in all the basics.
