Is pre-training the unsupervised training of an LLM?

I got this feedback on the week 2 quiz.

Could you please explain what it means? Does standard pretraining use only input data, with no output data? Can you give examples of the kind of data this question refers to?

Fill in the blanks: __________ involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights. This is different from _________ where you provide prompt-completion examples during inference.

Hi @someone555777 !

Instruction fine-tuning involves training a language model by providing numerous examples of instructions and their corresponding desired outputs. This process involves:

  1. Dataset Preparation: Build a dataset of pairs, each consisting of an instruction (the input) and the expected output. In this example, the passage is the input and its summary is the label:

Input: “The quick brown fox jumps over the lazy dog. The dog, feeling rather lethargic, simply lay there watching the fox’s playful antics.”
Label: “A lazy dog watches a playful fox.”

  2. Training: The model is fine-tuned on this dataset, adjusting its weights to minimize the difference between its predictions and the expected outputs. The fine-tuning process typically requires labeled data and computational resources but results in a model that is better aligned with the specific tasks and instructions it has been trained on. The model’s parameters are permanently adjusted based on the fine-tuning data.
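To make "adjusting its weights" concrete, here is a minimal sketch in plain Python. It is a toy, not how an LLM is actually implemented: the "model" is assumed to be just one weight (logit) per word in a four-word vocabulary, and the names `softmax`, `cross_entropy`, and `sgd_step` are made up for this illustration. It only shows that gradient steps on a labeled example lower the loss and that the changed weights persist afterwards.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Loss is small when the target word gets high probability."""
    return -math.log(softmax(logits)[target_index])

def sgd_step(logits, target_index, lr=0.5):
    """One gradient step; d(loss)/d(logit_i) = prob_i - one_hot_i."""
    probs = softmax(logits)
    return [
        w - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (w, p) in enumerate(zip(logits, probs))
    ]

# Toy vocabulary and toy "model weights" (assumptions for illustration).
vocab = ["fox", "dog", "lazy", "playful"]
weights = [0.0, 0.0, 0.0, 0.0]
target = vocab.index("lazy")  # the label supplied by the dataset

loss_before = cross_entropy(weights, target)
for _ in range(10):
    weights = sgd_step(weights, target)
loss_after = cross_entropy(weights, target)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The updated `weights` list is what "permanently adjusted" refers to: it stays changed after training ends, unlike anything that happens at inference time.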

In-context learning refers to a model’s ability to understand and complete tasks by using examples provided within the input context without any additional training.

In-context learning leverages the model’s pre-trained knowledge and its capacity to recognize patterns in the provided context. It doesn’t require additional training and the model’s weights stay unchanged; instead, the model adapts dynamically to the examples given in the prompt.
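As a rough illustration of in-context learning, the sketch below only builds a prompt string that contains worked examples followed by a new input. The function name `build_few_shot_prompt` and the "Text:/Summary:" template are assumptions chosen for this example, not a format required by any particular model. No training happens here; the examples live entirely in the prompt.

```python
def build_few_shot_prompt(examples, query, instruction="Summarize the text."):
    """Assemble an instruction, solved examples, and one unsolved query."""
    parts = [instruction, ""]
    for text, summary in examples:
        parts.append(f"Text: {text}")
        parts.append(f"Summary: {summary}")
        parts.append("")
    # The final entry is left open for the model to complete.
    parts.append(f"Text: {query}")
    parts.append("Summary:")
    return "\n".join(parts)

examples = [
    (
        "The quick brown fox jumps over the lazy dog. The dog, feeling "
        "rather lethargic, simply lay there watching the fox's playful antics.",
        "A lazy dog watches a playful fox.",
    ),
]
query = (
    "Artificial Intelligence is transforming various industries, leading "
    "to advancements in automation and data analysis."
)

prompt = build_few_shot_prompt(examples, query)
print(prompt)
```

The resulting string would be sent to the model at inference time; the model's parameters are the same before and after.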

Feel free to reach out if you encounter any further issues. Happy learning!

To better understand how Large Language Models are constructed, you might want to look into the concept of word embeddings: what they are, where they come from, and how they are used in Transformer models and LLMs. There is material on this in several DeepLearning courses and Specializations, or you can read about it, for example, here:

Feel free to share your discoveries.

There is a useful summary of why pre-training doesn’t fit into either part of that quiz question here, along with a contrast of the pre-training and fine-tuning tasks…

So why is it unsupervised if we are talking about fine-tuning, and what does the training data look like then? As I remember, in unsupervised learning we don’t know what output to expect.

Pretraining is unsupervised and instruction fine-tuning is actually supervised.

Pretraining refers to training a model on a large corpus of data without using any labeled information. The training data comprises large-scale, unlabeled text data from sources like books, articles, websites, and more.

For instance:
The quick brown fox jumps over the lazy dog. The dog, feeling rather lethargic, simply lay there watching the fox’s playful antics.

Instruction fine-tuning datasets consist of labeled examples with explicit instructions and corresponding responses. These datasets are used to train the model to perform specific tasks based on the given instructions.

For instance:
Input: “The quick brown fox jumps over the lazy dog. The dog, feeling rather lethargic, simply lay there watching the fox’s playful antics.”
Summary: “A lazy dog watches a playful fox.”

The summary is the label of this input.
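One way to picture how such a labeled pair becomes training data is sketched below. It assumes whitespace splitting in place of a real tokenizer, and it assumes the common (but not universal) choice of computing the loss only on the completion, so the prompt positions are marked with `None`. The name `build_training_example` is hypothetical.

```python
def build_training_example(prompt, completion):
    """Return (tokens, labels) for one prompt-completion pair.

    Whitespace splitting stands in for a real tokenizer (assumption).
    """
    prompt_tokens = prompt.split()
    completion_tokens = completion.split()
    tokens = prompt_tokens + completion_tokens
    # None marks tokens that are context only and are not used as targets.
    labels = [None] * len(prompt_tokens) + completion_tokens
    return tokens, labels

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
completion = "A lazy dog watches a playful fox."

tokens, labels = build_training_example(prompt, completion)
print(tokens)
print(labels)
```

The point of the sketch is that the targets come from a summary a person wrote, which is what makes this stage supervised.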

Feel free to reach out if you encounter any further issues. Happy learning!

Oh, OK. I’ve only now understood what you meant by unsupervised pretraining. But instruction pretraining is supervised, right?

Instruction fine-tuning is supervised. However, we generally do not implement “instruction pretraining”. The term implies a preliminary training phase on instructions, which is not part of the training process for most LLMs today.

As I understand it, both work through MLM or CLM training, so both just mask some tokens and expect the model to produce them as output. If I understand correctly, instruction fine-tuning is just a slightly more structured way of doing this. But the training method is the same for pretraining and instruction fine-tuning, isn’t it? So both should be either supervised or unsupervised.

So, I mean that for this example the input can be
The quick brown fox jumps over the lazy dog. The dog, feeling rather lethargic, simply lay there watching the fox’s playful

and the output
The quick brown fox jumps over the lazy dog. The dog, feeling rather lethargic, simply lay there watching the fox’s playful antics.

The label is “antics.”

Hi! MLM and CLM do use unsupervised learning, and after pretraining we can use instruction fine-tuning to improve performance.
As in the example above, the model predicts the next word in a sentence. We don’t need to provide labels because the next word in the text itself serves as the label. No human annotation is needed, which is why the process is called unsupervised (more precisely, self-supervised).
However, after this initial training, the model might not perform well on specific tasks like text summarization.

Solution: Instruction Fine-Tuning:

To improve performance on tasks like text summarization, we can use instruction fine-tuning. We create summaries for some texts ourselves and use these summaries as labels.
Fine-tuning the model on these examples adjusts some of its parameters so that it learns to generate better summaries.

Example:

Pretraining:
Sentence: “The quick brown fox jumps over the lazy dog.”
CLM needs to predict the next word after “the lazy.”
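The pretraining case can be sketched in a few lines: every (context, next word) pair is derived mechanically from the raw sentence, with no human writing any labels. Whitespace splitting is an assumption standing in for a real tokenizer, and `next_word_pairs` is a hypothetical helper name.

```python
def next_word_pairs(text):
    """Derive (context, next_word) training pairs from unlabeled text."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

sentence = "The quick brown fox jumps over the lazy dog."
pairs = next_word_pairs(sentence)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

The last pair has a context ending in "the lazy" and the target "dog.", which is the prediction described in the example above.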

Fine-Tuning for Summarization:
Text: “Artificial Intelligence is transforming various industries, leading to advancements in automation and data analysis.”
Our Summary: “AI is revolutionizing industries with automation and data analysis.”

We use the summary we wrote ourselves as the label and fine-tune the model to improve its summarization ability.
By fine-tuning the model on specific examples like this, we can enhance its performance on desired tasks such as text summarization.

but it is still the label

Honestly, after your answer, I still think that both methods are supervised. The only difference is that the output label in pretraining is a single word, while in fine-tuning it is a specific combination of words.