Why does the model start repeating the same sentences after some N number of token outputs?

I have seen this with some other smaller models as well, so I'm just trying to understand, in a general sense, why smaller models tend to generate repetitive outputs after a certain point.
Example:

2018 and the first year of my life, I was a little bit nervous about it but I'm so glad I did!
I have been working on this project for over 3 years now and I can honestly say that I have never had any problems with anything before. It is such a great feeling to be able to work with someone who has worked in the industry for many years and I know that they are very passionate about what they do.
I have always loved working with people and I think that is one of the reasons why I started working with them. They are amazing people and I really enjoy working with them because they are so supportive and understanding.
I would like to thank you all for your support and encouragement during this time. I hope that you will continue to learn from me and help me grow as a person.
I am so happy to hear that you are enjoying your new job! I look forward to seeing how you develop and grow as a person.
I am so excited to see what you have planned for us. We are going to be doing our first full-time job in January. I am so excited to get started and I am looking forward to getting started.
I am so happy to hear that you are enjoying your new job! I am so excited to start working with you guys!
I am so happy to hear that you are enjoying your new job! I am so excited to begin working with you guys! I am so excited to start working with you guys!
I am so happy to hear that you are enjoying your new job! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys!
I am so happy to hear that you are enjoying your new job! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys!
I am so happy to hear that you are enjoying your new job! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys! I am so excited to start working with you guys!
I am so happy to hear that you are enjoying your new

This is the code from the course that I used to generate the output above:

outputs = tiny_general_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=500,   # 128
    do_sample=False,      # greedy decoding; temperature has no effect when sampling is off
    temperature=0.0,
    repetition_penalty=1.1
)

In a general sense, this is due to the model’s limited capacity to capture complex language patterns and maintain long-term context, i.e., its limited understanding of human language.

From a more technical perspective, the model tends to assign higher probabilities to the same familiar sequences because it has learned a relatively narrow range of patterns; once a phrase appears in the context, the model keeps reinforcing and reproducing it.

Additionally, bear in mind that if the model was trained with max_new_tokens=128, it was optimized for that specific upper limit, and exceeding it may lead to suboptimal results.

In practice, you can also try setting a higher value for no_repeat_ngram_size to limit repetition of the same n-grams; see the sketch below.
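For example, here is a minimal sketch that reuses the tiny_general_model, inputs, and streamer objects from the snippet above, blocks repeated 3-grams, and switches from greedy decoding to sampling (the specific values for no_repeat_ngram_size, temperature, and top_p are only illustrative):

outputs = tiny_general_model.generate(
    **inputs,
    streamer=streamer,
    use_cache=True,
    max_new_tokens=500,
    no_repeat_ngram_size=3,   # never generate the same 3-gram twice
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,          # takes effect now that sampling is enabled
    top_p=0.9,
    repetition_penalty=1.1
)

With do_sample=True, each run can produce a different continuation, which usually breaks the deterministic loops that greedy decoding tends to fall into.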


Thanks for the clarification. This makes sense.
I thought that since LLMs are trained (pre-training) to predict the next word, and then fine-tuned/prompt-tuned to generate specific outputs based on that training, max_new_tokens should not be an issue.
But looks like that’s not the case.

Your thought is correct regarding pre-training, where models are trained to predict the next token. During fine-tuning, however, the model's weights are further updated based on the specific task and the selected training parameters.

If the length of the target output is constrained during fine-tuning (e.g., by setting a maximum target length), the loss is computed only over the tokens within this constraint. Tokens beyond the length limit are not considered, so the model isn’t penalized for incorrect or missing tokens outside that limit. However, this does not necessarily lead to less effective fine-tuning unless the task specifically requires longer outputs.

Limiting the output length can be useful if your goal is to generate shorter outputs, or if the dataset used for fine-tuning contains shorter target sequences.

(A clarification regarding the max_new_tokens parameter: in Hugging Face Transformers, output length during training is constrained using the tokenizer's max_length and truncation parameters. The max_new_tokens parameter applies only during inference with the generate() method. Therefore, a model would be fine-tuned with max_length=128, not max_new_tokens=128, as I inaccurately wrote above.)
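To make that distinction concrete, here is a minimal sketch; the gpt2 checkpoint, the example strings, and the 128-token limit are placeholders for illustration only:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# During fine-tuning, output length is constrained by the tokenizer:
# tokens beyond max_length are truncated and never contribute to the loss.
labels = tokenizer(
    ["An example target sequence used for fine-tuning."],
    max_length=128,
    truncation=True,
    return_tensors="pt",
)["input_ids"]

# At inference, max_new_tokens only caps how many tokens generate() produces.
inputs = tokenizer("A prompt to continue:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)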

Prompt tuning, on the other hand, does not update the weights of the base model at all; only the learned prompt embeddings are trained.
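For example, with the peft library (just one common way to do prompt tuning, not something specific to this thread), only a small set of virtual prompt embeddings is trainable while the base model stays frozen. A minimal sketch:

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,   # only these soft-prompt embeddings are trained
)
peft_model = get_peft_model(base_model, peft_config)

# All base-model parameters stay frozen; only the prompt embeddings are trainable.
peft_model.print_trainable_parameters()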
