How is it possible that the model response is the same every time?

I was going through the Lab 1 assignment. I noticed that the output from the model is the same every time, even if I run the cell multiple times. If I give any other chatbot (such as ChatGPT) the same prompt, the reply is slightly different almost every time.

For example, the response of MODEL GENERATION - WITHOUT PROMPT ENGINEERING:
is always “Person1: It’s ten to nine.”

How is it possible that the LLM will generate “Person1: It’s ten to nine.” every time?

1 Like

These are probabilistic outputs. Since that model's training data is less varied than ChatGPT's, and its set of plausible outputs is smaller, it will always assign the highest probability to the same output. Try changing the wording a little bit!

There might also be a temperature variable at play here (controlling the randomness of the outputs). If that is fixed, the probabilities of the outputs are not reshuffled, so the same output is picked every time.

3 Likes

I can show why I think this happens based on my exploration using a version of GPT-2 that I have running on my laptop. I don’t know how close this is to the implementation of the more advanced models, and if I did, I couldn’t tell you. But maybe this helps.

When you provide a text string to a GPT LLM, the first thing it does is tokenize and encode the input: convert from human-readable text to numbers. This is necessarily a deterministic step. For a given tokenizer and embedding, the same input string will be digitized the same way. Notice that the inverse step is deterministic as well: for a given numeric representation of a token, it decodes back to the same human-readable word every time. Imagine the havoc if it didn’t work this way. For example, using my local environment, the input string ‘What is Machine Learning?’ is tokenized and encoded as the tensor [2061, 318, 10850, 18252, 30]:
Encoded word: 2061 Decoded word: What
Encoded word: 318 Decoded word: is
Encoded word: 10850 Decoded word: Machine
Encoded word: 18252 Decoded word: Learning
Encoded word: 30 Decoded word: ?
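The round trip can be sketched with a toy vocabulary. This stand-in dict is hypothetical (the real GPT-2 tokenizer uses byte-pair encoding over a ~50k vocabulary), but the determinism argument is identical: encoding is a pure function, and decoding is its exact inverse.

```python
# Toy illustration (NOT the real GPT-2 BPE tokenizer): same input string
# always digitizes to the same ids, and the same ids always decode back
# to the same words.
vocab = {"What": 2061, "is": 318, "Machine": 10850, "Learning": 18252, "?": 30}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(words):
    return [vocab[w] for w in words]

def decode(ids):
    return [inv_vocab[i] for i in ids]

tokens = ["What", "is", "Machine", "Learning", "?"]
ids = encode(tokens)
print(ids)                    # [2061, 318, 10850, 18252, 30]
print(decode(ids) == tokens)  # True: exact round trip
```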

The next step is that the numeric representation of the input string is passed into the LLM model.generate() method, which in turn invokes the model.forward() method. This generates a vector of candidates for the next token. This also is a deterministic step, based on the training of the model and its embedding algorithm. The relationships between words are learned during training, but once the training is complete, the numbers are static until training is performed again. The relationship between the input string and the candidate next token(s) is often converted to probabilities. Here is the call to forward() that produces the probabilities.

    # call forward() to produce the next-token logits
    logits, _ = self.forward(token_indices_condition)

    # scale by the temperature setting
    logits = logits[:, -1, :] / temperature

    # apply softmax to convert logits to (normalized) probabilities
    probs = F.softmax(logits, dim=-1)

Here they are for the input string given above:

GPT.forward() input string: “What is Machine Learning?”

top 10 token candidates are:
probability: 93.59% word: \n
probability: 2.04% word: Machine
probability: 0.93% word: It
probability: 0.70% word: How
probability: 0.62% word: What
probability: 0.54% word: \n\n
probability: 0.52% word: Why
probability: 0.40% word: Well
probability: 0.36% word: The
probability: 0.31% word: A
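A listing like this can be produced from any probability vector. Here is a self-contained sketch with made-up logits and stdlib only (no real model or tokenizer), just to show the mechanics of ranking the candidates:

```python
import heapq
import math
import random

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(1000)]  # made-up next-token logits

# softmax with the usual max-subtraction for numerical stability
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# the 10 most probable candidate token ids, like the listing above
top10 = heapq.nlargest(10, range(len(probs)), key=lambda i: probs[i])
for i in top10:
    print(f"probability: {probs[i]:.2%} token id: {i}")
```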

One of the candidate tokens is appended to the input string and the process repeats, generating a new set of candidates. With my configuration, the top probability candidate, the newline, is appended in each of the first two steps. Here is the 7th step:

GPT.forward() input string: “What is Machine Learning?\n\nMachine learning is a”

top 10 token candidates are:
probability: 32.40% word: new
probability: 15.23% word: technique
probability: 8.68% word: process
probability: 7.28% word: way
probability: 6.37% word: type
probability: 6.32% word: set
probability: 6.28% word: field
probability: 6.22% word: powerful
probability: 5.79% word: term
probability: 5.43% word: method

This continues until the maximum length of the response string has been reached.
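The append-and-repeat loop can be sketched end to end. This uses an untrained stand-in for model.forward() (random weights over a tiny vocabulary, nothing like a real GPT), but the structure, and the repeatability of greedy selection, is the point:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Untrained stand-in for model.forward(): embedding + linear head over a
# tiny vocabulary, just to show the append-and-repeat structure.
vocab_size, emb_dim = 8, 4
embed = torch.nn.Embedding(vocab_size, emb_dim)
head = torch.nn.Linear(emb_dim, vocab_size)

def forward(ids):
    return head(embed(ids))  # (batch, seq, vocab_size) logits

def generate(ids, steps):
    for _ in range(steps):
        logits = forward(ids)[:, -1, :]                  # next-token logits only
        probs = F.softmax(logits, dim=-1)
        nxt = torch.argmax(probs, dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, nxt], dim=1)               # append and repeat
    return ids

prompt = torch.tensor([[2, 5]])  # a made-up "encoded prompt"
print(generate(prompt, 5))       # same ids every run: forward() is deterministic
```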

My observation is that the candidate vector will be the same every time a given string is used as input with the same tokenizer, embedding, trained model, and set of configuration parameters. That is, it is deterministic. If candidate selection uses the top_k branch shown below, the application takes the highest probability candidate, and since dividing the logits by the temperature mentioned by @gent.spah preserves their ordering, the result will be the same every time you rerun the app.

Note the division by temperature in the code fragment above. Raising the temperature (a larger denominator) flattens the probability distribution returned by softmax, so candidates other than the single highest-probability one keep meaningful probability mass; lowering it sharpens the distribution toward the top candidate. The other option is to use sampling. Here is how my implementation does this…
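A quick numeric sketch of this, with three made-up logits, shows the temperature reshaping the softmax output: low T concentrates mass on the top candidate, high T spreads it out.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                   # made-up logits for three candidates
for T in (0.5, 1.0, 2.0):
    print(f"T={T}:", [round(p, 3) for p in softmax(logits, T)])
# The top candidate's probability shrinks as T grows, so the other
# candidates become realistic draws under sampling.
```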

    if do_sample:
        token_index_next = torch.multinomial(probs, num_samples=1)
    else:
        _, token_index_next = torch.topk(probs, k=1, dim=-1)

From this I hope you can see that if the configuration is set up to draw a multinomial sample from the candidates, and particularly when the temperature is above about 0.6, the system can return different tokens on each run. But when the configuration is set to select the highest probability candidate, and nothing else is changed (tokenizer, embedding algorithm and training, vocabulary, etc.), then you can expect deterministic results.
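The difference between the two branches can be demonstrated without torch, with random.choices standing in for torch.multinomial (the candidate probabilities below are made up, loosely echoing the 93.59% top candidate shown earlier):

```python
import random

random.seed(1)
probs = [0.936, 0.020, 0.009, 0.007, 0.028]  # made-up candidate probabilities

# the topk/argmax branch: the same index every single call
greedy = max(range(len(probs)), key=lambda i: probs[i])

# the multinomial branch: indices drawn in proportion to probs,
# so reruns can return different candidates
samples = [random.choices(range(len(probs)), weights=probs)[0]
           for _ in range(1000)]

print("greedy pick:", greedy)
print("distinct sampled picks:", sorted(set(samples)))
```

Even with 93.6% of the mass on one candidate, sampling occasionally picks another; greedy selection never does.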

NOTE that this doesn’t mean correct or accurate results, it just means repeatable. The results will recapitulate the model’s training in a predictable way, even if that training produces incorrect or inaccurate output.

Here’s the doc on torch.multinomial

Hope this makes some sense. Welcome feedback and opportunity to improve it.

This is a very detailed and step-by-step explanation, thank you @ai_curious

1 Like

Last year I spent some cycles trying to understand exactly how this works. Wrote about it here: https://community.deeplearning.ai/t/can-anyone-help-me-understand-what-temperature-does-in-gpt

As usual, @rmwkwok and @paulinpaloalto were ahead of me on the math.

BTW, I am under the impression that the modern implementations use temperature a little differently than GPT-2. In my implementation, it appears in a denominator and thus cannot ever be zero. I don’t have code for the recent models, but understand that now temperature can be set to 0. The behavior is still that lowering the temperature towards 0 makes the output more deterministic.

1 Like

:grin:

Take Llama 3 as an example: it treats T = 0 as a special case and does not use the softmax formula, which means no 1/T at all. Instead, it becomes deterministic and applies argmax to get the most probable token.
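In spirit (this sketch is mine, not the actual Llama 3 code), the special-casing looks like this: the T == 0 branch never divides by T, it just takes the argmax.

```python
import math
import random

def next_token(logits, temperature, rng=random):
    """Pick a next-token index; T == 0 is special-cased to argmax (no 1/T)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # plain argmax
    m = max(logits)  # subtract max for numerical stability
    weights = [math.exp((z - m) / temperature) for z in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

print(next_token([0.2, 3.1, -1.0], temperature=0))  # always index 1
```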

Check this code out.

Don’t know exactly how ChatGPT deals with that.

Cheers,
Raymond

1 Like

But the choice of argmax makes sense because as T approaches 0, all probability mass goes to the maximum logit:
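Written out (assuming a unique maximum logit), the limit the argmax shortcut relies on is:

```latex
% softmax over logits z_i at temperature T
p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
       = \frac{e^{(z_i - z_{\max})/T}}{\sum_j e^{(z_j - z_{\max})/T}}

% as T -> 0+, e^{(z_i - z_max)/T} -> 0 for every i that is not the argmax, so
\lim_{T \to 0^+} p_i(T) =
\begin{cases}
  1 & \text{if } i = \arg\max_j z_j \\
  0 & \text{otherwise}
\end{cases}
```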

This means that one token gets a probability of 1 and all the others get 0, so there is no need to do any random drawing at all.

So, I think ChatGPT should do the same, too.

2 Likes

Thanks for adding these to the conversation. I notice that one layer down from the Llama code you linked above, sample_top_p calls torch.multinomial, which is consistent with the older GPT-2 style code I have. Also, the effect of low T on softmax you derived is consistent with my anecdotal evidence graphed in my linked thread about temperature. You can clearly see the shift towards a single top candidate well before T approaches 0.

Graphs linked here

I started learning GPT-2 using a TensorFlow and Keras implementation from François Chollet that I found on the web, but switched to PyTorch when I hit environment incoherence I couldn’t resolve in a reasonable time. Now that I have a working PyTorch environment, I should probably bring down the Llama code and tinker with it. Thanks for the impetus.

1 Like