I can show why I think this happens based on my exploration using a version of GPT-2 that I have running on my laptop. I don’t know how close this is to the implementation of the more advanced models, and if I did, I couldn’t tell you. But maybe this helps.
When you provide a text string to a GPT LLM, the first thing it does is tokenize and encode the input, converting it from human-readable text to numbers. This is necessarily a deterministic step: for a given tokenizer and embedding, the same input string is digitized the same way every time. The inverse is true as well: a given numeric token always decodes back to the same human-readable word. Imagine the havoc if it didn’t work this way. For example, using my local environment the input string ‘What is machine learning?’ is tokenized and encoded as the tensor [2061, 318, 10850, 18252, 30]
Encoded word: 2061 Decoded word: What
Encoded word: 318 Decoded word: is
Encoded word: 10850 Decoded word: Machine
Encoded word: 18252 Decoded word: Learning
Encoded word: 30 Decoded word: ?
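The round-trip property is easy to demonstrate. Here is a toy sketch (a hand-built five-entry vocabulary, not the real GPT-2 BPE tokenizer; the leading spaces in the token strings are my assumption about how GPT-2 marks word boundaries):

```python
# Toy vocabulary: the ids match my GPT-2 output above, but the real
# tokenizer learns its merges from data; this is just an illustration.
vocab = {"What": 2061, " is": 318, " Machine": 10850, " Learning": 18252, "?": 30}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map each token string to its integer id."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Invert the mapping: same id, same string, every time."""
    return "".join(id_to_token[i] for i in ids)

tokens = ["What", " is", " Machine", " Learning", "?"]
ids = encode(tokens)
assert ids == [2061, 318, 10850, 18252, 30]
assert decode(ids) == "What is Machine Learning?"
assert encode(tokens) == ids  # re-encoding gives identical ids
```

The point is just that encoding is a fixed lookup once the vocabulary is built, so it cannot be a source of run-to-run variation.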
Next, the numeric representation of the input string is passed to the LLM’s model.generate() method, which in turn invokes model.forward(). This produces a vector of candidates for the next token. This step is also deterministic, based on the training of the model and its embedding algorithm: the relationships between words are learned during training, and once training is complete, those numbers are static until the model is trained again. The scores relating the input string to each candidate next token are typically converted to probabilities. Here is the call to forward() that produces them.
# call forward() to produce logits (raw scores) for the next token
logits, _ = self.forward(token_indices_condition)
# keep only the last position and scale by the temperature setting
logits = logits[:, -1, :] / temperature
# apply softmax to convert logits to (normalized) probabilities
probs = F.softmax(logits, dim=-1)
Here they are for the input string given above:
GPT.forward() input string: “What is Machine Learning?”
top 10 token candidates are:
probability: 93.59% word: \n
probability: 2.04% word: Machine
probability: 0.93% word: It
probability: 0.70% word: How
probability: 0.62% word: What
probability: 0.54% word: \n\n
probability: 0.52% word: Why
probability: 0.40% word: Well
probability: 0.36% word: The
probability: 0.31% word: A
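The temperature division in the code fragment above reshapes exactly this kind of distribution, and since it’s plain arithmetic, its effect can be checked without a model. A minimal sketch with made-up logits (the values are arbitrary, not taken from GPT-2):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                    # arbitrary scores for three candidates
base  = softmax(logits)                      # temperature = 1
sharp = softmax([x / 0.5 for x in logits])   # temperature < 1: sharper
flat  = softmax([x / 2.0 for x in logits])   # temperature > 1: flatter

assert sharp[0] > base[0] > flat[0]
# The ranking (and so the argmax) is unchanged by any positive temperature,
# which is why greedy selection stays deterministic no matter the temperature.
assert sharp.index(max(sharp)) == flat.index(max(flat)) == 0
```

Temperature only matters once you sample from the distribution; it never changes which candidate is ranked first.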
One of the candidate tokens is appended to the input string and the process repeats, generating a new set of candidates. With my configuration, the top probability candidate, the newline, is appended in each of the first two steps. Here is the 7th step:
GPT.forward() input string: “What is Machine Learning?\n\nMachine learning is a”
top 10 token candidates are:
probability: 32.40% word: new
probability: 15.23% word: technique
probability: 8.68% word: process
probability: 7.28% word: way
probability: 6.37% word: type
probability: 6.32% word: set
probability: 6.28% word: field
probability: 6.22% word: powerful
probability: 5.79% word: term
probability: 5.43% word: method
This continues until the maximum length of the response string has been reached.
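The loop I’ve been describing can be sketched like this. Note that fake_forward is a made-up, deterministic stand-in for model.forward() so the example is self-contained; it is not the real model:

```python
# Toy autoregressive loop with greedy (highest-probability) selection.
def fake_forward(token_ids):
    # Deterministic stand-in for model.forward(): made-up logits that
    # depend only on the current sequence, as a trained model's would.
    return [float((token_ids[-1] * (i + 1)) % 7) for i in range(5)]

def generate(token_ids, max_new_tokens):
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = fake_forward(ids)
        # Greedy selection: take the single highest-scoring candidate.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        ids.append(next_id)  # append and repeat with the longer context
    return ids

out1 = generate([2061, 318, 10850, 18252, 30], max_new_tokens=5)
out2 = generate([2061, 318, 10850, 18252, 30], max_new_tokens=5)
assert out1 == out2  # greedy decoding: same input, same output, every run
```

Because every step is a pure function of the sequence so far, greedy decoding gives identical output on every run.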
My observation is that the candidate vector will be the same every time a given string is used as input with the same tokenizer, embedding, trained model, and set of configuration parameters. That is, it is deterministic. If the candidate selection flag is set to top_k (with k=1), the application takes the highest-probability candidate regardless of the temperature setting mentioned by @gent.spah (dividing by a positive temperature never changes which candidate ranks highest), and it will be the same every time you rerun the app.
Note the division by temperature in the code fragment above. A temperature below 1 (a small number in the denominator) sharpens the probability distribution returned by softmax, concentrating probability on the top candidate; a temperature above 1 flattens it, making it more likely that other candidates are chosen when sampling rather than always taking the single highest probability. Temperature alone doesn’t produce variation, though; for that you need the other option, sampling. Here is how my implementation does this…
if do_sample:
    # sample the next token from the full probability distribution
    token_index_next = torch.multinomial(probs, num_samples=1)
else:
    # greedy: always take the single highest-probability token
    _, token_index_next = torch.topk(probs, k=1, dim=-1)
From this I hope you can see that if the configuration is set to use the multinomial sampling of candidates, and particularly when the temperature is above roughly 0.6, the system can return different tokens from run to run. But when the configuration is set to select the highest-probability candidate, and nothing else is changed (tokenizer, embedding algorithm and training, vocabulary, etc.), then you can expect deterministic results.
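The same contrast can be sketched without torch, using Python’s random module as a stand-in for torch.multinomial (an analogue for illustration, not my actual implementation; the probabilities are made up):

```python
import random

probs = [0.9359, 0.0204, 0.0093, 0.0344]  # made-up distribution over 4 tokens
tokens = list(range(len(probs)))

def pick(do_sample, rng=random):
    if do_sample:
        # analogue of torch.multinomial: draw one index weighted by probs
        return rng.choices(tokens, weights=probs, k=1)[0]
    # analogue of torch.topk(probs, k=1): always the argmax
    return max(tokens, key=lambda i: probs[i])

# Greedy: identical result on every call.
assert all(pick(False) == 0 for _ in range(100))

# Sampling: usually the top token, but occasionally another one.
rng = random.Random(0)  # seeded only so this sketch is reproducible
draws = {pick(True, rng) for _ in range(1000)}
assert 0 in draws and len(draws) > 1  # more than one distinct token appears
```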
NOTE that this doesn’t mean correct or accurate results, it just means repeatable. The results will recapitulate the model’s training in a predictable way, even if that training produces incorrect or inaccurate output.
Here’s the doc on torch.multinomial
Hope this makes some sense. I welcome feedback and the opportunity to improve it.