Top-k limits the choices to the k most probable tokens, while top-p picks from the smallest set of top tokens whose cumulative probability exceeds p. Because that set grows or shrinks with the shape of the distribution, top-p is more dynamic and tends to give more creative yet still coherent, natural-sounding output.
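To make the difference concrete, here is a minimal sketch of both filters in plain Python. The function names and the toy distribution are made up for illustration and are not taken from any particular library; real implementations work on logit tensors and differ in details such as whether the threshold comparison is strict.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution, invented for this example.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}

top_k_result = top_k_filter(probs, k=2)    # always exactly 2 tokens
top_p_result = top_p_filter(probs, p=0.9)  # as many tokens as needed to reach 0.9
```

With this distribution top-k keeps two tokens regardless of how the probability mass is spread, while top-p keeps three because the first two only add up to 0.8.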
Yes, your understanding is correct @Shavvy, but how an LLM behaves under probabilistic sampling can also depend on other factors.
If temperature is not also set to 0, it rescales the logits (and therefore the token probabilities), in most implementations before top-k or top-p filtering occurs. A high temperature can “flatten” the distribution, making several tokens nearly equally likely.
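The flattening effect is easy to see with a small softmax. This is only a sketch: the logits are invented, and the helper name is hypothetical. Note that a temperature of exactly 0 cannot go through this formula (it would divide by zero), so it is usually handled as a plain argmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: top token dominates
flat = softmax_with_temperature(logits, 5.0)   # high temperature: tokens move closer together
```

At the higher temperature the gap between the most and least likely token shrinks, so more tokens are likely to survive a top-p cut.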
At extremely low temperature values, minute differences in how a GPU computes probabilities (floating-point rounding errors) can occasionally flip which token technically ranks first.
Even if the sampling range is narrow, the final pick can still be random. Here, for example, max_token uses random.randint() to pick from the final filtered set, so a different seed will give a different pick whenever more than one token remains in the pool, leading to different output.
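A small sketch of that seed effect, using only the standard library. The pick_token helper and the pool are hypothetical stand-ins for whatever the real code does after filtering.

```python
import random


def pick_token(pool, seed):
    """Pick one token from the filtered pool using a seeded random.randint()."""
    rng = random.Random(seed)  # private generator, so the seed fully decides the pick
    return pool[rng.randint(0, len(pool) - 1)]  # randint includes both endpoints


pool = ["cat", "dog", "fish"]  # hypothetical tokens left after top-k / top-p

same_seed = [pick_token(pool, seed=42) for _ in range(3)]   # repeatable
many_seeds = {pick_token(pool, seed=s) for s in range(50)}  # varies with the seed
single = {pick_token(["cat"], seed=s) for s in range(50)}   # one token left: no variation
```

The same seed always reproduces the same pick, different seeds can land on different tokens, and the seed stops mattering only when the pool has shrunk to a single token.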
Another interesting reason, as I mentioned earlier, is hardware dependence.
You can see differences in LLM output when moving between hardware setups, due to differences in CUDA kernels or numerical rounding.
Top-k requires sorting, or at least partially selecting from, the logit vector. On models with larger vocabularies (more than 100k tokens), this step can add noticeable overhead, and its efficiency depends heavily on optimized GPU kernels.
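As a CPU-side illustration of the selection step only (not how a GPU kernel does it), the sketch below compares a full sort with a partial selection via heapq.nlargest from the standard library. The vocabulary size and the random logits are made up for the example.

```python
import heapq
import random

rng = random.Random(0)
vocab_size = 100_000  # assumed size, in line with the "more than 100k tokens" case
logits = [rng.random() for _ in range(vocab_size)]  # hypothetical logits
k = 50

# Full sort: orders every one of the vocab_size entries, then keeps the first k.
by_full_sort = sorted(range(vocab_size), key=logits.__getitem__, reverse=True)[:k]

# Partial selection: tracks only the k best candidates while scanning the vector.
by_partial = heapq.nlargest(k, range(vocab_size), key=logits.__getitem__)
```

Both approaches return the same top-k token indices; the partial selection simply avoids ordering the tokens that will be discarded anyway.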
Different hardware, such as NVIDIA H100 vs. A100 vs. consumer GPUs, uses different tensor cores and supports different floating-point precisions (FP16, BF16, FP32). Small differences in the computed logits can cause the “top” tokens to shuffle slightly.
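The underlying cause can be reproduced on a CPU: floating-point addition is not associative, so the order in which a kernel adds the same numbers can change the last bit of the result. The sketch below uses ordinary Python floats and an invented two-token tie; it illustrates the mechanism and does not measure any specific GPU.

```python
def argmax(scores):
    """Return the index of the highest score (first index wins on a tie)."""
    return scores.index(max(scores))


# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

competitor = 0.6  # hypothetical score of a rival token

winner_a = argmax([competitor, left_to_right])  # summation order 1
winner_b = argmax([competitor, right_to_left])  # summation order 2
```

The two sums differ in the last bit, which is enough to change which token comes out on top. With greedy decoding, or a very low temperature, one flipped token can then change everything generated after it.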
That is probably why many researchers feel LLMs seem intelligent because of hallucination.