M4.4: Ungraded Lab - Not equal results when top_k and top_p are 0

In the lab “Exploring LLM Capabilities”, sections 3.2 and 3.3, the function is called with both top_k and top_p set to 0. The outputs of those calls should be the same, but they aren’t.

In the picture above you can see that the output of the second call is different:

Response: RAG (Retrieval Augmented Generation) is an AI technique

compared with the first call:

Response: RAG (Retrieval-Augmented Generation) is a technique

Is it expected?

Hi @Shavvy,

Top-k limits the choice to the k most probable tokens, while top-p picks from the smallest set of top tokens whose cumulative probability exceeds p. Because the size of that set adapts to the shape of the distribution, top-p tends to give more dynamic, creative, coherent, and natural-sounding output.
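To make the two filters concrete, here is a minimal sketch in plain Python. The function names and the toy probabilities are made up for illustration; this is not the lab's code, and it does not cover how a particular API interprets a value of 0 for either parameter.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize (toy version, k >= 1)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: some implementations use a strict ">"
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution, for illustration only.
probs = {"a": 0.5, "an": 0.3, "the": 0.15, "this": 0.05}

print(top_k_filter(probs, 2))    # always exactly two tokens
print(top_p_filter(probs, 0.9))  # as many tokens as needed to cover 0.9
```

With top-k the pool size is fixed, while with top-p it grows or shrinks with how peaked the distribution is. In both cases, if only one token survives the filter, there is nothing left to sample from.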

Both outputs are still almost the same.

But the explanation says “the same”, so is that a mistake?

top_p and top_k should act the same when they are both 0: they should pick the token with the highest probability, no?

Yes, your understanding is correct, @Shavvy, but LLM output is driven by probabilities and randomness, so it can depend on other factors.

If temperature is not also set to 0, it rescales the token scores, and therefore the probabilities, before top-k or top-p filtering occurs. A high temperature can “flatten” the distribution, making several tokens nearly equally likely.
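A small sketch of that flattening effect, assuming the usual softmax-with-temperature formulation (the logits below are invented for illustration):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.1, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability mass sits on the top token; at a high temperature the three tokens end up much closer together, so more than one of them can survive a top-p filter.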

At extremely low values, minute differences in how a GPU calculates probabilities (floating-point errors) can occasionally flip which token technically ranks first.
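The effect can be reproduced on a CPU with ordinary Python floats. This is only a toy illustration of the idea: the scores are invented, and the tie-breaking rule (first candidate wins) is an assumption of this sketch, not a description of any specific GPU kernel.

```python
def accumulate(terms):
    """Sum the terms strictly left to right, as one fixed reduction order would."""
    total = 0.0
    for t in terms:
        total += t
    return total


# Token "A" gets its score from a sum of three terms; token "B" has a fixed score.
terms = [0.1, 0.2, 0.3]
score_b = 0.6


def pick(order):
    score_a = accumulate(order)
    # max() returns the first of several equal maxima, so a tie goes to "B".
    return max([("B", score_b), ("A", score_a)], key=lambda kv: kv[1])[0]


print(pick(terms))        # terms added as 0.1 + 0.2 + 0.3
print(pick(terms[::-1]))  # same terms added as 0.3 + 0.2 + 0.1
```

The two calls add exactly the same numbers, but floating-point addition is not associative, so the two orders give slightly different sums and the "winning" token changes. Parallel reductions on a GPU may not use the same order on every run, which is one way near-ties can flip.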

If the sampling range is narrow and the final pick from the filtered set is random (like here, where max_token is using random.randint()), a different seed will result in a different pick whenever more than one token remains in the pool, leading to different output.
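A rough sketch of the seed effect, using Python's standard random module. The pool contents and the helper name are hypothetical and only loosely modelled on the two responses in the question; the lab's actual sampling code is not shown here.

```python
import random


def sample_token(filtered, seed):
    """Draw one token from the filtered pool with a seeded generator."""
    rng = random.Random(seed)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical pool in which two continuations are equally likely.
pool = {"a technique": 0.5, "an AI technique": 0.5}
# Hypothetical pool in which only one token survived the filtering.
single = {"a technique": 1.0}

for seed in range(5):
    print(seed, sample_token(pool, seed), "|", sample_token(single, seed))
```

The same seed always reproduces the same pick, and a pool with a single token gives the same answer for every seed. Only when two or more tokens remain can the seed change the output.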

Another interesting reason, as I mentioned earlier, is hardware dependencies:

  1. LLM output can differ when moving between hardware setups, due to differences in CUDA kernels or numerical rounding.

  2. Top-k requires sorting the logit vector. On models with a larger vocabulary (more than 100k tokens), this sorting adds significant overhead, and its efficiency depends heavily on optimized GPU kernels.

  3. Different hardware, such as NVIDIA H100 vs. A100 vs. consumer GPUs, uses different tensor cores and floating-point precisions (FP16, BF16, FP32). Small differences in the computed logits and probabilities can cause the “top” tokens to shuffle slightly.
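The precision point in item 3 can be illustrated without a GPU. The sketch below rounds two invented logits to FP32 and FP16 with Python's struct module; it only shows the rounding behaviour of the formats, not what any particular GPU does.

```python
import struct


def to_single(x):
    """Round a Python float to IEEE 754 single precision (FP32)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_half(x):
    """Round a Python float to IEEE 754 half precision (FP16)."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Two hypothetical logits that are very close together.
logit_a, logit_b = 10.001, 10.002

print("FP32:", to_single(logit_a), to_single(logit_b))
print("FP16:", to_half(logit_a), to_half(logit_b))
```

In FP32 the two logits stay distinct, so their order is well defined. In FP16 both round to the same value, so which token counts as "top" depends on tie-breaking and on tiny errors elsewhere in the computation.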

That’s probably why many researchers feel LLMs are intelligent because of hallucination :smirking_face::slightly_smiling_face:

Regards,

Dr. Deepti