Thanks for adding these to the conversation. I notice that one layer down from the llama code you linked above, sample_top_p calls torch.multinomial, which is consistent with the older GPT-2-style code I have. The effect of low T on softmax you derived also matches my anecdotal evidence, graphed in my linked thread about temperature: you can clearly see the shift toward a single top candidate well before T approaches 0.
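For anyone following along, here's a rough sketch of what that layer looks like: a llama-style sample_top_p helper that ends in torch.multinomial, plus a quick demo of how low temperature concentrates the softmax. This is my paraphrase of the pattern, not the verbatim library code, so treat names and details as illustrative.

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """Nucleus (top-p) sampling sketch, llama-style.

    probs: (batch, vocab) softmax probabilities; p: cumulative mass cutoff.
    """
    # Sort probabilities descending and accumulate their mass.
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # Zero out tokens once the mass *before* them already exceeds p,
    # then renormalize the surviving nucleus.
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    # The final draw is torch.multinomial, same as the GPT-2-era samplers.
    next_token = torch.multinomial(probs_sort, num_samples=1)
    # Map sorted positions back to vocabulary ids.
    return torch.gather(probs_idx, -1, next_token)

# Low T sharpens the softmax toward the top logit well before T -> 0:
logits = torch.tensor([2.0, 1.0, 0.5])
for T in (1.0, 0.5, 0.1):
    print(T, torch.softmax(logits / T, dim=-1))
```

At T = 0.1 the top candidate already carries essentially all of the probability mass, which is the single-candidate shift visible in the temperature graphs.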
I started learning GPT-2 with a TensorFlow/Keras implementation I found on the web from François Chollet, but switched to PyTorch when I ran into environment incompatibilities I couldn't resolve in a reasonable time. Now that I have a working PyTorch environment, I should probably pull down the llama code and tinker with it. Thanks for the impetus.