Hi,
I am learning about language models and have a question about how the algorithm picks the first word. Is it based only on the loss function? For the first position, couldn't several words have almost the same probability of being the first word?
If the LLM only needs to detect the first word, indexing should help. But since you asked about the probability of the words in a given sequence, you could use a transformer-like model where the focus needs to be on the first word in a given corpus/sequence.
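
To make the first-word question concrete, here is a minimal sketch in Python. It assumes a made-up toy corpus and hypothetical helper names (`first_word_distribution`, `sample_first_word`), and it only counts sentence-initial words, which is roughly the "indexing" idea. A real language model would learn this distribution during training rather than count it, typically by conditioning on a start-of-sequence token, so treat this as an illustration and not as how any particular LLM works.

```python
from collections import Counter
import random


def first_word_distribution(sentences):
    """Estimate P(first word) by counting how often each word starts a sentence."""
    counts = Counter(s.split()[0].lower() for s in sentences if s.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def sample_first_word(dist, rng):
    """Draw one first word in proportion to its estimated probability."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical toy corpus, made up for illustration only.
corpus = [
    "The cat sat on the mat",
    "The dog barked",
    "A bird sang",
    "A fish swam",
    "It rained all day",
]

dist = first_word_distribution(corpus)

# "the" and "a" each start 2 of the 5 sentences, so they tie.
most_likely = max(dist, key=dist.get)

# Sampling (instead of always taking the maximum) lets tied or
# near-tied words each be chosen some of the time.
rng = random.Random(0)
picked = sample_first_word(dist, rng)
```

In this toy example "the" and "a" end up with the same probability, which is the situation described in the question. As far as I understand it, the loss function only shapes these probabilities during training; at generation time the first word is chosen from the distribution, either by taking the most likely word or by sampling as above.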