After finishing this course, there’s one feat of these LLMs that is still a mystery to me. My question might be full of misconceptions, so please bear with me. When you chat with commercial LLMs, it’s quite obvious these models have somehow encoded very distinct facts. For example, when I ask ChatGPT about a very specific historical event that is most likely not mentioned on more than one particular Wikipedia page, it just knows the answer.
From my understanding of back-prop, a single training input (the sentence that contains this distinct piece of knowledge) won’t have much effect on the learned weights of the model, so I wouldn’t expect the resulting model to “guess” the answer from so few inputs. Am I accurate in this observation, and if yes, how is this explained?
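To make my intuition concrete, here is a toy sketch of the kind of single-example update I have in mind (a hypothetical tiny model trained on random token ids, nothing like a real LLM):

```python
# Toy sketch: how much does one SGD step on a single "fact" sentence move the weights?
# Hypothetical tiny model and random token ids; real LLMs have billions of parameters.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))

# Pretend this is the one sentence that carries the rare fact.
tokens = torch.randint(0, vocab_size, (1, 12))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

before = torch.cat([p.detach().flatten() for p in model.parameters()]).clone()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
logits = model(inputs)  # (1, 11, vocab_size) scores for the next token at each position
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()

after = torch.cat([p.detach().flatten() for p in model.parameters()])
print(f"relative weight change after one step: {(after - before).norm() / before.norm():.2e}")
```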
Undoubtedly that wiki page was included in the training set.
Then, when you give it a very specific prompt, the answer is simply the sequence of words its language model rates as most likely to follow.
It’s just a very large and complex word-based probability machine.
Isn’t choosing based on the highest probability just guessing? Anyway, my main question is how a single input sentence could be remembered by the model. Is it because the sentence is fed to the model over multiple epochs?
No, including multiple identical copies of data into the training set does not add any new information.
If you use probabilities, it’s educated guessing. That’s how language models work. Based on the words you put in the prompt, it makes educated guesses as to the most likely words that should follow it.
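To illustrate the “educated guessing”, here is a minimal sketch with made-up scores for a handful of candidate next words (the numbers are invented, not from any real model):

```python
# Minimal sketch of "educated guessing": turn scores over candidate words into
# probabilities with a softmax and pick the most likely one. Toy numbers only.
import math

vocab = ["Paris", "London", "Rome", "banana"]
logits = [5.1, 2.3, 1.9, -3.0]  # hypothetical scores for "The capital of France is ..."

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]  # softmax: probabilities proportional to exp(score)

for token, p in zip(vocab, probs):
    print(f"{token:>7s}: {p:.3f}")

print("greedy choice:", vocab[probs.index(max(probs))])
# In practice models often sample from these probabilities instead of always taking the max.
```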
I didn’t mean duplicate copies in the training set. The same sentence will still be fed to the model in multiple training epochs, right? And the model weights will be in a different state in each epoch, so the gradient for the same training input (our hypothetical sentence) will be different each time. I just can’t believe that’s enough for the recall performance I’m observing in commercial LLMs.
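For concreteness, this is the kind of repeated exposure I have in mind (a toy, hypothetical setup, nothing like real LLM training):

```python
# Toy repeated-exposure experiment (hypothetical setup, nothing like real LLM training):
# feed the *same* short sequence for many steps and watch the loss on it keep falling,
# even though each individual step changes the weights only slightly.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

sentence = torch.randperm(vocab_size)[:10].unsqueeze(0)  # stand-in for the unique "fact" sentence
inputs, targets = sentence[:, :-1], sentence[:, 1:]

for step in range(1, 201):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step:3d}  loss on the repeated sentence: {loss.item():.4f}")
```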
To clarify, I’m interested in how LLMs are trained, not in how a language model works in general. My guess is that commercial LLMs have many parameters and their training involves many epochs, but that is still rather vague.
I appreciate you giving summaries of basic concepts and correcting pieces of my question. However, none of those answers my question. I just hope that my question makes sense to somebody. To rephrase it again: “After being trained on hundreds of billions of inputs, how does an LLM perfectly recall one unique input?”
It’s a good question and it lies at the heart of Language Models (LMs).
As Tom mentioned, LMs learn the probability of a token based on other tokens: the tokens surrounding it (for encoders) or the tokens preceding it (for generators). They can learn those probabilities because language is not random; it has structure and information content (and this is universal across different languages).
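As a rough sketch of those two setups (toy tokens only, no real model involved):

```python
# Sketch of the two prediction setups (toy tokens, no real model):
# an encoder-style (masked) LM predicts a token from the tokens on both sides of it,
# a decoder-style (causal) LM predicts it only from the tokens that came before it.
tokens = ["In", "1815", "Napoleon", "lost", "the", "battle", "of", "Waterloo"]
target_pos = 2  # predict "Napoleon"

masked_context = tokens[:target_pos] + ["[MASK]"] + tokens[target_pos + 1:]
causal_context = tokens[:target_pos]

print("encoder (masked) context:", " ".join(masked_context))
print("decoder (causal) context:", " ".join(causal_context))
# Either way, training pushes up p(target token | its context) via a cross-entropy loss.
```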
Large Language Models are “strong compressors”. In other words, they can learn “important” pieces of information. For example, in:
“…blah blah blah[Some King]blah blah blah[Some Battle]blah blah…”
they can store/associate the important tokens efficiently so they can retrieve them later, because most of those “blah blah blah” parts are not very important and can be expressed in different words and even in different languages.
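To give a rough feel for the compression view (with made-up probabilities, not taken from any real model): encoding a token under a model costs roughly the negative log2 of the probability the model assigns to it, so tokens it predicts well are almost free:

```python
# Sketch of the "LMs are compressors" view (made-up probabilities, not from a real model):
# encoding a token under the model costs -log2(p(token | context)) bits, so the
# predictable "blah blah" tokens are cheap and only the surprising ones cost much.
import math

# hypothetical per-token probabilities a model assigns while reading a sentence
sentence = [("the", 0.40), ("battle", 0.05), ("of", 0.60), ("[SomePlace]", 0.01), ("ended", 0.30)]

total_bits = 0.0
for token, p in sentence:
    bits = -math.log2(p)
    total_bits += bits
    print(f"{token:>12s}: p={p:.2f} -> {bits:5.2f} bits")

print(f"total: {total_bits:.2f} bits to encode the sentence under this hypothetical model")
```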
Another point is that LLMs do not “perfectly” recall unique inputs. Most of the time they sound “confident”, and when they get the prediction/generation right you might think that they “perfectly recall” or are very smart. But when they get the prediction/generation wrong, they look ridiculous (they hallucinate), and you might not attribute that to “forgetting”.
So you should go case by case through those unique inputs that LLMs “perfectly recall” and check why they got them right (which is definitely not straightforward if you don’t have access to the training dataset), and don’t forget the cases where they don’t “perfectly recall”.
The compression concept helped clear up my confusion. I always saw the attention mechanism as a way to emphasize important parts of the input, but fading out the “blah blah” seems to be a useful way to reason about it as well.
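To check that intuition for myself, I put together a toy single-query attention computation (random vectors, purely illustrative) to see how the softmax weights concentrate on a few positions and fade out the rest:

```python
# Toy single-query attention (random vectors, purely illustrative): softmax over
# query-key similarities puts most weight on a few positions and fades out the rest.
import torch

torch.manual_seed(0)
d = 16
keys = torch.randn(6, d)                # 6 token positions
query = keys[3] + 0.1 * torch.randn(d)  # a query that happens to match position 3 closely

scores = keys @ query / d**0.5          # scaled dot-product similarities
weights = torch.softmax(scores, dim=0)
print("attention weights per position:", [f"{w:.2f}" for w in weights.tolist()])
```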