Andrew talks about ‘marking’ LLM output when fine-tuning. How might these ‘numbers’ get back into the model to improve it?
Not mentioned?
In The Batch, there was a nice article from DL.AI on RLHF:
RLHF basics: A popular approach to tuning large language models, RLHF follows four steps:
(1) Pretrain a generative model.
(2) Use the model to generate data and have humans assign a score to each output.
(3) Given the scored data, train a model — called the reward model — to mimic the way humans assigned scores. Higher scores are tantamount to higher rewards.
(4) Use scores produced by the reward model to fine-tune the generative model, via reinforcement learning, to produce high-scoring outputs.
In short, a generative model produces an example, a reward model scores it, and the generative model learns based on that score.
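To make that loop a bit more concrete, here is a minimal toy sketch in PyTorch. It is not how real RLHF is implemented (actual systems use large transformers, human preference data, and PPO); the tiny linear "generator" and "reward model", the random placeholder human scores, and the simple REINFORCE-style update are all assumptions made just to show how the scores flow back into the generative model.

```python
# Toy sketch of the four RLHF steps above. All models and data are
# stand-ins; real RLHF uses large transformers and PPO.
import torch
import torch.nn as nn

VOCAB = 8          # toy "vocabulary" of possible outputs
torch.manual_seed(0)

# (1) Pretrained generative model, stand-in: logits over a tiny vocabulary.
generator = nn.Linear(1, VOCAB)

def generate(n):
    """Sample n outputs and return them with their log-probabilities."""
    logits = generator(torch.ones(n, 1))
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample()
    return samples, dist.log_prob(samples)

# (2) Generate data and have "humans" score it (random placeholders here).
samples, _ = generate(64)
human_scores = torch.rand(64)

# (3) Train a reward model to mimic the human scores.
reward_model = nn.Linear(VOCAB, 1)
opt_r = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
onehot = nn.functional.one_hot(samples, VOCAB).float()
for _ in range(200):
    pred = reward_model(onehot).squeeze(-1)
    loss = nn.functional.mse_loss(pred, human_scores)
    opt_r.zero_grad(); loss.backward(); opt_r.step()

# (4) Fine-tune the generator with RL: increase the log-probability of
# outputs the reward model scores highly (REINFORCE-style update here;
# real RLHF typically uses PPO).
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-2)
for _ in range(200):
    samples, log_probs = generate(64)
    with torch.no_grad():
        rewards = reward_model(
            nn.functional.one_hot(samples, VOCAB).float()).squeeze(-1)
    loss = -(rewards * log_probs).mean()
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```

The key point is in step (4): the human scores never touch the generative model directly; they only train the reward model, and the reward model's scores then act as the learning signal for the generator.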
If you want to go deeper and get some hands-on experience, you might want to check out the GenAI course: Generative AI with LLMs - DeepLearning.AI
Best regards
Christian
Thanks, Christian. That makes sense: building another model, hopefully on top of the main one.
Andrew did not mention that.