I remember reading this either in one of the DeepLearning.ai weekly updates Dr. Andrew Ng sends out, or in a LinkedIn post he made.
In it, he discussed how, if we ask an LLM to solve a problem and then explain how it solved it, the explanation often doesn't match the solution.
Here’s a hypothetical example: you ask some LLM to solve this:
‘There is 1 apple on the table. Then a man brings a banana. Then the man eats the apple. How many fruits are left on the table? Please explain your work as well.’
The LLM’s completion:
‘Explanation: There were two fruits and the man ate one, so there’s only 1 left.
Answer: Two fruits left.’
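If anyone wants to try reproducing this kind of mismatch themselves, here's a minimal sketch assuming the OpenAI Python client with an OPENAI_API_KEY set in the environment; the model name is just a placeholder, and any chat-capable LLM would do.

```python
# Minimal sketch: send the hypothetical prompt to an LLM and print the
# completion, so you can check whether the explanation matches the answer.
# Assumes the OpenAI Python client (pip install openai) and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

prompt = (
    "There is 1 apple on the table. Then a man brings a banana. "
    "Then the man eats the apple. How many fruits are left on the table? "
    "Please explain your work as well."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```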
There is a way to reduce cases where the explanation doesn't match the answer. Does anyone remember what that methodology is?