Quiz - Week 3 - RLHF reward hacking - end-of-video quiz - interpretability

The end-of-lesson quiz in https://www.coursera.org/learn/generative-ai-with-llms/lecture/Cux3s/rlhf-reward-hacking suggests the following:

RLHF can enhance the interpretability of generated text

Correct
By involving human feedback, models can be tuned to provide explanations or insights into their decision-making processes, improving interpretability and allowing users to better understand the model’s outputs.

However, that’s not mentioned anywhere in the course, and it’s not really clear how that works. I wasn’t able to find an answer using Google either, and Google Bard / ChatGPT-4 responses don’t indicate that it’s directly true.

Could I get clarification on this, please?

Thanks,
B

I did a quick Google search using the text “RLHF can enhance the interpretability of generated text” and was able to find the string that was mentioned. Maybe try using Google search, and let me know if it doesn’t work out for you. In that case, please connect with one of the mentors for the course to get guidance.

I tried again, using your search phrase as well, and didn’t get results.
One link mentioned it almost verbatim [1] without going into the details of why that’s the case.

[1] https://www.larksuite.com/en_us/topics/ai-glossary/rlhf-reinforcement-learning-from-human-feedback

@TMosh can you help here?

Sorry, I can’t, I’m not a mentor for that course.

@gent.spah if you can help?

Well, it may not be mentioned per se in the course, but it’s a natural outcome of the human factor involved: the feedback will steer the output toward a more human-aligned outcome.

You could possibly try asking models with and without RLHF, but by using human feedback you can steer the model in any direction you want; that’s the main point here!


Thanks for your response, but, respectfully and for lack of better phrasing, that’s a bit hand-wavy. My interpretation of the statement “models can be tuned to provide explanations or insights into their decision-making processes, improving interpretability and allowing users to better understand the model’s outputs” is much wider. It initially led me to think the quiz might be referring to XAI (explainable AI), which is what led me to this question in the forum: “if this really means XAI, then how?”

Based on the response above, it seems it’s not related to XAI, and “insights into their decision-making processes” probably meant to refer to “insights into the possible reward model used to tune it.”
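
To make that interpretation concrete, here is a minimal sketch of the pairwise preference loss typically used to train such a reward model (a Bradley-Terry style objective). The function name and toy tensors are illustrative assumptions on my part, not code from the course:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model training signal: humans compare two responses
# to the same prompt, and the reward model is trained so the preferred
# ("chosen") response scores higher than the dispreferred ("rejected") one.

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar rewards for the chosen and rejected
    # responses, shape (batch,). Loss shrinks as r_chosen exceeds r_rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: rewards a reward model might assign to a batch of 3 prompts.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(r_chosen, r_rejected))
```

Since the reward signal comes entirely from which outputs human raters preferred, RLHF steers the tuned model toward whatever behavior those raters rewarded; if raters reward outputs that explain their reasoning, the model will tend to produce such explanations, which is presumably the sense in which the quiz connects RLHF to interpretability.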
