Benchmarking accuracy of various large language models

Hello everyone,

Hope everyone is doing great. I recently started working with LLMs and I'm wondering what mechanisms are used to benchmark their accuracy. I am specifically looking to fine-tune a QnA application.
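For extractive QnA, a common starting point is the SQuAD-style pair of metrics: exact match (EM) and token-level F1 between the model's answer and the reference answer. Here is a minimal sketch of those two metrics in plain Python; the function names and the example predictions are illustrative, not from any particular library.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions vs. reference answers
preds = ["The Eiffel Tower", "in 1989"]
golds = ["Eiffel Tower", "1969"]
em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
f1 = sum(token_f1(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"EM: {em:.2f}  F1: {f1:.2f}")
```

For generative (free-form) answers these string metrics are rougher, and people often add semantic scores (e.g. ROUGE, BERTScore) or an LLM-as-judge evaluation on top, but EM/F1 is a useful baseline while fine-tuning.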


A nice course on this topic was recently launched:
Evaluating and Debugging Generative AI

Please have a look at it. I think it is helpful for what you are looking for.

Hey Nydia,

Thanks for pointing me to this course. It really helps with evaluating and debugging. This is just the kind of thing I was looking for.