Test fine-tuned LLM against TruthfulQA dataset

After fine-tuning my model, I would like to test its performance on the TruthfulQA dataset.
Has anyone done this?

Or is there any resource on how to go about it?

This is a general problem: to evaluate a fine-tuned model, you need to prepare a test set to run it against. Evaluation is not straightforward, but a metric such as the BLEU score may be used. You can search for “Evaluation of fine-tuned model” or “BLEU score”.
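
To make the BLEU suggestion concrete, here is a minimal sketch of a sentence-level BLEU score in plain Python. It assumes whitespace tokenization, a single reference answer, and no smoothing, so it is a simplification of what evaluation libraries do; the function names `ngram_counts` and `simple_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference.

    Assumptions (not from any particular library): lowercase,
    whitespace tokenization, uniform weights over 1..max_n grams.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: a candidate n-gram is credited at most as
        # many times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
    print(simple_bleu("the cat", "the cat sat", max_n=2))
```

Because the score is unsmoothed, short answers that share no 4-gram with the reference get 0, so for short question-answering outputs a smaller `max_n` or a smoothed implementation from an established library is probably the safer choice.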

Thank you. I see many models on Hugging Face that report evaluation results on the TruthfulQA dataset. I want to perform the same evaluation for my model.
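
For what it's worth, the TruthfulQA numbers on model cards are often from the multiple-choice variants (MC1 or MC2), and tools such as EleutherAI's lm-evaluation-harness ship TruthfulQA tasks. The sketch below only illustrates the MC1 idea: the model "answers" with whichever choice it scores highest, and accuracy is the fraction of questions where that choice is the one labeled true. It assumes records shaped like the `multiple_choice` configuration of the `truthful_qa` dataset on the Hugging Face Hub (`question`, plus `mc1_targets` with `choices` and `labels`). The function `mc1_accuracy`, the toy records, and the toy scorer are hypothetical; a real scorer would return the model's log-likelihood of the choice given the question.

```python
from typing import Any, Callable, Dict, List


def mc1_accuracy(
    examples: List[Dict[str, Any]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of questions whose top-scored choice is labeled 1.

    score_fn(question, choice) is a placeholder for the model's
    log-likelihood of `choice` as the answer to `question`.
    """
    if not examples:
        raise ValueError("need at least one example")

    correct = 0
    for ex in examples:
        choices = ex["mc1_targets"]["choices"]
        labels = ex["mc1_targets"]["labels"]
        scores = [score_fn(ex["question"], c) for c in choices]
        best = max(range(len(choices)), key=scores.__getitem__)
        correct += int(labels[best] == 1)
    return correct / len(examples)


# Toy records invented for illustration; they only mimic the field layout.
toy_examples = [
    {"question": "Q1", "mc1_targets": {"choices": ["a", "b"], "labels": [1, 0]}},
    {"question": "Q2", "mc1_targets": {"choices": ["c", "d", "e"], "labels": [0, 1, 0]}},
]

# Made-up scores standing in for model log-likelihoods.
fake_scores = {
    ("Q1", "a"): -1.0, ("Q1", "b"): -2.0,
    ("Q2", "c"): -0.5, ("Q2", "d"): -3.0, ("Q2", "e"): -4.0,
}


def toy_score(question: str, choice: str) -> float:
    return fake_scores[(question, choice)]


if __name__ == "__main__":
    print(mc1_accuracy(toy_examples, toy_score))  # 0.5 on the toy data
```

To compare against published numbers, the prompt format and scoring details need to match whatever harness produced them, so running the same harness on your model is likely more reliable than a hand-rolled loop like this one.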

I came across langtest, but Colab easily runs out of memory on a T4 instance when running it. My model is based on Mistral 7B.
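
A rough back-of-the-envelope calculation suggests why this happens. The sketch below estimates the memory needed for the model weights alone at different precisions. The parameter count (about 7.24 billion for Mistral 7B) and the usable T4 memory (about 15 GiB of the nominal 16 GB) are approximations, and the helper name `weight_memory_gib` is made up for this example. Activations, the KV cache, and framework overhead come on top of these figures.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in GiB (ignores activations and cache)."""
    return n_params * bits_per_param / 8 / 2**30


# Approximate values, treated as assumptions in this sketch.
MISTRAL_7B_PARAMS = 7.24e9   # roughly 7.24 billion parameters
T4_USABLE_GIB = 15.0         # roughly what a Colab T4 exposes

if __name__ == "__main__":
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
        need = weight_memory_gib(MISTRAL_7B_PARAMS, bits)
        verdict = "fits" if need < T4_USABLE_GIB else "does not fit"
        print(f"{name:>5}: {need:5.1f} GiB of weights ({verdict} before overhead)")
```

In fp16 the weights alone come to roughly 13.5 GiB, which leaves very little headroom on a T4. Loading the model in 8-bit or 4-bit (for example through the `BitsAndBytesConfig` quantization options in `transformers`) is one common way around this, though whether langtest lets you pass such options is not something this sketch assumes.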