How to evaluate in-context learning/inferencing LLMs (ChatGPT)

When training large language models (LLMs), a common way to evaluate their performance is to use in-context learning (ICL) tasks. These tasks require LLMs to complete sentences or answer questions posed in natural language without updating the model’s weights. The model must infer what the task is, figure out how the task works, and determine how to apply it to a new example, all by using contextual clues in the prompt — without ever having been trained to perform that specific task.

While traditional ML metrics like cross-entropy can be computed with a single PyTorch library function call (see the sketch after the list below), I am not sure how to evaluate in-context learning/inferencing LLMs (ChatGPT). Here is what I think:

  • Evaluate the Model Qualitatively (Human Evaluation)
  • What else?
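A minimal sketch of that single-call cross-entropy evaluation, assuming dummy logits and token ids rather than real model outputs:

```python
import torch
import torch.nn.functional as F

# Dummy next-token logits for 4 positions over a 10-token vocabulary
# (in practice these come from the model's forward pass).
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 7, 3])  # ground-truth token ids

# One library call gives the cross-entropy loss used in standard LLM training/eval.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```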

I guess it will depend on the task. In the course we saw two ways to evaluate models: ROUGE and BLEU.

ROUGE is great for evaluating text summarization/generation.
BLEU is mainly used to evaluate translations.
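As a rough illustration, both metrics can be computed directly on the model's output text. This sketch assumes the rouge_score and nltk packages are installed, and the example strings are invented:

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat sat on the mat"   # ground-truth text
candidate = "the cat is on the mat"    # hypothetical model output

# ROUGE: n-gram / longest-common-subsequence overlap with a reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# BLEU: modified n-gram precision against one or more reference translations.
print(sentence_bleu([reference.split()], candidate.split()))
```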

For other tasks, like classification, you can use common metrics such as accuracy, precision, recall, F1-score, and the confusion matrix, among others.
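For the classification case, a minimal scikit-learn sketch (the labels are invented, and in practice y_pred would be parsed from the model's responses):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Gold labels vs. labels extracted from the model's answers (hypothetical example).
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

print(accuracy_score(y_true, y_pred))
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
```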

Thank you very much, Juan.
As far as I understand, ROUGE and BLEU are used to measure LLM performance when we fine-tune them (their parameters/weights are updated). When we use LLMs with in-context learning, we do not have access to their parameters/weights (they are frozen); the model simply infers what the task is, figures out how the task works, and determines how to apply it to a new example, all by using contextual clues in the prompt — without ever having been trained to perform that specific task.

The problem/task I am working on requires a multi-task LLM, more specifically: information extraction, reasoning, text classification, and question answering.

I am looking for relevant evaluation metrics for LLMs’ in-context learning.

Thanks for the clarification @Awad_A_Younis_Mussa ! Regarding “in-context learning”, I am not sure we can call this real training, in the sense that the model is not necessarily acquiring permanent learning. The learning exists only as long as the context exists. Once the context changes or is gone, so does that ‘learning’.

Having said that, measuring the quality of this ephemeral learning also depends on the task at hand and, as in the previous response, you can take the output of the model and run any of the metrics discussed, like ROUGE or BLEU. Other options include human evaluation.

But the most important thing, in my opinion, is to be aware that since this is ephemeral learning, the metrics’ results are also ephemeral — there is no static or permanent learning.

Here is my final thought:
ROUGE and BLEU metrics are commonly used in natural language processing tasks such as machine translation or text generation, where fine-tuning or training on specific datasets is involved. These metrics compare the generated text against reference texts or ground truth, typically during the training or fine-tuning stages, to evaluate the model’s performance.

In the case of in-context learning, where the model adapts to a specific context without fine-tuning or access to its parameters/weights, precision, recall, and F1 score are more appropriate evaluation metrics. These metrics assess the alignment between the model’s generated responses or mappings and the ground truth within the specific context or prompt window.

Precision, recall, and F1 score are widely used in information retrieval, text classification, and question-answering tasks, which is exactly the kind of problem I am working on. They evaluate the model’s ability to generate relevant outputs and capture the correct mappings or responses without explicitly considering the training or fine-tuning process.

Therefore, for evaluating the in-context inferencing or learning capabilities of ChatGPT (the LLM I am experimenting with in my toy project), precision, recall, and F1 score are more suitable metrics to assess the model’s performance within the given context. They focus on the relevance and correctness of the model’s outputs rather than comparing them against reference texts, which is more applicable in fine-tuning or training scenarios.
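To make this concrete, here is a sketch of how precision, recall, and F1 could be scored for a question-answering or extraction output against a ground-truth answer, using simple token overlap in the style of SQuAD evaluation (the strings and the token_f1 helper are illustrative, not from any particular library):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a model answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical in-context QA output vs. ground truth.
print(token_f1("in Paris France", "Paris"))  # precision 1/3, recall 1.0, F1 0.5
```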

Thanks for the great summary, @Awad_A_Younis_Mussa !!!

I would like to share my reaction to your post:

I still think that ROUGE and BLEU can be used with in-context learning for summarization and translation tasks. In the same way that you use precision, recall, and F1 for your current tasks, ROUGE and BLEU can be used for those other tasks. After all, the input to the evaluation is the output of the model, regardless of whether that output comes from fine-tuning or from in-context learning. Just keep in mind that in-context learning is ephemeral, but the quality of an in-context output for a summarization task can still be measured with ROUGE.

That’s my reaction :slight_smile: