How to evaluate in-context learning/inferencing LLMs (ChatGPT)

When training large language models (LLMs), a common way to evaluate their performance is to use in-context learning (ICL) tasks. These tasks require LLMs to complete sentences or answer questions posed in natural language without updating the model’s weights. The model must infer what the task is, figure out how the task works, and determine how to apply it to a new example, all by using contextual clues in the prompt — without ever having been trained to perform that specific task.

While traditional ML metrics like cross-entropy can be computed with a single PyTorch library function call (see the sketch after the list below), I am not sure how to evaluate in-context learning/inferencing LLMs (ChatGPT). Here is what I think:

  • Evaluate the Model Qualitatively (Human Evaluation)
  • What else?
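A minimal sketch of that single-call cross-entropy evaluation, assuming dummy logits and token ids rather than real model outputs:

```python
import torch
import torch.nn.functional as F

# Dummy next-token logits for 4 positions over a 10-token vocabulary
# (in practice these come from the model's forward pass).
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 7, 3])  # ground-truth token ids

# One library call gives the cross-entropy loss used in standard LLM training/eval.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```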

I guess it will depend on the task. In the course we saw two ways to evaluate models: ROUGE and BLEU.

ROUGE is great for evaluating text summarization/generation.
BLEU is mainly used to evaluate translations.
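As a rough illustration, both metrics can be computed directly on the model's output text. This sketch assumes the rouge_score and nltk packages are installed, and the example strings are invented:

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat sat on the mat"   # ground-truth text
candidate = "the cat is on the mat"    # hypothetical model output

# ROUGE: n-gram / longest-common-subsequence overlap with a reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# BLEU: modified n-gram precision against one or more reference translations.
print(sentence_bleu([reference.split()], candidate.split()))
```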

For other tasks, like classification, you can use common metrics such as accuracy, precision, recall, F1-score, and the confusion matrix, among others.
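For the classification case, a minimal scikit-learn sketch (the labels are invented, and in practice y_pred would be parsed from the model's responses):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Gold labels vs. labels extracted from the model's answers (hypothetical example).
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

print(accuracy_score(y_true, y_pred))
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
```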

Thank you very much, Juan.
As far as I understand, ROUGE and BLEU are used to measure LLM performance when we fine-tune them (their parameters/weights are updated). When we use LLMs with in-context learning, we do not have access to their parameters/weights (they are frozen); the model simply infers what the task is, figures out how the task works, and determines how to apply it to a new example, all by using contextual clues in the prompt — without ever having been trained to perform that specific task.

The problem/task I am working on requires a multi-task LLM, more specifically: information extraction, reasoning, text classification, and question answering.

I am looking for relevant evaluation metrics for LLMs’ in-context learning.

Thanks for the clarification @Awad_A_Younis_Mussa ! Regarding “in-context learning”, I am not sure we can call this real training, in the sense that the model is not necessarily acquiring permanent learning. The learning exists only as long as the context exists. Once the context changes or is gone, so does that ‘learning’.

Having said that, measuring the quality of this ephemeral learning also depends on the task at hand and, as in the previous response, you can take the output of the model and run any of the metrics discussed, like ROUGE or BLEU. Other options include human evaluation.

But the most important thing, in my opinion, is to be aware that since this is ephemeral learning, the metrics’ results are also ephemeral — there is no static or permanent learning.

Here is my final thought:
ROUGE and BLEU metrics are commonly used in natural language processing tasks such as machine translation or text generation, where fine-tuning or training on specific datasets is involved. These metrics compare the generated text against reference texts or ground truth, typically during the training or fine-tuning stages, to evaluate the model’s performance.

In the case of in-context learning, where the model adapts to a specific context without fine-tuning or access to its parameters/weights, precision, recall, and F1 score are more appropriate evaluation metrics. These metrics assess the alignment between the model’s generated responses or mappings and the ground truth within the specific context or prompt window.

Precision, recall, and F1 score are widely used in information retrieval, text classification, and question-answering tasks, which is exactly the kind of problem I am working on. They evaluate the model’s ability to generate relevant outputs and capture the correct mappings or responses without explicitly considering the training or fine-tuning process.

Therefore, for evaluating the in-context inferencing or learning capabilities of ChatGPT (the LLM I am experimenting with in my toy project), precision, recall, and F1 score are more suitable metrics to assess the model’s performance within the given context. They focus on the relevance and correctness of the model’s outputs rather than comparing them against reference texts, which is more applicable in fine-tuning or training scenarios.
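To make this concrete, here is a sketch of how precision, recall, and F1 could be scored for a question-answering or extraction output against a ground-truth answer, using simple token overlap in the style of SQuAD evaluation (the strings and the token_f1 helper are illustrative, not from any particular library):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a model answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical in-context QA output vs. ground truth.
print(token_f1("in Paris France", "Paris"))  # precision 1/3, recall 1.0, F1 0.5
```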

Thanks for the great summary, @Awad_A_Younis_Mussa !!!

I would like to share my reaction to your post:

I still think that ROUGE and BLEU can be used with in-context learning for summarization and translation tasks. In the same way that you use precision, recall, and F1 for your current tasks, ROUGE and BLEU can be used for those other tasks. After all, the input to the evaluation is the output of the model, regardless of whether that output comes from fine-tuning or from in-context learning. Just keep in mind that in-context learning is ephemeral, but the quality of an in-context output for a summarization task can still be measured with ROUGE.

That’s my reaction :slight_smile: