L9 Evaluation Part II - An example that GPT works poorly with fact-checking

shawn · June 25, 2023, 4:31pm

In the latter part of the lesson, where GPT is used to compare the ideal answer with the GPT output, the example below shows that the model struggles to fact-check effectively (several numbers do not match). GPT also appears to give more weight to the earlier numbers than to the later ones.

Code example:

assistant_answer_3 = "Sure, I'd be happy to help! The SmartX ProPhone is a powerful smartphone with a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G capabilities. The FotoSnap DSLR Camera is a versatile camera with a 24.9MP sensor, 1081p video, 5-inch LCD, and interchangeable lenses. As for TVs and TV-related products, we have a variety of options including the CineView 4K TV with a 54-inch display, HDR, and smart TV capabilities, the CineView 8K TV with an 8K resolution and a 53-inch display, and the CineView OLED TV with a 55-inch display and true blacks. We also have the SoundMax Home Theater system with a 5.2 channel and 1100W output, and the SoundMax Soundbar with a 2.2 channel and 400W output. Do you have any specific questions about these products or are you looking for any particular features?"

eval_vs_ideal(test_set_ideal, assistant_answer_3)

Check where it differs from the original prompt:

prompt-difference-example - Diff Checker

The evaluation outputs A, even though several numbers differ.
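One way to guard against this is to run a deterministic check alongside the LLM-based evaluation. The sketch below is only an illustration and not part of the course code: `numeric_tokens` and `numeric_mismatches` are hypothetical helper names, and the `ideal_snippet` string is a made-up placeholder rather than the course's actual ideal answer. It extracts number-like tokens (such as `24.9MP` or `5-inch`) from both texts with a regular expression and reports any that appear in one text but not the other.

```python
import re
from collections import Counter

# Matches a number with an optional unit suffix, e.g. "6.1-inch", "128GB", "1081p", "5.2".
NUMBER_TOKEN = re.compile(r"\d+(?:\.\d+)?(?:-?[A-Za-z]+)?")


def numeric_tokens(text):
    """Count every number-like token in the text, case-insensitively."""
    return Counter(token.lower() for token in NUMBER_TOKEN.findall(text))


def numeric_mismatches(ideal, answer):
    """Return (missing, unexpected) number tokens of the answer relative to the ideal."""
    ideal_counts = numeric_tokens(ideal)
    answer_counts = numeric_tokens(answer)
    missing = sorted((ideal_counts - answer_counts).elements())
    unexpected = sorted((answer_counts - ideal_counts).elements())
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical placeholder text, NOT the course's real ideal answer.
    ideal_snippet = "a 24.2MP sensor, 1080p video, 3-inch LCD"
    # Fragment taken from assistant_answer_3 above.
    answer_snippet = "a 24.9MP sensor, 1081p video, 5-inch LCD"

    missing, unexpected = numeric_mismatches(ideal_snippet, answer_snippet)
    if missing or unexpected:
        print("Numbers missing from the answer:", missing)
        print("Numbers not present in the ideal:", unexpected)
```

A check like this only catches numeric differences and knows nothing about meaning, so it would complement eval_vs_ideal rather than replace it.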

TMosh · June 25, 2023, 4:50pm

It is a language model. Not a fact machine.

shawn · June 25, 2023, 5:55pm

Hi there, yes, it is an instruction-following LLM and not a fact machine. I'm giving an example showing that GPT has this limitation and that one needs to be careful when deploying it to production. In the customer service domain, a chatbot needs to give users accurate answers, and I believe that's why evaluation is emphasized in this short course.
