L9 Evaluation Part II - An example where GPT performs poorly at fact-checking

In the latter part of the lesson, where GPT is used to compare the ideal answer with the assistant's output, the example below shows that the model struggles with fact-checking: several numbers in the assistant's answer do not match the original prompt, yet they go undetected. Moreover, GPT appears to weight the numbers that appear early in the answer more heavily than the later ones.

Code example:
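# assistant_answer_3 below contains several product numbers that do not match the facts in the original prompt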

assistant_answer_3 = "Sure, I'd be happy to help! The SmartX ProPhone is a powerful smartphone with a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G capabilities. The FotoSnap DSLR Camera is a versatile camera with a 24.9MP sensor, 1081p video, 5-inch LCD, and interchangeable lenses. As for TVs and TV-related products, we have a variety of options including the CineView 4K TV with a 54-inch display, HDR, and smart TV capabilities, the CineView 8K TV with an 8K resolution and a 53-inch display, and the CineView OLED TV with a 55-inch display and true blacks. We also have the SoundMax Home Theater system with a 5.2 channel and 1100W output, and the SoundMax Soundbar with a 2.2 channel and 400W output. Do you have any specific questions about these products or are you looking for any particular features?"

eval_vs_ideal(test_set_ideal, assistant_answer_3)
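For context, here is a minimal sketch of what an eval_vs_ideal-style grader could look like, assuming the OpenAI chat completions API. The prompt wording, the letter rubric, and the test_set keys (customer_msg, ideal_answer) are approximations of the course helper, not its exact code:

# Minimal sketch of an LLM-based grader, loosely modelled on the course's
# eval_vs_ideal helper. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def get_completion(messages, model="gpt-3.5-turbo", temperature=0):
    """Call the chat completions API and return the message text."""
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature
    )
    return response.choices[0].message.content

def eval_vs_ideal(test_set, assistant_answer):
    """Ask the model to grade the assistant answer against the ideal answer with a single letter."""
    system_message = "You compare a submitted answer to an expert answer for a customer query. Respond with a single letter only."
    user_message = f"""[Question]: {test_set['customer_msg']}
[Expert answer]: {test_set['ideal_answer']}
[Submitted answer]: {assistant_answer}

Compare the factual content of the submitted answer with the expert answer, then choose one option:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
"""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]
    return get_completion(messages)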

Compare the numbers in this answer with those in the original prompt to see where they differ.

Yet the output of the evaluation is A.
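One way to catch this kind of mismatch without relying on the LLM grader is a plain numeric comparison. The sketch below is my own addition, not course code; it extracts the numeric tokens mentioned in each answer and flags the ones the assistant states that never appear in the ideal answer, assuming test_set_ideal is a dict with an ideal_answer key:

# Complementary, deterministic check: surface numbers the assistant mentions
# that are absent from the ideal answer (a rough proxy for factual drift).
import re

def numbers_in(text):
    """Return the set of numeric tokens (e.g. '6.1', '128', '1081') found in a text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def numeric_mismatches(ideal_answer, assistant_answer):
    """Numeric tokens stated by the assistant that never appear in the ideal answer."""
    return numbers_in(assistant_answer) - numbers_in(ideal_answer)

# Example usage (hypothetical key name):
# numeric_mismatches(test_set_ideal["ideal_answer"], assistant_answer_3)
# would surface any altered values that the letter-grade evaluation missed.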

It is a language model. Not a fact machine.

Hi there, yes, it is an instruction-tuned LLM and not a fact machine. I'm giving an example of this limitation because one needs to be careful when deploying to production: in the customer-service domain, the chatbot needs to give accurate answers to users. I believe that is why this short course places so much emphasis on evaluation.