L9 Evaluation II Inconsistency in result - Getting 'D' where I should get 'A'

Jayachithra · August 19, 2023, 12:10pm

I am just running the notebook without changing anything. When comparing the ideal answer with the model-generated answer, the score I get from the model is ‘D’. According to the lecture, and based on the answer itself, I would expect an ‘A’. Did something change?

Alexio_Cassani · August 21, 2023, 4:35pm

Same here. I asked the model to explain its reasoning, and this is the answer:

D) There is a disagreement between the submitted answer and the expert answer. \n\nExplanation: The submitted answer provides some information about the SmartX ProPhone and the FotoSnap DSLR Camera, but it does not include all the details mentioned in the expert answer. The expert answer provides specific features, such as the 12MP dual camera for the SmartX ProPhone and the 24.2MP sensor for the FotoSnap DSLR Camera, which are not mentioned in the submitted answer. Additionally, the expert answer provides information about the price and warranty for both products, which is missing in the submitted answer. Therefore, there is a disagreement between the two answers.

That is totally untrue of course…

symeneses · August 31, 2023, 3:33pm

I did the same and I got similar explanations.

‘The selected choice is (D) There is a disagreement between the submitted answer and the expert answer. \n\nThe submitted answer provides some information about the SmartX ProPhone and the FotoSnap DSLR Camera, but it does not include all the details mentioned in the expert answer. The expert answer provides specific features such as 5G wireless, 128GB storage, and a 12MP dual camera for the SmartX ProPhone, while the submitted answer only mentions the 6.1-inch display, 128GB storage, and a 12MP dual camera. Similarly, for the FotoSnap DSLR Camera, the expert answer mentions features like 1080p video, a 3-inch LCD, and interchangeable lenses, which are not mentioned in the submitted answer. Therefore, there is a disagreement between the submitted answer and the expert answer in terms of the details provided about the products.’
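The two explanations quoted above put the grade in different places: one starts with `D)`, the other embeds `(D)` mid-sentence. A minimal sketch of pulling the letter out of either form, assuming only those two formats; `extract_choice` is a hypothetical helper, not part of the course notebook:

```python
import re


def extract_choice(response, choices="ABCDE"):
    """Return the first grade letter found in `response`, or None.

    Hypothetical helper (an assumption, not the notebook's code).
    Recognises a parenthesised letter such as "(D)" anywhere in the text,
    or a bare leading letter such as "D)" at the very start.
    """
    pattern = rf"\(([{choices}])\)|^\s*([{choices}])\)"
    match = re.search(pattern, response)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on which form matched.
    return match.group(1) or match.group(2)


leading = "D) There is a disagreement between the submitted answer and the expert answer."
embedded = "The selected choice is (D) There is a disagreement between the answers."
grade_leading = extract_choice(leading)
grade_embedded = extract_choice(embedded)
```

This only normalises how the grade is read; it does not change which grade the model picks.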

foo · November 5, 2023, 12:30pm

Same here. I tried swapping ‘D’ with ‘E’ in the prompt, and now I am getting ‘C’:

"""
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (E) There is a disagreement between the submitted answer and the expert answer.
    (D) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""
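The label swap above can be made repeatable by generating the option list from a label string and keeping a map from each letter back to its meaning. This is a rough sketch under stated assumptions: `OPTIONS`, `build_options`, and the short keys such as `"disagree"` are hypothetical names for illustration, and nothing here calls the model.

```python
# Hypothetical helper for label-swap experiments; not the notebook's code.
# The option texts are copied from the prompt quoted above.
OPTIONS = {
    "subset": "The submitted answer is a subset of the expert answer and is fully consistent with it.",
    "superset": "The submitted answer is a superset of the expert answer and is fully consistent with it.",
    "same": "The submitted answer contains all the same details as the expert answer.",
    "disagree": "There is a disagreement between the submitted answer and the expert answer.",
    "immaterial": "The answers differ, but these differences don't matter from the perspective of factuality.",
}

# Order in which the option texts appear in the prompt.
ORIGINAL_ORDER = ["subset", "superset", "same", "disagree", "immaterial"]


def build_options(order, labels="ABCDE"):
    """Return (options_text, label_to_meaning) for the given label assignment."""
    if sorted(order) != sorted(OPTIONS) or len(labels) != len(order):
        raise ValueError("order and labels must cover every option exactly once")
    lines = []
    label_to_meaning = {}
    for label, key in zip(labels, order):
        lines.append(f"({label}) {OPTIONS[key]}")
        label_to_meaning[label] = key
    return "\n".join(lines), label_to_meaning


# Labels "ABCED" reproduce the D/E swap: same text order, letters exchanged.
swapped_text, swapped_map = build_options(ORIGINAL_ORDER, labels="ABCED")
```

Translating the returned letter through `swapped_map` gives the meaning of the grade regardless of which letters were used, so runs with different label assignments can be compared directly.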

dereklei · January 26, 2024, 7:39am

Just responding to say that I’m getting the same thing here. Like the responders before me, I also asked ChatGPT for an explanation and got the same answer.

I’ve learned here not to rely fully on an LLM to do these reasoning checks… at least not ChatGPT 3.5… maybe it’s better in 4?

ramz · February 6, 2024, 6:44pm

I didn’t get a better response using the gpt-4 model. It’s not the latest version (it uses 0613), but it should still be better than 3.5. I have noticed the quality of ChatGPT decreasing over the last year, although this is just a gut feeling from using it. Perhaps that’s a factor, since the course was originally released months ago.
