Model output is converted to float, but compared to ints

As I don’t have access to the GitHub repo for this course, I will post the issue here:

The function extract_number extracts the last number from the model’s generated output and converts it to a float with the following snippet:
return float(numbers[-1])
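For context, here is a minimal sketch of what such a helper might look like. The regex pattern and the handling of empty matches are my assumptions; only the final float(...) line is quoted from the course code:

```python
import re

def extract_number(text):
    # Collect every integer or decimal in the text
    # (this regex is an assumption, not the course's actual pattern).
    numbers = re.findall(r"-?\d+\.?\d*", text)
    if not numbers:
        return None
    # This is the line from the course: the last match becomes a float.
    return float(numbers[-1])
```

Whatever the surrounding code looks like, the key point is that the return value is always a float.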

When the model’s output is compared to the ground truth stored in GSM8K, the check fails whenever the model’s float is compared to GSM8K’s int:

109, Model: 109.0, Correct: False

13, Model: 13.0, Correct: False
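The mismatch is easy to reproduce in isolation. Assuming the check compares the stringified values (the grader’s exact comparison is an assumption on my part):

```python
ground_truth = "109"    # answer as stored in GSM8K (assumed format)
model_answer = 109.0    # what extract_number returns

# Comparing the float's string form against the stored answer fails:
print(str(model_answer) == ground_truth)    # "109.0" vs "109"

# Comparing numerically succeeds:
print(float(ground_truth) == model_answer)
```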

It looks like scoring for exercise 3 is affected by this:

Exercise 3

10/20
Overall Score: 0.50 (0.5/1 exercises) Exercise Results: FAIL ex3: 0.50 - Passed 1/2 tests - Function should achieve high accuracy on mock data (all questions should be answered correctly) FAILED!

Maybe someone with access to the GitHub repo for this course could pass this on?

Thanks!

I don’t have access to that repo either, but I will ping the staff directly.


In exercise 3, when you implement evaluate_model_correctness, if you run extract_number on both answers, you will be comparing values of the same type.
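A sketch of that approach, with the function names taken from the thread; the internals and signatures are my assumptions:

```python
import re

def extract_number(text):
    # Assumed helper: grab the last number in the text as a float.
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return float(numbers[-1]) if numbers else None

def evaluate_model_correctness(model_output, ground_truth):
    # Run BOTH strings through extract_number so the final
    # comparison is float-to-float rather than float-to-int/str.
    model_num = extract_number(model_output)
    truth_num = extract_number(ground_truth)
    if model_num is None or truth_num is None:
        return False
    return model_num == truth_num
```

With this, "109" in the ground truth and "109.0" in the model output compare as equal.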

Yes, that’s one of a number of ways to make it work, but it conflicts with the statement that there should be ‘exact matches’, which would mean an int when the ground truth is an int. So the word ‘exact’ is misleading.

What can go wrong if you simply match text? Not all answers are integers, and exact string matching fails when different representations (like .1 and 0.1) represent the same value. Matching at the float level allows us to verify the answer regardless of formatting. While not perfect, this avoids complicated code while remaining effective for educational purposes.
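A quick illustration of that point (the values are chosen for illustration only):

```python
# Exact string matching treats these as different answers:
print(".1" == "0.1")

# Float-level matching recognizes them as the same value:
print(float(".1") == float("0.1"))
```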

I agree that comparing floats and ints is not a great practice.

Hi bong.seog.choi,

This is precisely my point. The learner is prompted in the following way:

"Exercise 3

In this exercise you will:

Generate responses for each problem
Extract numerical answers from both model output and ground truth
-> Compare the two answers for exact matches
Calculate overall accuracy"

Comparing the two answers for exact matches would imply that a learner uses sample[‘answer’] directly, rather than passing the ground truth through an additional function (extract_number) that converts it to a float.

It was confusing to me as a learner/tester. So I would suggest resolving this one way or another to avoid such confusion for other learners. But it’s not up to me to decide on this point.

I have submitted a GitHub issue for this, so we’ll eventually see what the staff thinks.

Thanks Tom!

I have raised the issue with the team. Thanks Reinoud.


Thanks for elaborating. I agree that the problem statement is not clear enough.