In the first lab of Module 1 (C2_M1_Lab1_tuning_and_metrics.ipynb) I encountered a behavior that is not a bug but can confuse people.
In this cell:
#CHECK YOUR IMPLEMENTATION
#model
model = SimpleCNN().to(device)
#dataloaders
train_dataloader, val_dataloader = helper_utils.get_dataset_dataloaders(batch_size=128)
accuracy, precision, recall, f1 = evaluate_metrics(model=model, val_dataloader=val_dataloader, device=device)
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
One might expect it to return the same results every time it’s run. However, it does not, because each time the cell is executed:
get_dataset_dataloaders loads 10,000 random images
a new model with randomly initialized parameters is created
Even when ‘evaluate_metrics’ is implemented correctly, the results can differ significantly from the ‘Expected Output’ (in my case, for example, a precision of 0.04 vs. the expected 0.01, or an F1 of 0.048 vs. the expected 0.018).
I would suggest setting a random seed in this cell to ensure reproducible behavior.
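To illustrate the suggestion, here is a minimal sketch of the idea using Python’s standard-library `random` module as a stand-in (in the actual notebook you would seed the framework’s RNG instead, e.g. `torch.manual_seed(...)`; the function name `seeded_eval` and the simulated values are purely illustrative, not part of the lab):

```python
import random

def seeded_eval(seed):
    # Stand-in for "seed, then build model and dataloaders, then evaluate":
    # re-seeding at the top of the cell pins down all downstream randomness.
    random.seed(seed)
    return [random.random() for _ in range(3)]  # simulated "metrics"

# Re-seeding inside the cell makes repeated runs reproducible:
run1 = seeded_eval(42)
run2 = seeded_eval(42)
assert run1 == run2

# Without re-seeding, consecutive draws from the RNG stream differ,
# mirroring what happens when the notebook cell is re-run on its own:
run3 = [random.random() for _ in range(3)]
run4 = [random.random() for _ in range(3)]
assert run3 != run4
```

The same pattern applies to PyTorch: calling the seed function at the top of the cell, before the model is constructed and the dataloader is created, would make that one cell reproducible in isolation.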
I remember the instructor mentioning that the results will not be the same for everyone running the code in the labs, so the expected output is probably just the values the developers got when they ran the code, not the only correct answer.
The random seed is in place, but in the first cell. So if you run all the cells up to this point in a single continuous sequence, you’ll get the same output as the expected output.
But if you re-run the data-loading cell without first (re)setting the random seed, your outputs are expected to differ, which, as you said, is not a bug.
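The run-everything-from-the-seed-cell behavior can be sketched with the standard-library `random` module (the cell names in the comments are hypothetical stand-ins for the notebook’s cells, and the drawn values simulate data loading and metrics):

```python
import random

# "First cell": set the seed once at the top of the notebook.
random.seed(0)

# "Data loading" and "evaluation" cells consume values from the RNG stream.
data = [random.random() for _ in range(2)]
metrics = [random.random() for _ in range(2)]

# Re-running a later cell alone advances the stream, so its output changes:
data_rerun = [random.random() for _ in range(2)]
assert data_rerun != data

# Running everything again from the seed cell reproduces the original values:
random.seed(0)
assert [random.random() for _ in range(2)] == data
```

This is why running all the cells in one continuous sequence matches the expected output, while re-running an individual cell does not.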
This is why it says “#### Expected Output (approximate values):”.
Part of the reason we don’t want definitive results every time is precisely to show that results can differ. Suppose I set the random seed in the way you are suggesting, so the results are identical on every run. Then, if someone ran this lab outside of the platform with different library versions, their “definitive” results would differ from ours, and they would be confused about why they aren’t getting the exact numbers we state.