I noticed that both subsets, "train_set" and "val_set", in the "get_data_loaders_with_validation" function (the "Training and Evaluation" section of the lab) end up with the same transform, "transform_test". The line "full_trainset = datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_train)" just stores the transform function, "random_split" does not copy the data, and the line "val_set.dataset.transform = transform_test" reassigns the transform on the shared "full_trainset" object, i.e. for both the "train_set" and "val_set" subsets (please see the attached screenshot).
Am I reading the code correctly? And if so, is that an intentional choice, or should different dataset objects be used (i.e. defined differently)?
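To make the concern concrete, here is a minimal stdlib-only sketch. The FakeCIFAR10 and FakeSubset classes are stand-ins I made up; they mirror the relevant behaviour of torchvision's dataset and torch.utils.data's Subset, which stores a reference to the parent dataset rather than copying it:

```python
class FakeCIFAR10:
    """Stand-in for the torchvision dataset: just holds a transform."""
    def __init__(self, transform):
        self.transform = transform

class FakeSubset:
    """Stand-in for torch.utils.data.Subset: keeps a REFERENCE to the parent."""
    def __init__(self, dataset, indices):
        self.dataset = dataset   # shared reference, no copy
        self.indices = indices

transform_train = "augmented"   # stands in for the augmentation pipeline
transform_test = "plain"        # stands in for the non-augmented pipeline

full_trainset = FakeCIFAR10(transform=transform_train)
train_set = FakeSubset(full_trainset, range(0, 45000))
val_set = FakeSubset(full_trainset, range(45000, 50000))

# The assignment from the lab:
val_set.dataset.transform = transform_test

# Both subsets now see the test transform, because .dataset is the same object.
print(train_set.dataset.transform)  # plain
print(val_set.dataset.transform)    # plain
```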
No @DAResaid: in one place transform_train is being used for train_set, while the validation set uses the test transform. The comment clearly says to apply the non-augmented test transform to the validation set, which is probably where train=False would be mentioned.
I am sorry, but I am not sure I understood your "at one place where transform_train is being used for train_set". At which "one place"? If you mean the line "full_trainset = datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_train)", which I mentioned in my message, then by my understanding it does not address my question/concern, as I explained above.
So I would like somebody to review the arguments in my initial message about a possible issue with the code, and either explain to me what's wrong with them (with my arguments) or let me know the reason for having the "train_set" and "val_set" subsets share the same transform in this lab.
Wow, the OOP waters are getting pretty deep here, and this is a very subtle point that took some sharp understanding to recognize. It will require either a) some very careful study of the APIs being used or b) instrumentation using the Python id() function to see whether the objects are actually the same. Or perhaps both. Well, I guess you could also try sampling the output of the train_set and see whether the augmentations are actually happening, but that may not be so easy to discern.
I will take a swing at the id() strategy and see if I can get anywhere with that.
Ok, yes, you called it. I added the following lines to that function to show the object IDs:
# Split the full training set into separate training and validation sets.
train_set, val_set = random_split(full_trainset, [train_size, val_size])
print(f"id(train_set.dataset) = {id(train_set.dataset)}")
print(f"id(val_set.dataset) = {id(val_set.dataset)}")
# Apply the non-augmented test transform to the validation set.
val_set.dataset.transform = transform_test
print(f"id(train_set.dataset.transform) = {id(train_set.dataset.transform)}")
print(f"id(val_set.dataset.transform) = {id(val_set.dataset.transform)}")
Of course, if the first two are equal, the second two are a foregone conclusion, but still. Here's what I see when I run things with that in place:
So you’re exactly right that the point is that both train_set and val_set point back at the original dataset, so any assignment you make to the dataset of one will affect both.
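One way to avoid the aliasing (a sketch under assumed names, not the lab's actual fix) is to build two dataset objects, one per transform, and index both with the same split indices. For real CIFAR-10 that would mean constructing datasets.CIFAR10 twice, once with transform_train and once with transform_test, and wrapping each in torch.utils.data.Subset with a shared index split. A stdlib-only stand-in:

```python
import random

class FakeDataset:
    """Stand-in for a torchvision dataset: applies its transform on access."""
    def __init__(self, data, transform):
        self.data = data
        self.transform = transform
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.transform(self.data[i])

data = list(range(10))
transform_train = lambda x: x * 100   # stands in for the augmented pipeline
transform_test = lambda x: x          # stands in for the plain pipeline

# Two views over the SAME underlying data, each with its OWN transform.
train_view = FakeDataset(data, transform_train)
val_view = FakeDataset(data, transform_test)

# One shared, shuffled split, so the two views partition the data consistently.
indices = list(range(len(data)))
random.Random(0).shuffle(indices)
train_idx, val_idx = indices[:8], indices[8:]

train_samples = [train_view[i] for i in train_idx]  # augmented
val_samples = [val_view[i] for i in val_idx]        # not augmented
```

Because neither view is reached through the other's .dataset attribute, reassigning one transform can no longer leak into the other split.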
It’s just a way deeper version of the point about python objects made on this ancient scroll recently excavated.
For anyone seeing this who hasn’t seen the id() function before, here’s what Gemini has to say:
The Python id() function returns a unique integer that serves as the identity of an object, which remains constant throughout that object’s lifetime. This identity is, in the most common Python implementation (CPython), effectively the object’s memory address, though it’s important to treat it as an opaque identifier rather than a direct memory pointer.
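A quick stdlib illustration of that point, contrasting a second name for an object with an actual copy:

```python
a = [1, 2, 3]
b = a          # second name for the SAME object
c = list(a)    # shallow copy: a new object with equal contents

print(id(a) == id(b))  # True  (same identity)
print(id(a) == id(c))  # False (the copy has its own identity)

b.append(4)
print(a)  # [1, 2, 3, 4] -- mutation through b is visible through a
print(c)  # [1, 2, 3]    -- the copy is unaffected
```

This is exactly the train_set.dataset / val_set.dataset situation: two names, one object.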
Now the question is to go back and look at the assignments/examples in C1. They play the same game with different transforms all over the place. Do they have the same bug(s)?
I have just completed C1 and did not notice any similar problem. Please look at the attached picture from C1_M3_Lab_data_management, where the subsets are handled properly.
All we know for certain is that the person who wrote that Markdown text in C1 was aware of the issue. That doesn't cast any light on how C2 was created.
Actually, I remember this exercise addresses data transformation in different ways. I recall that in the post creator's second screenshot a SubsetWithTransform class is used to attach a different transform to each set, instead of setting it separately on the shared dataset. Laurence in fact mentions it in the videos, which is probably why that header says data management.
But still, what @DAResaid is mentioning is noteworthy; at the very least, full_trainset could be given a better variable name.
Thanks all. I’m looking into this. And this is indeed an incorrect behaviour. I’m working on the changes now.
The author of both these notebooks (C2 M1 Lab 4 and C1 M3) was the same person, but this, if memory serves me right, was caught in C1, hence the changes/clarifications were made there.