Summary:
There is an inconsistency between the expected parameter count and the implementation of LayerNorm in the Decoder. When LayerNorm is defined inside __init__ (the preferred PyTorch practice), the total number of trainable parameters does not match the count expected by the unit tests. However, when LayerNorm is instantiated directly inside the forward method, the parameter count matches and the unit tests pass, but this leads to a runtime device error during training.
Details
Approach 1 (Preferred / Standard)
```python
self.ln_final = nn.LayerNorm(d_model)
...
output = self.output_projection(self.ln_final(decoded))
```
- Correct PyTorch pattern (registered module)
- Parameter count does not match expected value
- Unit tests fail due to mismatch
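For reference, the parameter delta from the registered LayerNorm is easy to verify in isolation. This is a minimal sketch, assuming a hypothetical d_model of 64: a registered nn.LayerNorm adds exactly 2 * d_model trainable parameters.

```python
import torch.nn as nn

d_model = 64  # hypothetical model width used only for this check

# A LayerNorm registered as a submodule contributes its learnable affine
# parameters gamma (weight) and beta (bias): 2 * d_model values in total.
ln = nn.LayerNorm(d_model)
extra_params = sum(p.numel() for p in ln.parameters() if p.requires_grad)
print(extra_params)  # 128, i.e. 2 * d_model
```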
Approach 2 (Inline in forward)
```python
output = self.output_projection(nn.LayerNorm(self.d_model)(decoded))
```
- Parameter count matches expected value
- Unit tests pass
- Fails later during training with a device-mismatch error
Observed Behavior
- Defining LayerNorm in __init__ increases the parameter count (as expected, due to the learnable affine parameters), but conflicts with the count the unit tests expect.
- Defining it in forward avoids registration, so its parameters are not counted. This aligns with the tests but breaks training with a device-mismatch error at runtime.
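The second behavior can be demonstrated with a minimal sketch (the class name and width of 8 are hypothetical): a module constructed inside forward is created fresh on the CPU on every call and never shows up in parameters().

```python
import torch
import torch.nn as nn

class InlineNormDecoder(nn.Module):
    """Hypothetical sketch of the inline pattern described above."""

    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model

    def forward(self, x):
        # A LayerNorm built here is created on the CPU on every call and is
        # never registered, so parameters() (and any test counting them)
        # cannot see it; with a GPU input, the affine multiply would raise
        # a device-mismatch RuntimeError.
        return nn.LayerNorm(self.d_model)(x)

dec = InlineNormDecoder(8)
y = dec(torch.randn(3, 8))           # works on CPU
print(len(list(dec.parameters())))   # 0: nothing was registered
```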
Questions / Clarification Needed
- Is the expected parameter count in the unit tests intentionally excluding the LayerNorm parameters?
- Should LayerNorm be:
  - included but configured differently (e.g., elementwise_affine=False)?
  - omitted entirely from this architecture?
- Is there a specific architectural constraint in this assignment that differs from standard Transformer decoder implementations?
- The correct way to implement normalization is in __init__, yet that version fails the unit tests. Is a correction to the tests needed, or is more debugging needed on my side?
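One of the options raised above can be checked directly. This sketch, with a hypothetical d_model of 32, compares the parameter counts of a default LayerNorm and one with elementwise_affine=False.

```python
import torch.nn as nn

d_model = 32  # hypothetical width

with_affine = nn.LayerNorm(d_model)  # learnable gamma and beta
without_affine = nn.LayerNorm(d_model, elementwise_affine=False)  # no learnables

n_with = sum(p.numel() for p in with_affine.parameters())
n_without = sum(p.numel() for p in without_affine.parameters())
print(n_with, n_without)  # 64 0
```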
Have you selected the right course and module number for the topic you created?
When I checked the C1 M3 assignment, the Exercise 2 unit test isn't about a decoder, so be careful when you post your query. If you are confused or unsure about selecting the right category, mention the assignment name and course name in the description.
As far as I remember, such a failed test can happen if one of the previous graded functions had a global/local variable conflict.
Also check whether you are assigning parameters by variable name, i.e. parameter name = variable name; that way you assign values to the parameters from the graded function's exercise rather than from global variables.
Regards
Dr. Deepti
As Dr. Deepti says, we’re not really sure what assignment you are referring to here. But here are a couple of general comments:
That error has nothing to do with parameter counts; it is a device mismatch. You need to make sure that all PyTorch operations involving multiple operands have those operands on the same device.
In the latter case, where the parameter counts don't match, the general point is that if you pass one test but then fail a different test, the first thing to look for is some way in which your code is not general. For example, you might be hard-coding some of the dimensions or referencing global variables instead of the formal parameters of your functions.
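The device-alignment point above can be sketched as follows. The names are illustrative, and the sketch falls back to the CPU when no GPU is available: moving the module and its inputs to one device before any op mixes them avoids the mismatch.

```python
import torch
import torch.nn as nn

# Illustrative names; falls back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.LayerNorm(16).to(device)  # parameters now live on `device`
x = torch.randn(4, 16).to(device)    # inputs moved to the same device
y = model(x)                         # both operands agree, so no error
print(y.device == x.device)  # True
```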
Thank you, Dr. Deepti, for the quick response.
This is from the C3_M3 assignment in the course pytorch-for-deep-learning-professional-certificate. Apologies, I mapped module 3 to the wrong week's category.
The way the notebook and unit tests are structured, there don't appear to be any global variable overlaps.
However, I tried executing exercise_2 again without defining the Encoder from exercise_1 (which does have a few similar parameter names), but I still see the parameter count difference.
I was able to proceed with the inline definition of nn.LayerNorm(self.d_model) for submission.
And per @paulinpaloalto's suggestion in a subsequent comment, I could also train the model by fixing the inline device with nn.LayerNorm(self.d_model).to(decoded.device).
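For reference, that workaround can be sketched as follows; the `decoded` tensor here is a hypothetical stand-in for the decoder output. Note that this silences the device error but still builds a fresh, untracked LayerNorm on every forward pass, so its affine parameters are never trained.

```python
import torch
import torch.nn as nn

# `decoded` is a hypothetical stand-in for the decoder output. The .to()
# call avoids the device-mismatch error, but a fresh LayerNorm (with new,
# untracked gamma/beta) is still created on every forward pass, so the
# optimizer never updates it.
decoded = torch.randn(2, 5, 8)
out = nn.LayerNorm(decoded.shape[-1]).to(decoded.device)(decoded)
print(out.shape)  # torch.Size([2, 5, 8])
```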
It is still worth debugging where the mismatch from __init__'s LayerNorm comes from. I'll pick it back up once I finish this module.
Thank you!
@parulgaba
You can edit your topic and move it to the right course category. If you can't, let me know and I will move it.
As for code debugging, if you are still stuck, then DM a code screenshot to either me or Paul for review.
Regards
Dr. Deepti
Hello @Deepti_Prasad and @paulinpaloalto,
Thank you for your help earlier with this module. I was able to proceed with the exercise and assignment, but I was still curious to close the loop on it.
I've revisited the parameter mismatch issue in the Decoder module today. I found that defining the final LayerNorm in __init__ registers its learnable affine parameters γ and β, which increases the parameter count beyond what the unit test expects.
Interestingly, when nn.LayerNorm is called inside the forward block, the test passes, but the validation loss stays high (~1.419). When the LayerNorm is properly registered in __init__, the loss drops to ~1.259.
It appears the anonymous declaration prevents the optimizer from updating the normalization parameters, because the layer isn't registered with the Decoder module. Does the exercise intend for us to omit this final LayerNorm to match the parameter count, or is the test designed for a specific architecture that excludes these learnable parameters? I think it is the former, but please confirm.
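The optimizer point can be illustrated with a minimal sketch (the class name and the width of 8 are hypothetical): an optimizer built from model.parameters() only tracks parameters that were registered in __init__.

```python
import torch
import torch.nn as nn

class RegisteredNorm(nn.Module):
    """Minimal sketch: the name and width of 8 are hypothetical."""

    def __init__(self, d_model):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)  # registered, so the optimizer sees it

    def forward(self, x):
        return self.ln(x)

model = RegisteredNorm(8)
opt = torch.optim.Adam(model.parameters())
# Adam tracks gamma and beta (2 * 8 = 16 values); a LayerNorm constructed
# inside forward would contribute nothing to this count.
tracked = sum(p.numel() for g in opt.param_groups for p in g["params"])
print(tracked)  # 16
```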
Best,
Parul