C2_M3_Lab_2_embeddings - Suspected bug in the code

Hello,

There is a suspicious ordering of statements in the training_loop function of the “Training Embeddings with a Simple Model” segment of the second lab of the module (please see the attached screenshot below):

Usually the loss is accumulated before(!) the gradients are computed and the weights updated, as you can see in the widely used regular training_loop function from helper_utils (please see the attached screenshot below):

Please let me know whether I am missing something, or whether the order of the statements is valid in this particular case.

Thank you!

It appears to me that “epoch_loss” is a local variable inside that function, and that it is not used by backpropagation or the gradient calculation.

So the order of that statement should not matter.

Technically, computing the gradient does not actually use the loss value - it only uses the mathematical equation for the gradients. That equation does not use the real-time value of the loss.
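
A minimal sketch of this point (a made-up tiny model, not the lab's actual code): epoch_loss accumulates a plain Python float via loss.item(), which is detached from the autograd graph, so accumulating before or after loss.backward() gives the same number.

```python
import torch
import torch.nn as nn

# Tiny stand-in model and data, just to illustrate the ordering question.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(8, 4)
y = torch.randn(8, 1)

epoch_loss = 0.0
loss = criterion(model(x), y)

# Variant A: accumulate BEFORE the backward pass.
epoch_loss_before = epoch_loss + loss.item()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Variant B: accumulate AFTER the backward pass.
epoch_loss_after = epoch_loss + loss.item()

# loss.item() is a detached Python float; backward() does not change it,
# so both variants accumulate exactly the same value.
assert epoch_loss_before == epoch_loss_after
```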

1 Like

Note to other mentors/staff:

I’m leaving this thread visible because it does not appear to discuss the graded portion of an assignment.

Got it! Thank you for the clarification, Tom! I am still learning the intrinsic relationship between losses and gradients, and I wanted to be sure that the “loss.backward()” step did not distort the accumulated loss calculation (“epoch_loss”).

But I have a short follow-up question about your statement: “That equation does not use the real-time value of the loss”. I thought gradient updates were based on the real-time values of the loss (that they define “the direction” of the gradient updates), but you are saying they are not, right? So please help me resolve this misunderstanding.

Sorry to interrupt. Please see this thread as well:

1 Like

No, it’s based on the gradients - not directly the loss.

The mathematical loss equation is used to derive the mathematical equation for the gradients. During training, to compute the gradients you only need the data set (the values of the features in each example), the weights and biases, and the true outputs and the predicted outputs.

During training, the loss is only used to provide a way to verify if the solution is converging to a minimum. Typically this would be via a loss history plot that gives the loss at each iteration.

Hm, it’s either some terminology misunderstanding/confusion, or the Internet AI’s (LLM’s) brain completely disagrees with you:

Are the real time loss values used in calculating gradients during model training backpropagation step in PyTorch?

“Yes, the real-time loss values are used directly to calculate gradients during the backpropagation step of model training in PyTorch. This is a fundamental aspect of how gradient-based optimization algorithms like Stochastic Gradient Descent (SGD) work.

Backward Pass (Backpropagation): The .backward() method is called on the calculated loss tensor. PyTorch’s autograd engine then automatically differentiates the loss with respect to all model parameters that have requires_grad=True. This process calculates the gradients for every single parameter.”

And BTW regarding my initial question/concern (again thanks to AI(LLMs)):
“Calculating the accumulated loss value is recommended before(!) calling loss.backward() because the backward pass computes and accumulates gradients in the leaf tensors (parameters). After loss.backward() is called, the loss tensor’s graph is no longer intact or usable for further gradient computation unless explicitly retained.
The order of operations is crucial in the training loop for the following reasons:

  1. loss.backward()
    The backward() method computes the gradient of the current loss with respect to all the model parameters that require gradients. This operation relies on a computational graph that connects the loss value back to the inputs and parameters.
    Before: You can access the raw loss tensor to extract its scalar value (e.g., total_loss += loss.item()) because the graph structure is still defined.
    After: By default, the computational graph used to compute gradients is freed immediately after the call to loss.backward() to save memory. Therefore, trying to access or operate on the original loss tensor’s value afterward for accumulation might fail or yield unexpected results if you haven’t explicitly set retain_graph=True in the backward call.
  2. …”

Well, the question really is whether loss.backward() modifies the value of loss.item(). There’s a pretty easy way to experimentally determine that, right?

I agree with Tom that computing the gradient should just be evaluating a different function (the derivative of the loss) at the input value. Suppose our function is:

f(z) = z^2

Then we don’t need f(z) to compute:

f'(z) = 2z

right? Not to go all “math” on you :laughing:, but after all that is what we’re doing here.
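
Torch’s autograd makes this concrete (a tiny sketch, nothing from the lab): the gradient of z^2 at z = 3 is 6, computed from the graph without ever reading the value f(3) = 9.

```python
import torch

# f(z) = z^2, evaluated at z = 3.
z = torch.tensor(3.0, requires_grad=True)
f = z ** 2

# Backpropagate without ever inspecting f's scalar value.
f.backward()

# f'(z) = 2z, so the gradient at z = 3 is 6.0. The numeric value of f
# was never needed to get here.
assert z.grad.item() == 6.0
```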

Now there is another level of subtlety here: there are cases in which torch (or TF or whatever other platform you are using) does not actually have the derivative as a defined function. In that case, it does “autodiff” to compute the gradients using techniques based on finite differences. In that case, the actual values of the loss would be used in the calculations.

But in almost every case, the loss functions we use are well known and provided by the platform, right? You can define your own and I confess I’ve never really looked at the torch APIs for doing that, so I don’t know if you are required to also provide the derivative function. But there again, let’s run the experiment and see if loss.backward() actually modifies loss.item().

loss_item_before = loss.item()
loss.backward()
loss_item_after = loss.item()
assert loss_item_before == loss_item_after

That will either explode and catch fire or it won’t, right? :nerd_face:

To be fair, I grant that if it does not “throw”, that doesn’t really prove anything in the fully general case. But it’s an easy and accessible experiment to run and the answer will either be completely definitive (if it throws) or at least suggestive (if it doesn’t).
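
For anyone who wants to run the experiment without the assignment notebook, here is a self-contained version (with a made-up tiny model, not the course's code):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # reproducible, though the result holds for any seed

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)

loss = criterion(model(x), y)

loss_item_before = loss.item()
loss.backward()
loss_item_after = loss.item()

# No explosion, no fire: backward() frees the graph's intermediate
# buffers, but it leaves the loss tensor's stored value untouched.
assert loss_item_before == loss_item_after
```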

1 Like

I have dug deeper and clarified that the order of statements actually does NOT matter in this case (the loss can be accumulated at either of the mentioned places), so I am withdrawing my initial question/concern.

With my follow-up question I mostly confused myself by remembering how derivatives were programmatically calculated back in my day (Paul reflected this way of thinking in his message; see the “autodiff” sentence). Thank you.

Thank you everybody for helping me to resolve my confusion. And I am sorry for the false alarm.

1 Like

No worries! Glad that we could help.

There is another important “meta” lesson here: outsourcing our reasoning to LLMs on subjects like math is not a good idea at least at this point in early 2026. They certainly sound like they know what they are doing, but it’s important to keep in mind that there is no real “knowledge” there. Just pattern recognition. That’s certainly a part of intelligence, but definitely not the whole enchilada. :nerd_face:

1 Like

Also, just to follow up: I tried adding the “loss before/loss after” check to the training logic in C1 W4 A1. As I was expecting, it runs just fine. So the actual evidence agrees with our intuition that running backpropagation does not modify the loss value, at least in that one case.

1 Like

No, no! I have to defend them. The citations I shared above are from “Internet AI” (I used the word “Internet” intentionally in my message above), i.e. they came from a quick browser search. If one asks actual AI chatbots (like ChatGPT, Gemini, Claude), one gets a reliable answer with a good and detailed explanation. I am sorry that I tried to shortcut and used a simple Google search instead of a more robust investigation of this topic.

I did it too :wink:

1 Like

Interesting. But I thought Google search is just Gemini these days …

Thanks for the additional information about how you ran your query, but I stand by my claim that (at least as of early 2026) LLMs are not reliable at math.

It’s worth following Gary Marcus on Substack to keep track of the “state of play” with LLMs.

2 Likes

Not for the sake of argument, but simply for fun: these guys have been better at math than me (… even with my double master’s degree in Physics and Math) for quite some time now: AI achieves silver-medal standard solving International Mathematical Olympiad problems - Google DeepMind

Yes, I have read that article as well, and the work is incredibly impressive. But note that AlphaProof and AlphaGeometry are not LLMs (chatbots). They include LLMs to handle some things, but the core logic that actually solves problems is a completely different architecture than a Transformer-based LLM. The article does give some description of the architecture of the whole system that they put together to do this.

Google doesn’t get enough credit in the AI space, but they are actually doing way more sophisticated things than the “pure LLM” shops. Maybe their previous iterations of chatbots have not been as impressive as gpt-n for various values of n, but they have solved some other important problems. AlphaFold is another great example.

I also have a masters in math, but that was quite a long time ago. :nerd_face:

2 Likes