In the first expression, a * b is int32 multiplication. In the second expression, a * b is float32 multiplication. I'm guessing that the type differences are messing up the gradients. From my understanding, the gradient tape watches operations as they execute, so the order of operations and casting matters for backprop.

I didn't do a deep dive, but the documentation on TensorFlow type promotion mentions dunder (double underscore) operations where the math goes wrong due to bit-widening.
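As a rough illustration of that dunder-dispatch point (the array names below are mine, not from the docs, and this uses plain NumPy rather than TensorFlow), which operand's `__mul__`/`__rmul__` handles the operation determines the result dtype, and mixing integer and float types can silently widen:

```python
import numpy as np

# Illustrative sketch: dunder dispatch decides the result dtype.
f32 = np.array([1.0, 2.0], dtype=np.float32)
i32 = np.array([3, 4], dtype=np.int32)

# A Python int defers to the array's __rmul__, so float32 is preserved.
print((2 * f32).dtype)    # float32

# int32 values can't all be represented exactly in float32,
# so NumPy widens the result to float64.
print((i32 * f32).dtype)  # float64
```

TensorFlow historically refused such mixed-dtype ops outright instead of widening, which is why explicit `tf.cast` calls (and where you put them) matter so much in these exercises.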

Did you read my example thread (linked several times earlier in this thread) where I showed some of the possible ways to get errors here?

Yes! That thread helped me get to a successful submission for my assignment.

Thanks for adding the point about the potential effect on the gradients. I had not thought of that. Note that none of the variables in question here are mutable, so they are not directly affected by backprop, but they would be factors in the computation. You can see in my example thread that I used NumPy or plain Python for the integer arithmetic pieces in some of the formulations and it all still works fine. Normally, if you insert a NumPy operation anywhere in the compute graph that matters, it "throws" in an obvious way at `tape.gradient` time. E.g. even if you do something as simple as using `np.transpose` where you should use `tf.transpose`, it will fail. Here's an example of that from DLS C5, which also points to another case in DLS C2 W3.
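Here is a minimal sketch of that failure mode (the variable name `W` and the toy loss are mine, not from the assignment). Depending on the op and TF version, the symptom is either an exception or a silently disconnected gradient (`None`), which then blows up when you try to apply it:

```python
import numpy as np
import tensorflow as tf

W = tf.Variable([[1.0, 2.0], [3.0, 4.0]])

with tf.GradientTape() as tape:
    # np.transpose converts W to a plain NumPy array, so the tape
    # loses track of it; tf.transpose would keep it in the graph.
    loss = tf.reduce_sum(np.transpose(W) * 2.0)

grad = tape.gradient(loss, W)
print(grad)  # None: the NumPy detour broke the gradient path
```

Swapping `np.transpose` for `tf.transpose` restores the connection and gives the expected gradient of all 2s.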