OK, this actually helps show where your code might be going wrong, and the link I shared had the same issue.
Two issues are clearly visible, and they are mentioned in the link I shared earlier. First, don’t pass the training parameter when passing information from the block layers to multi-head attention. Read the instructions again carefully: training=training is only supposed to be used in one place.
Next, your prediction ID code is incorrectly sequenced: you never apply the model first and then give it the output and input. Any model call should always follow the input, output, model generalization rule.
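To illustrate the first point about training=training, here is a toy sketch in plain Python, not the assignment code. All the names (toy_attention, toy_dropout, toy_block) are made up for illustration, and I am assuming the one place that needs the flag is a dropout-style layer, so check your notebook instructions for the exact line. It only shows the idea that the attention call gets no training flag while the mode-dependent layer does.

```python
import random


def toy_attention(q, k, v):
    # Hypothetical stand-in for multi-head attention: averages its inputs.
    # Note that it takes no `training` argument at all.
    return [(a + b + c) / 3.0 for a, b, c in zip(q, k, v)]


def toy_dropout(x, rate, training, rng):
    # Dropout is the layer whose behaviour depends on the mode:
    # identity at inference, random zeroing plus rescaling while training.
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def toy_block(x, training, rng=None):
    rng = rng or random.Random(0)
    # Attention is called WITHOUT a training flag.
    attn = toy_attention(x, x, x)
    # Assumed to be the one place where training=training belongs.
    dropped = toy_dropout(attn, 0.5, training=training, rng=rng)
    # Residual connection: add the block input back.
    return [xi + di for xi, di in zip(x, dropped)]


x = [1.0, 2.0, 3.0]
inference_out = toy_block(x, training=False)
training_out = toy_block(x, training=True)
print(inference_out)
print(training_out)
```

With training=False the block is deterministic, which is what you want when generating text. With training=True each value is either dropped or rescaled, so the output changes from run to run unless the random generator is seeded.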
The post comment from the same link I shared earlier should address the issue.
But I suspect you might have more errors, so let us know. Please take your time and start again from exercise 1: read every instruction and see what you might be missing in the code you wrote. Remember that the unit tests don’t always catch all the variability in text generation; the code might be right, but the way it is implemented might be incorrect, causing errors in later exercises.