I just discovered that the metric used in model.fit(...), namely metrics=['accuracy'], may not be right for this setup. The reason is that the token inputs and labels are both padded, and a "dummy" target label of -100 is used wherever there was a [PAD] token. I believe the custom loss inside the model detects these -100 labels and excludes them from the cost calculation (thus stopping any gradient backprop through them). But the metric is just the vanilla tf.keras accuracy, and I can't find any indication that it is "smart" enough to ignore those -100 labels during its computation.
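To make the idea concrete, here is a minimal sketch of what a padding-aware accuracy metric could look like. It assumes (as described above) that -100 marks padded positions in the labels, and that the model outputs per-token logits of shape (batch, seq_len, num_classes); the function name masked_accuracy is just my own, not something from the course notebook.

```python
import tensorflow as tf

def masked_accuracy(y_true, y_pred):
    # Predicted class per token (y_pred assumed to be per-token logits/probs)
    preds = tf.argmax(y_pred, axis=-1)
    y_true = tf.cast(y_true, preds.dtype)

    # Only count positions whose label is not the -100 padding sentinel
    mask = tf.cast(tf.not_equal(y_true, -100), tf.float32)
    matches = tf.cast(tf.equal(preds, y_true), tf.float32) * mask

    # Average correctness over the real (non-padded) tokens only
    return tf.reduce_sum(matches) / tf.maximum(tf.reduce_sum(mask), 1.0)
```

If something like this is right, it could be passed in place of the string metric, e.g. model.compile(optimizer=..., loss=..., metrics=[masked_accuracy]), so that padded positions no longer inflate or deflate the reported accuracy.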
Note that while the sentences in the Resume dataset are all pretty long, so there may not be enough -100s for the problem to be noticeable, I did try this out on my own dataset of much shorter sentences and saw poor reported accuracy even on a very tiny training set. I have come to suspect that the incorrect accuracy calculation is a big part of this.
I haven't completely debugged this yet, but thought I'd post it so the mentors and course instructors can help confirm whether this is indeed a mistake.