Near the end of this lab, where we visualize the attention as a heat map, I got the plot below. Part of it does not make intuitive sense to me: the characters in “Tuesday” appear to have a big impact on the “-10” part of the output sequence. But the output does not include the day of the week, and a similar example visualization earlier in the lab shows the day of the week receiving no attention (influence) at all for any part of the output. Thinking creatively, perhaps this network is deeper than I thought and can use the day of the week as a clue to the correct year, month, or day of month when those are otherwise ambiguous. (A specific date can, by definition, only fall on one specific day of the week.) However, in that case I’d expect the earlier plot to show this influence too. (I’d also expect the influence to apply to the day of month and the year, not just to the month.) I’m including that earlier plot further below. Any ideas?
Thanks @Elemento! Indeed, that thread raises mostly the same issues. I see that plot thresholding is at least a partial factor… but there are a few important details that I did not see addressed:
Does the data depicted in each plot row need to sum to 1? Based on the lecture notes, the sum over t’ of alpha^<t,t’> equals 1 for any given t. If so, summing any row from left to right should give 1, and the plot seems to be at odds with this rule: the bottom row (for ‘9’) has very little dark blue in it, while the fifth row (for ‘-’) is almost entirely dark blue. Or is clipping interfering with my ability to judge the plot visually?
I understand that the “Tuesday” letters influence the first dash, but it is not intuitive to me why they do not influence the other output values. Any ideas on this?
Yes, you are correct about this. If the plot threshold is set high, the plot shows only part of the attention map, namely the weights that are above the threshold. But this doesn’t contradict the fact that the attention weights in each row must sum to 1.
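To make the point about row sums and the color scale concrete, here is a minimal sketch in plain Python. It is not the lab's code: the alignment scores are made up, and `VMAX` is an assumed upper end of the heat map's color range. It only illustrates how a row with evenly spread attention can look entirely dark while a sharply peaked row looks mostly pale, even though both rows sum to 1.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment scores for two output characters over five input characters.
scores = {
    "-": [2.0, 1.9, 2.1, 2.0, 1.8],  # attention spread almost evenly
    "9": [0.1, 0.2, 6.0, 0.1, 0.0],  # attention concentrated on one input
}

attention = {ch: softmax(row) for ch, row in scores.items()}
row_sums = {ch: sum(row) for ch, row in attention.items()}

# Assumed upper end of the color scale: every cell at or above it
# is drawn in the same darkest color.
VMAX = 0.15
saturated = {ch: sum(1 for a in row if a >= VMAX) for ch, row in attention.items()}

for ch in attention:
    print(ch, "row sum:", round(row_sums[ch], 6), "darkest cells:", saturated[ch])
```

In this toy setup, all five cells of the ‘-’ row reach the darkest color and only one cell of the ‘9’ row does, yet both rows sum to 1. So the amount of dark blue in a row says how the weight is distributed relative to the color range, not how much total weight the row has.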
As for this, it is hard to say, since the model gives no reasoning for how it transforms the date from one format to another. We can only interpret the parts that overlap with our intuition; it’s hard to explain the reasoning behind the rest. I hope this helps.