The training dataset that we have here is very small for a problem this difficult. The result is that we get a model that happens to work surprisingly well on the specific dataset here, but does not “generalize” very well to arbitrary inputs.
Here’s a thread which discusses this in a bit more detail and shows some experiments with rearranging the training and test data.