Yes, the dataset we have here is very small for a problem as complex as this, so we don’t get a model that “generalizes” very well at all. It was kept small deliberately, because of the limitations of the online notebook environment: the training has to stay within a tolerable CPU budget.
In fact, you can turn the question around: why does it even work as well as it does with a mere 209 training samples? It turns out the dataset is very carefully curated to get halfway decent results. Here’s a thread which runs some experiments to show that.
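
In the same spirit, here’s a minimal sketch of that kind of experiment: pool the curated train and test sets, re-split them at random, and retrain to see how much the test accuracy moves between splits. It uses plain scikit-learn logistic regression as a stand-in for the course’s model, and the file names and HDF5 keys are my assumptions about the standard catvnoncat dataset layout, so adjust them to match your copy:

```python
import h5py
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the curated train and test sets (assumed file names and keys).
with h5py.File("datasets/train_catvnoncat.h5", "r") as f:
    X_train = np.array(f["train_set_x"])   # shape (209, 64, 64, 3)
    y_train = np.array(f["train_set_y"])   # shape (209,)
with h5py.File("datasets/test_catvnoncat.h5", "r") as f:
    X_test = np.array(f["test_set_x"])     # shape (50, 64, 64, 3)
    y_test = np.array(f["test_set_y"])     # shape (50,)

# Pool everything, flatten images, and scale pixel values to [0, 1].
n = len(X_train) + len(X_test)
X = np.concatenate([X_train, X_test]).reshape(n, -1) / 255.0
y = np.concatenate([y_train, y_test])

# Re-split at random instead of using the hand-picked split,
# keeping the same 209-sample training-set size each time.
rng = np.random.default_rng(0)
accs = []
for _ in range(10):
    perm = rng.permutation(n)
    tr, te = perm[:209], perm[209:]
    clf = LogisticRegression(max_iter=2000).fit(X[tr], y[tr])
    accs.append(clf.score(X[te], y[te]))

print(f"test accuracy over 10 random re-splits: "
      f"mean={np.mean(accs):.2f}, min={min(accs):.2f}, max={max(accs):.2f}")
```

If the accuracy swings a lot from one random re-split to the next, that’s a sign the hand-picked split is doing real work, which is exactly the point about curation above.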