Shuffling the data may be a good idea, but why here?
Furthermore, BUFFER_SIZE is 10000 but the dataset contains 25000 items: since 25000 > 10000, train_dataset is shuffled through a buffer that holds only 10000 items at a time, so the shuffle is not uniform over the whole dataset. This is useful in some situations, but I don’t understand how it is useful here…
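For reference, here is a toy sketch of what I mean (using a small `tf.data.Dataset.range` stand-in, not the actual train_dataset):

```python
import tensorflow as tf

# Toy stand-in for the notebook's train_dataset: 25 elements, buffer of 5.
# shuffle() keeps a buffer of `buffer_size` elements and draws uniformly
# from it, so a buffer smaller than the dataset only mixes nearby elements.
ds = tf.data.Dataset.range(25)

print(list(ds.shuffle(buffer_size=5).as_numpy_iterator()))   # early items stay near the front
print(list(ds.shuffle(buffer_size=25).as_numpy_iterator()))  # buffer >= dataset size: uniform shuffle
```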
It’s always good to shuffle the underlying dataset so that the label ordering is randomized. If you train with a custom dataset that happens to be ordered by label, you can reuse the rest of the same notebook by changing just the data source.
You are right that a 10K buffer size is small for a 25K dataset, as indicated here as well. This was just an example of using the shuffle API.
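If you want a complete shuffle, a minimal sketch would be to size the buffer to the dataset itself (here `train_dataset` is a placeholder for the notebook's real dataset, and this assumes its cardinality is known):

```python
import tensorflow as tf

# Placeholder standing in for the notebook's real train_dataset.
train_dataset = tf.data.Dataset.range(25000)

# Size the buffer to the full dataset so every element can end up anywhere.
# Note: cardinality() may be UNKNOWN for e.g. generator-backed datasets,
# in which case you would hard-code the dataset size instead.
BUFFER_SIZE = int(train_dataset.cardinality())

train_dataset = train_dataset.shuffle(
    BUFFER_SIZE,
    reshuffle_each_iteration=True,  # reshuffle between epochs (the default)
)
```

The trade-off is memory: the buffer holds BUFFER_SIZE elements at once, which is why tutorials often pick a smaller value than the dataset size.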