Question about Week 1 Exercise 6: length_of_training_dataset

In Exercise 6, we need to get the lengths of training_dataset and validation_dataset. The instructions already say that calling len() on these datasets won't work. What API should we use to get their lengths? I tried len(list(training_dataset)), but it took a long time and eventually ran out of memory. Another approach I can think of is to keep the info object returned by tfds.load() and use info.splits['train'].num_examples. But since info is discarded in get_visualization_training_dataset(), that doesn't seem like a good solution either. Any suggestions on how to get the length of training_dataset?
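
For reference, here is a minimal sketch of the info.splits approach I mentioned. The dataset name "caltech_birds2010" is just a placeholder, not necessarily what the assignment uses:

```python
import tensorflow_datasets as tfds

# with_info=True keeps the metadata object alongside the dataset.
# "caltech_birds2010" is a placeholder name for illustration only.
dataset, info = tfds.load("caltech_birds2010", split="train", with_info=True)

# num_examples is read from the dataset's metadata, so it returns
# instantly without iterating over the actual data.
print(info.splits["train"].num_examples)
```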

The solution is pointed out to you in section 4.1; if you read it carefully, it tells you exactly what to do.

Hi,
Thanks. Yes, I tried len(visualization_training_dataset) and it works. Now I'm curious why len(visualization_training_dataset) works but len(training_dataset) doesn't.

I added print(len(dataset)) after each step in get_training_data(dataset) and found that the length can no longer be obtained after dataset.repeat(). I guess that's because the dataset behaves as if it were infinite once it repeats.
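
A quick way to confirm this on a toy dataset (this snippet is illustrative, not from the assignment):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(len(ds))  # 10 -- the cardinality is known and finite

repeated = ds.repeat()  # repeat() with no count repeats indefinitely
print(repeated.cardinality() == tf.data.INFINITE_CARDINALITY)  # True

try:
    len(repeated)  # raises TypeError because the length is infinite
except TypeError as e:
    print(e)
```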

Another question I have: for the training dataset we call dataset.repeat() before dataset.batch(), but for the validation dataset we batch first. Is there a particular reason for this?

Thanks.

I think so too; once repeat() is applied in that function, the dataset no longer has a finite length.

I think the order in which those are applied doesn't change much, but you can try swapping them and see whether you notice any difference in the output.

Ultimately it depends on how those tf.data transformations reshape and output the data; to know for sure, one has to go through their implementation (it can be found on GitHub).
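
For what it's worth, a small experiment on toy data (not the assignment's pipeline) shows the order does matter at epoch boundaries: batch().repeat() ends each epoch with a partial batch, while repeat().batch() lets batches span epochs:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)

# Batch first, then repeat: each epoch ends with a partial batch of 1.
for b in ds.batch(2).repeat(2):
    print(b.numpy())
# [0 1] [2 3] [4] [0 1] [2 3] [4]

# Repeat first, then batch: batches can cross epoch boundaries.
for b in ds.repeat(2).batch(2):
    print(b.numpy())
# [0 1] [2 3] [4 0] [1 2] [3 4]
```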
