[C2_W1_Lab02_CoffeeRoasting_TF] questions about copy/tile

How does tiling/copying the data reduce the number of training epochs? And why can't we just copy/tile the data whenever we are overfitting a model? Also, is tiling/copying our data a good way to increase the training set size in models other than neural networks?


Hey @Elvis_Lok,
Welcome to the community. I am assuming we are clear on what the np.tile() function does; still, if you want to know more about it, you can check this out. It basically copies our data the given number of times.
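
For anyone who hasn't played with it, here is a minimal sketch of np.tile on a tiny made-up array:

```python
import numpy as np

# A tiny made-up dataset: 2 examples, 2 features each
X = np.array([[200.0, 13.9],
              [150.0, 12.0]])

# Repeat the rows 3 times along axis 0, keeping the columns as they are
Xt = np.tile(X, (3, 1))

print(Xt.shape)   # (6, 2): the same 2 examples, stacked 3 times
```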

Now, the above question is indeed a nice one, since it stumped me for a minute as well: copying the data can't autonomously reduce the number of epochs, because the number of epochs is set by us, right?

The number of epochs is the number of times we want our model to train on the entire training set. Now, let's say we want our model to train on the entire training set for 50 epochs. But what if I simply copy my entire training set 10 times? Then, in a single epoch, we are effectively training the model on the original training set 10 times, and hence we can reduce the number of epochs by a factor of 10, i.e., we now need only 5 epochs instead of 50. Similarly, in the lab, the training data has been copied over 1000 times, so that the model can simply be trained for 10 epochs instead of 10 * 1000 = 10000.
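
Roughly, this is what the two equivalent setups look like in code (a sketch with illustrative variable names, not the lab's exact cell):

```python
import numpy as np

X = np.random.rand(200, 2)              # placeholder for the 200 training examples
Y = np.random.randint(0, 2, (200, 1))   # placeholder labels

# Option A: original data, many epochs
#   model.fit(X, Y, epochs=10_000)

# Option B: tile the data 1000 times and train for far fewer epochs
Xt = np.tile(X, (1000, 1))   # (200, 2) -> (200000, 2)
Yt = np.tile(Y, (1000, 1))   # labels copied the same way
#   model.fit(Xt, Yt, epochs=10)

print(Xt.shape, Yt.shape)    # (200000, 2) (200000, 1)
```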

Now, the reason why this is done doesn't seem so obvious to me, since the performance of the model in the two cases (200 samples with 10000 epochs versus 200000 samples with 10 epochs) should only differ to a small extent, and that only due to the inherent randomization, since the model essentially trains on any particular sample the same number of times. The only reason I can think of is that perhaps the developers of the lab wanted to reach a certain accuracy level that required on the order of 10000 epochs. However, since there is one log line per epoch, that would produce 10000 log lines and make the output much less readable, so they simply copied the dataset 1000 times to reduce the number of logged epochs by a factor of 1000.

I guess you would have understood by now why we can't do this for over-fitting. To deal with over-fitting, we can indeed try adding more training examples, but np.tile() only creates copies of the existing samples; it doesn't create any new samples.

And I guess the above question doesn’t need to be answered any more. I hope this helps.

Regards,
Elemento


Thank you very much!

Is this really so, @Elemento? I cannot imagine that training 10 times on a 1000-sample dataset gives the same result as training 1 time on a 10 * 1000-sample dataset (the 1000 samples copied 10 times).

Hey @liyu,

In my opinion, it should be. Let me present another perspective; perhaps it will make more sense to you. Consider any single training sample, and say it contributes an amount a to the training of the model each time the model sees it once. If we train the model on the original dataset for 100 epochs, the contribution of this particular sample is 100*a, since the model sees it 100 times over the course of training. And if we train the model for 10 epochs on the dataset copied over 10 times, the contribution of this particular sample is again 100*a. So essentially, both cases should lead to almost the same performance, with only small differences due to the inherent randomization. Let me know if this helps.
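
In symbols, with the numbers above, the total contribution of that one sample is the same either way:

$$(\text{epochs}) \times (\text{copies}) \times a = 100 \times 1 \times a = 10 \times 10 \times a = 100a$$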

Regards,
Elemento

Hi Elemento,

but how the model “sees” the dataset might, in my opinion, be different.
In the above example:
in one case, the model makes 100 baby steps following dJ/dw_i towards the minimum;
in the other case, the model makes only 10 baby steps following dJ/dw_i towards the minimum, although the J here covers 10 times more data.

I assume there might be a mathematical proof for this kind of question.

Thanks and Regards
Liyu

Hey @liyu,

Instead of a mathematical proof, we can run a simple experiment. Do try to train the model both ways, and observe the differences you find in the model's performance in the two cases. Please do share your results and conclusions with the community.
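
If you'd like a starting point, here is a rough, self-contained sketch of such an experiment (the data and the small model below are stand-ins, not the lab's exact code):

```python
import numpy as np
import tensorflow as tf

# Stand-in data: 200 examples, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.random((200, 2)).astype(np.float32)
Y = (X[:, :1] + X[:, 1:] > 1.0).astype(np.float32)

def make_model():
    tf.random.set_seed(1)  # same initial weights for both runs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(3, activation="sigmoid"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(0.01))
    return model

# Run A: original data, many epochs
model_a = make_model()
model_a.fit(X, Y, epochs=100, verbose=0)

# Run B: data tiled 10 times, 10x fewer epochs
Xt, Yt = np.tile(X, (10, 1)), np.tile(Y, (10, 1))
model_b = make_model()
model_b.fit(Xt, Yt, epochs=10, verbose=0)

# The two losses should end up close, with small differences from shuffling
print("loss A:", model_a.evaluate(X, Y, verbose=0))
print("loss B:", model_b.evaluate(X, Y, verbose=0))
```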

Cheers,
Elemento

[Image: the course's formula for the gradient dJ/dw_j, i.e. (1/m) times a sum over the m training examples]
If we look at the gradient of J, it seems that copying the same data does not make the gradient of J larger: m becomes 100m in the front, and the summation becomes 100 times larger, so overall the gradient is the same as when we don't copy the data. This means that, for each epoch (one full-batch update), w_j changes by the same amount whether we copy the data or not. So it isn't too clear how copying speeds things up.
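
Spelling that out for a full-batch gradient (a quick check, assuming the usual (1/m)-sum form of the gradient and the data copied 100 times):

$$\frac{\partial J_{\text{tiled}}}{\partial w_j} = \frac{1}{100m}\sum_{i=1}^{100m}\big(f_{\vec w,b}(\vec x^{(i)})-y^{(i)}\big)\,x_j^{(i)} = \frac{100}{100m}\sum_{i=1}^{m}\big(f_{\vec w,b}(\vec x^{(i)})-y^{(i)}\big)\,x_j^{(i)} = \frac{\partial J}{\partial w_j}$$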

On a closer look, it seems that tf.keras.optimizers.Adam is used instead of plain gradient descent. Perhaps Adam behaves differently, so copying the data works for Adam but not for gradient descent?

Hello @inverted, welcome to our community!

The key is that we DO NOT use all 100 copies in one round of gradient-descent updates. In each round, we use only 1 copy of the data, and in each epoch there will be 100 rounds of updates, because we set the batch_size in model.fit(...) to be exactly 1 copy of the data.

I think that, with the vanilla gradient descent algorithm we learn in this course, what copying the data helps with is reducing the number of epochs (from 100 epochs to 1 epoch of 100 rounds), and consequently reducing any overhead incurred when moving from one epoch to the next. For example, if we give model.fit a set of validation data, then the calculation of that validation metric is a reducible overhead. TensorFlow may also do something between two epochs that generates some overhead!
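
As a rough, self-contained sketch of the setup described above (the numbers, data, and model are placeholders, not the lab's actual ones):

```python
import numpy as np
import tensorflow as tf

m = 200                                                  # size of one copy of the data
X = np.random.rand(m, 2).astype("float32")               # placeholder features
Y = np.random.randint(0, 2, (m, 1)).astype("float32")    # placeholder labels
Xt, Yt = np.tile(X, (100, 1)), np.tile(Y, (100, 1))      # 100 copies

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy", optimizer="sgd")

# batch_size = m and shuffle=False -> each gradient update uses a batch the
# size of one copy of the data, so this single epoch performs 100 updates
# while paying the end-of-epoch overhead (validation, logging, ...) only once.
model.fit(Xt, Yt, batch_size=m, epochs=1, shuffle=False, verbose=0)
```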

First, Adam is a more advanced variant of vanilla gradient descent, so Adam is also a kind of gradient descent algorithm.

Second, the core of Adam is to replace a gradient with a moving average version of it.
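
For reference, Adam's moving averages (first and second moments) look roughly like this, where g_t is the gradient at step t (note that m_t here is Adam's first moment, not the dataset size m):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad w \leftarrow w - \alpha\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

where the hats denote the bias-corrected estimates.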

If we train a total of 1 epoch with data that’s copied 10 times, since we use “one copy” as the batch_size in model.fit(...), we will have 10 updates, and the last update will use the moving average of gradients coming from the 10 copies.

By contrast, if we train for a total of 10 epochs without enlarging the data, then we will again have 10 updates, and the last update will again use the same moving average of gradients over those 10 passes.

Therefore, I think using Adam or the vanilla gradient descent won’t make a difference here.

However, if we use a learning rate scheduler, it WILL depend on the epoch number, so reducing the number of epochs will have some effects!
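
For example, a Keras learning-rate scheduler is typically a function of the epoch index; a made-up sketch (the decay rule here is arbitrary, and model, Xt, Yt stand for whatever you trained above):

```python
import tensorflow as tf

def schedule(epoch, lr):
    # Arbitrary made-up rule: keep the learning rate for 5 epochs,
    # then decay it by 5% per epoch. Because it keys off `epoch`,
    # tiling the data (fewer epochs, more steps per epoch) changes
    # how often this decay actually fires.
    return lr if epoch < 5 else lr * 0.95

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
# model.fit(Xt, Yt, epochs=10, callbacks=[lr_callback])
```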

Cheers,
Raymond

Hey @inverted,
Just to add to @rmwkwok 's brilliant explanation: in this lab, we haven't set the batch size, so TensorFlow uses the default batch size of 32. But even in this case, Raymond's explanation stays the same, since if earlier there were, say, 5 updates in a single epoch with a batch size of 32, there will now be 5000 updates in a single epoch, because we are copying the data 1000 times.
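
For the record, with the 200 examples mentioned earlier in this thread and the default batch size of 32, the actual counts would be roughly:

```python
import math

m, batch_size, copies = 200, 32, 1000
print(math.ceil(m / batch_size))            # 7 updates per epoch on the original data
print(math.ceil(m * copies / batch_size))   # 6250 updates per epoch on the tiled data
```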

Cheers,
Elemento

Thank you @Elemento!