TensorFlow epoch questions

I am not clear about the purpose for epoch in TF training phase. Even after reading the description, it is the rounds of applying the whole training set.

  1. Between each epoch, is learning_rate changed? (E.g. smaller learning steps)? If so, then make sense to me.

  2. If the cost function is convex, one epoch should be able to find the global minimum if learning_rate is small.
    So not sure what’s the value for epoch in that case.

  3. Are the later epoch phase’s initial parameters the same as the previous round’s trained parameters?

Thanks.

1 Like
  1. The learning rate does not change.
  2. No, one iteration isn’t sufficient. Whether the cost function is convex does not matter with regard to how quickly the minimum is found.
  3. On each epoch, the weight values are updated, and are used on the next epoch.
3 Likes

Thanks for the reply.

Still not clear about 2).

Assume a simplified case,
- single Dense with single-unit.
- sigmoid activation functions
- standard logistic cost function (which is convex)
- standard gradient descent optimizer

Then, if a not over-large learning rate is used, then NN should be able to reach the global minimum in one epoch? (Just like something we learned from the previous courses?)

Or maybe there is some constraint about max iterations for the optimizer within each epoch?

Thanks. (Since I am just beginner, maybe I missed something from the course)

1 Like

Hello @Lizhang_Qin, let’s go back to the basic.

Each time, gradient descent moves a weight by this amount: \alpha \frac{\partial{J}}{\partial{w}}, so for it to be moved to the minimum within an epoch, we need (1) enough # of moves, (2) right size of learning rate \alpha and (3) right gradient size. Before I go into the (1-3), let’s look at the following slide.

The gradient is essentially the red line which depicts the slope at our current location. Noted that it’s only telling us how much J is expected to reduce by changing w, it doesn’t tell us how large the change should be in order to get to the minimum, so that’s why we need to move step-by-step and thus discussing (1-3) becomes relevant.

(1) # of moves. It is equal to the # of gradient descents. In each epoch, we split our samples into batches and conduct one gradient descent per batch, so if we have infinite # of samples, we then have infinite # of batches, thus we can do infinite # of gradient descents and make infinite # of moves. In this case we can get to the minimum in one epoch, however, we don’t have infinite # of samples, so we can’t guarantee that.

(2) learning rate. Obviously can’t be too large as you said, and also can’t be too small. It can take you forever if you set your learning rate to 0.

(3) size of gradient (and batch size). We know the gradient is essentially error averaged over samples in a batch. If we include all samples into one batch, the errors can be averged out, resulting in very small (but smooth) steps. If we have only a single sample per batch, some steps can be larger (and thus the steps are more stochastic). So choosing a batch size balances between the step size and the smoothness of the move, and in other words, in the context of our discussion, it’s sacrificing the step size for smoothness.

Above are the factors: for # of samples, we sometimes can control and sometimes we can’t. For learning rate, we never know which is the most appropriate one, we only know some choices will work and some won’t. For batch size, it’s a balance that we need to have a smooth convergence. So, we can’t guarantee to converge in one epoch!

There can be other factors - e.g. feature normalization. If the features have very different scales and we don’t normalize, we will be forced to choose a learning rate to fit the feature of the largest scale and although that learning rate is appropriate for that feature, it could be too small for the others and therefore it will take more steps for the others to also be converged. Illustration and explanation of this can be found here.

Cheers!
Raymond

3 Likes

One more point on your first question. In the scope of our MLS, our gradient descent uses one and only one learning rate throughout the whole training process. However, outside of our scope, when we use a different optimizer such as Adam, the learning rate is de facto changed from epoch to epoch. I am rasing this point up because your question looks like to be about TensorFlow in general, and so in general, we have many more advanced optimizers and the Adam optimizer, for example, is one of our default choices.

2 Likes

Very detailed explanations! Thanks.

I am very clear now.

1 Like

Wonderful, and you are welcome Lizhang!

1 Like