Are metrics for each batch or whole dataset?

Hello all

(W3/assessment)

I see that while training is in progress, accuracy and loss improve gradually within each individual epoch,
but when the next epoch starts, these values jump.

Regarding this, I have two questions:
1- Are the accuracy and loss metrics, which change gradually during each epoch, computed over the whole training dataset?
2- Why does this happen (accuracy improves gradually within each epoch and then suddenly improves at the start of the next epoch)?

Please move your topic to the correct subcategory.
Here’s the community user guide to get started.

Different batches of data may cause the cost to increase temporarily, because the model has not yet included that data in training.

Thanks for your answer.

I mean this part:
Epoch 1/10
1875/1875 [================>==============] - 51s 27ms/step - loss: 0.1780 - accuracy: 0.9456

So the loss and accuracy shown during each individual epoch are calculated against the whole training dataset. Is that correct?

Why do they jump suddenly at the start of the next epoch?

As mentioned in the question, this is related to the week 3 assignment. Is that not the right place?

Sorry, I’m not a mentor for this course, so I do not know the details of this assignment.

I think YES.

This is an interesting question. One possible reason might be the shuffle of data. So, at every epoch, the order of the data changes. But I am not sure. Let’s see what other mentors say about this question.

This is the pseudocode of the training loop:

for epoch in range(self.num_epochs):
  for batch in self.batches():
    X, y = batch
    self.optimize(batch)
    y_hat = self.predict(X)
    self.metric.update_metric(y_true=y, y_pred=y_hat)
    self.do_bookkeeping(self.metric.result())
  self.metric.reset()

It’s possible to keep a running track of the accuracy such that when you call metric.update_metric(), the count and the number of correct predictions are updated, and when metric.result() is called, the result is computed as num_correct / count. When metric.reset() is invoked, all counters are reset to 0. As a result, when the next epoch begins, the reported accuracy jumps, because it no longer averages in the earlier batches of the previous epoch, and then becomes a smoother estimate of the real accuracy as more batches of data are processed.

Here’s some code to show how calculations work:

import tensorflow as tf

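def accuracy(y_true, y_pred):
    # 1.0 where the prediction matches the label, 0.0 otherwise
    return tf.cast(y_true == y_pred, tf.float32)

# The wrapper keeps a running mean of the values returned by accuracy()
metric = tf.keras.metrics.MeanMetricWrapper(fn=accuracy)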

# 1st epoch
y_true = [[1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0]]
metric.update_state(y_true, y_pred) # <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=4.0>

metric.result() # <tf.Tensor: shape=(), dtype=float32, numpy=0.75>

y_true = [[1, 1, 0, 0]]
y_pred = [[1, 1, 0, 0]]
metric.update_state(y_true, y_pred) # <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=8.0>
metric.result() # <tf.Tensor: shape=(), dtype=float32, numpy=0.875>

# 2nd epoch
metric.reset_state()
y_true = [[1, 1, 0, 0]]
y_pred = [[1, 1, 0, 0]]
metric.update_state(y_true, y_pred) # <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=4.0>
metric.result() # <tf.Tensor: shape=(), dtype=float32, numpy=1.0>
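
To see why the displayed value jumps when the counters are reset, here is a small plain-Python sketch. The per-batch accuracies are made-up numbers for a steadily improving model, and running_average_log is a hypothetical helper, not part of Keras; it only assumes the metric is a running mean that is reset at the start of every epoch.

```python
def running_average_log(batch_accuracies, batches_per_epoch):
    """Return the value shown after each batch, assuming the metric is a
    running mean over the current epoch that resets when a new epoch starts."""
    shown = []
    total, count = 0.0, 0
    for i, acc in enumerate(batch_accuracies):
        if i % batches_per_epoch == 0:
            total, count = 0.0, 0  # reset counters at the start of each epoch
        total += acc
        count += 1
        shown.append(total / count)
    return shown


# Hypothetical per-batch accuracies: two epochs of four batches each
batch_accuracies = [0.50, 0.60, 0.70, 0.80, 0.85, 0.90, 0.92, 0.94]
shown = running_average_log(batch_accuracies, batches_per_epoch=4)

for i, value in enumerate(shown):
    print(f"epoch {i // 4 + 1}, batch {i % 4 + 1}: shown accuracy = {value:.4f}")
```

In this sketch the last value shown in epoch 1 is about 0.65, because it still averages in the weak early batches, while the first value shown in epoch 2 is 0.85, which reflects only the current state of the model.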

@saiman

Do you have a good reason for not moving your topic to the correct subcategory?

Hi balaji. I think now the subcategory is correct. I hope I don’t make that mistake again.

I appreciate your complete answer, but I found some material that seems to contradict it. I would be thankful if you could take a look at it and give me your feedback. Here it is said that the metric and loss values are not based on one batch, but are an average over all batches processed so far.

" on_batch_end() type callback function gets the accuracy of the batch that just got trained. Whereas the logs printed by keras is the average over all the batches that it has seen in the current epoch. You can easily observe that in your logs. say in first 2 batches one accuracy was 0.0 and 1.0, which made the overall accuracy over 2 batches seen as 0.5000. here is exactly where the average is calculated.
From https://stackoverflow.com/questions/42004948/accuracy-from-callback-and-progress-bar-in-keras-doesnt-match?rq=3
"

WHAT ARE STEPS AND EPOCHS?
Steps refer to the number of batches processed by the model during training. A batch is a subset of the training data used to update the model’s weights. The number of steps therefore defines how many batches the model processes, that is, how many weight updates it performs.
On the other hand, an epoch is defined as one complete pass through the entire training dataset. In other words, one epoch means the model has seen the entire training data once. During an epoch, the model goes through multiple batches of the training data, updates its weights, and learns from the data.

DIFFERENCE BETWEEN STEPS AND EPOCHS
The main difference between steps and epochs is that epochs refer to the number of times the model sees the entire training dataset, while steps refer to the number of batches processed during training.

For example, suppose you have a training dataset of 10,000 images, and you set the batch size to 100. In that case, each epoch will consist of 100 batches, with each batch containing 100 images. Therefore, to complete one epoch, the model will process 100 batches, each with 100 images, resulting in 10,000 images seen by the model.
Suppose you set the number of steps to 1000 and the batch size to 10. In that case, the model will process 10 images in each batch, resulting in 10,000 images seen by the model after 1000 steps.
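
The arithmetic in these two examples can be checked with a short sketch. The helper names are hypothetical, and rounding up assumes the final partial batch is kept rather than dropped. As a side note, the 1875 steps in the progress bar quoted earlier in the thread would be consistent with, for instance, 60,000 examples at a batch size of 32, though the thread states neither number.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches needed for one full pass, assuming the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def examples_seen(num_steps, batch_size):
    """Examples processed after num_steps full batches."""
    return num_steps * batch_size


print(steps_per_epoch(10_000, 100))  # 100 batches per epoch
print(examples_seen(1_000, 10))      # 10000 images after 1000 steps
print(steps_per_epoch(60_000, 32))   # 1875 (assumed dataset and batch size)
```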

RELATIONSHIP BETWEEN STEPS AND EPOCHS
The relationship between steps and epochs in TensorFlow depends on how you define your training process. You can define either the number of steps or the number of epochs for your model’s training. However, it is essential to understand how the two parameters affect your model’s performance.

If you define the number of epochs for your model’s training, TensorFlow will automatically calculate the number of steps required to complete the training. For example, with the 10,000-image dataset above, if you set the number of epochs to 10 and the batch size to 100, the model will process 100 batches in each epoch, resulting in 1000 steps for the entire training process.

On the other hand, you can define the total number of steps for your model’s training, and the number of epochs then follows from the dataset size and the batch size. For example, with the same 10,000-image dataset, if you set the number of steps to 1000 and the batch size to 100, the model will process one batch of 100 images in each step, so 100 steps make up one epoch and 1000 steps correspond to 10 epochs for the entire training process.
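
As a rough check of this relationship for the 10,000-image example, the conversion in both directions can be sketched as follows. The function names are hypothetical and are not TensorFlow APIs; rounding up again assumes the last partial batch is kept.

```python
import math


def total_steps(num_epochs, num_examples, batch_size):
    """Total batches processed when training for num_epochs full passes."""
    return num_epochs * math.ceil(num_examples / batch_size)


def epochs_completed(num_steps, num_examples, batch_size):
    """Number of passes over the data that num_steps batches amount to."""
    return num_steps / math.ceil(num_examples / batch_size)


print(total_steps(10, 10_000, 100))          # 1000 steps for 10 epochs
print(epochs_completed(1_000, 10_000, 100))  # 10.0 epochs for 1000 steps
```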

In general, increasing the number of steps or epochs can lead to better model performance, but it can also increase the training time. It is essential to find the right balance between the two parameters to achieve optimal performance while keeping the training time reasonable.

Steps and epochs are crucial parameters for training deep learning models. Steps refer to the number of batches processed by the model during training, while epochs refer to one complete pass through the entire training dataset. The relationship between the two parameters depends on how you define your training process.

Regards
DP

Please note that while the callback can contain the metric for the current batch, my explanation holds true based on how metrics are calculated / updated and reset.
Metric and Callback are different entities.