Train Generator and Imbalanced Dataset

Hello everybody,

I am learning to code CNNs with TensorFlow, and I have some follow-up questions about the train generator and batch size that were taught in the first week of the 2nd course of the TF specialization.

Let me describe the following scenarios:

Scenario 1

  • I have a binary classification task: Cat/Dog
  • I have only a train folder; inside it are the subfolders “Cat” and “Dog”
  • each subfolder has 1000 images (cats + dogs = 2000 images in total)

Having 2000 images in the train folder, I code the train generator with batch_size = 20 (train_dir below stands for the path to my train folder):

train_generator = train_datagen.flow_from_directory(
    train_dir,
    batch_size=20,
    class_mode='binary')

It means that the train generator loads 20 images per batch, and as a result I have to set the steps_per_epoch parameter in the fit method accordingly to run through all the data, correct?

It means that there must be 100 steps per epoch to run through all 2000 images (100 * 20), am I right?

model.fit(train_generator, steps_per_epoch=100)

Prediction:
Then let's say I do a prediction. The classes fed to the train generator are indexed in alphabetical order of the subfolder names, so if the model predicts 0, it is a cat, correct?
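
I guess I could double-check that mapping myself; a quick sketch, assuming the generator defined above:

# flow_from_directory assigns class indices alphabetically by subfolder name
print(train_generator.class_indices)  # I would expect: {'Cat': 0, 'Dog': 1}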

Scenario 2: Imbalanced Data
Let's say the cat/dog data are imbalanced:

  • in subfolder Cat there are 1000 images,
  • in subfolder Dog there are only 500 images.
  • Total number of images: 1500

With 1500 images in total, I set batch_size = 15 and steps_per_epoch = 100 to run through all the examples. But with imbalanced data, when I load 15 images per batch into the model, are the cat/dog images loaded according to the percentage share of each class?

If there are 1000 cat images and 500 dog images, does the generator load each batch as, say, 10 cat images and 5 dog images?
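
I guess I could inspect one batch to see the actual mix; a quick sketch, assuming the generator from above with class_mode='binary':

import numpy as np

# take one batch from the generator and count how many labels of each class it holds
images, labels = next(train_generator)
print(np.bincount(labels.astype(int)))  # e.g. [10  5] would mean 10 cats and 5 dogs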

Or, in the case of imbalanced data, should I always make the classes balanced so that both subfolders have the same number of images (e.g. by getting more data)? Could you please tell me what the “best practice” is in this regard?

I know these are kind of rookie questions, but I am trying to understand how the train generator and the model behave with different data.

Thank you very much for your answer :slight_smile:

Best Regards

Filip

Hi @Filip1988,

Tricky question…:slight_smile:

You can find many different approaches to this topic, depending on the problem you are trying to solve.

To keep it short, I’d start by focusing on:

  • Data augmentation and re-sampling. You can choose either under-sampling - removing data from the majority class - or over-sampling - adding repeated data to the minority class. Image augmentation is discussed in detail during the course.
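    Something like this for the augmentation side (a minimal sketch; the parameter values are only examples, and plain over-sampling would simply mean duplicating minority-class files in the directory):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# random transforms produce a different-looking version of each image per epoch,
# which helps stretch a small minority class further
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)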

  • Re-weight. You can assign relatively higher costs to examples from the minority classes - the less data a certain class has, the higher the weight it gets, and vice versa. There is a parameter named class_weight in model.fit which can be used to balance the weights.
    Something like this:

import numpy as np
from sklearn.utils import class_weight
...
# one weight per class, inversely proportional to its frequency in y_train
classes = np.unique(y_train)
weights = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
...
model.fit(X_train, Y_train, ..., class_weight=class_weights)
  • Change loss function. One of the common loss functions for tackling the class imbalance problem is Focal Loss - you can check this article for details. This function reduces the weight of correctly predicted examples, which has the net effect of putting more training emphasis on the data that is hard to classify. I believe it is implemented in the tfa (TensorFlow Addons) module - check this link.
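    Something along these lines (a sketch only; the alpha/gamma values are the common defaults, not a recommendation):

import tensorflow_addons as tfa

# focal loss down-weights easy, well-classified examples so that training
# focuses on the hard (often minority-class) ones
model.compile(optimizer='adam',
              loss=tfa.losses.SigmoidFocalCrossEntropy(alpha=0.25, gamma=2.0),
              metrics=['accuracy'])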

You can give any of these approaches a try.

Have fun!


Thank you, I am gonna try it :slight_smile:

Hi… I used your advice for rock classification, but none of the approaches achieved a good accuracy. With augmentation and over-sampling the accuracy is lower than without them… I have tried many things, but with no benefit.