Help Me With NN Model Accuracy


data type: Images
data without noise
done: data augmentation
predicting 33 classes using ResNet50 architecture
val_data size = 8%
batch size = 64
epochs = 100

from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10, verbose=1)  # halve LR after 10 epochs with no val_loss improvement
es = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)  # stop after 20 epochs with no val_loss improvement

Please suggest what I should do based on the above.

Since we’re dealing with an image classification problem, I’m assuming that human-level accuracy is > 95%. Given that the training accuracy looks pretty low, you have a high-bias issue. The large gap between training and validation accuracy indicates a high-variance problem as well.
Have you completed the Deep Learning Specialization?

@TMosh Does MLS cover error analysis?


Yes, I was going to mention the same point that Balaji did: 0.4 training accuracy is very low even before you start looking at the “overfitting” problem (if you can really call it that in this case :laughing:).

It might help to know more about your dataset: how many total samples do you have?

Did you create the model from scratch in PyTorch, or are you doing Transfer Learning with a predefined model? In the latter case, are you training from scratch, or is the model pretrained on some different dataset?

How much experience do you have in using PyTorch and what ML/DL courses have you taken?

Just thinking a little more, my first approach would be to investigate whether there is some fundamental problem with your setup. E.g. your output layer is not set up correctly for the class labels or the like. Or maybe something simple like using raw RGB images as the inputs without normalizing the pixel values to be in [0, 1] (divide by 255) or [-1, 1] (mean normalization).
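For example, here is a minimal sketch of a correctly matched output layer and loss for 33 classes (the input feature size and the model itself are hypothetical, not from the original post):

import tensorflow as tf
from tensorflow.keras import layers

# 33 classes require a 33-unit softmax output layer
model = tf.keras.Sequential([
    layers.Input(shape=(2048,)),  # hypothetical feature vector size
    layers.Dense(33, activation="softmax"),
])
# The loss must match the label encoding:
# integer labels -> sparse_categorical_crossentropy, one-hot labels -> categorical_crossentropy
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])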

Yes sir, I have completed DLS. About human-level accuracy: yes, you can say it is > 95%. I want to improve my model; it stops after 40 epochs instead of going for the full 100. Also, this is my first time working on an image dataset, so please recommend something if possible.

Thank you for replying, sir. Based on my knowledge, the latest research on this dataset reports a maximum accuracy of 69%. This is my first time working with an image dataset. I am currently using the ResNet50 architecture, but not by importing pre-trained weights. Previously, I used AlexNet, which gave very poor results with about 21% accuracy on both the validation and training sets. For data preprocessing, I will try training without normalization. Additionally, I am applying data augmentation only to the training set, not the validation set. Finally, my model training stops at epoch 40.
My experience: beginner

Why are you starting with data augmentation? Note that you did not answer my question about the size of your dataset. That’s a pretty key fact to know.

That was not my recommendation: normalization is critical to get convergence. I was suggesting that perhaps your mistake was not using normalization. If the images are raw RGB images with uint8 pixel values between 0 and 255, it is important to normalize. Just dividing the values by 255 is the first thing to try. But make sure to dump the pixel values of some of your images to check what they are; maybe they are already between 0 and 1.
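For instance, a quick sanity check along these lines (train_generator is a hypothetical name for whatever yields your batches):

# Inspect the raw pixel range of one batch before deciding how to normalize
batch_x, batch_y = next(train_generator)  # hypothetical generator yielding (images, labels)
print(batch_x.dtype, batch_x.min(), batch_x.max())

# If the values are uint8 in [0, 255], scale them to [0, 1]
batch_x = batch_x.astype("float32") / 255.0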

Why does that happen? It must have to do with how you have specified the parameters to the various PyTorch training APIs. You need to read the documentation and figure that out.

Q1: The dataset contains 33 classes, each with close to 1k images.
Q2: I am not using PyTorch but TensorFlow.

When I googled ReduceLROnPlateau I only got links to PyTorch, which is why I jumped to that wrong conclusion. Sorry. I had never seen that API before. Ok, I went to the TF website and searched there and there is a TF API like that, but it’s a “callback”. I don’t remember a case in which we used TF callbacks in DLS. There are lots of good tutorials on the TF documentation site. I hope you’ve read some of those.

That should be enough to get better than 40% training accuracy, one would normally think. But you also said that research with this dataset has not been able to get better than 69% accuracy. Was that training accuracy or test accuracy? In either case, that makes it sound like this is a pretty challenging problem. :scream_cat: If the researchers can only get 69%, then maybe you’re not doing so badly as a relative beginner getting 40% (so far). :smile:

Try data augmentation on both the training and validation datasets if you want the two to be processed consistently.

That is probably why your training and validation accuracy curves move in opposite directions on the graph.

@balaji.ambresh's input was probably right in directing you to work on that bias and variance issue.

Restart and re-approach, but understand your data thoroughly first.

How are your 33 classes spread across your data? Are all 33 classes of the same data type? The validation data size of 8% you mention suggests you used most of your data for training, resulting in better training accuracy.

This happens because of EarlyStopping callback. It monitors the validation loss and then stops training when no improvement happens for 20 epochs. One way to ensure that this doesn’t happen is to set patience to the number of epochs the model is trained for.
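A minimal sketch of that workaround, reusing the 100-epoch setting from the original post:

from tensorflow.keras.callbacks import EarlyStopping

# With patience equal to the total number of epochs, training is never cut short
es = EarlyStopping(monitor='val_loss', patience=100, restore_best_weights=True)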

It’d help if you shared links to the dataset and the SOTA model / paper.

This is fine. When augmentation is performed, the learning curves are likely to be bouncy. One approach to deal with this is to wait for the learning curves to stabilize before drawing conclusions about the model.

Is there a good reason for using only the model architecture and not the pre-trained weights?
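For comparison, loading the pre-trained weights is a small change; a minimal sketch, assuming 224x224 RGB inputs:

from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, Model

# Reuse ImageNet weights instead of training ResNet50 from scratch
base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the backbone initially; fine-tune later if needed

outputs = layers.Dense(33, activation="softmax")(base.output)
model = Model(base.input, outputs)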

Are the label distributions the same across train / test splits? If a method like train_test_split is used, set stratify parameter to the label field.
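A minimal sketch of a stratified split (paths and labels are hypothetical arrays of file paths and integer labels; 0.08 matches the validation size in the original post):

from sklearn.model_selection import train_test_split

# stratify=labels keeps the class proportions identical in both splits
train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.08, stratify=labels, random_state=42
)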

  1. Now running the code without callbacks; this is the result.
  2. The latest paper uses another dataset. I do not have access to that paper, but they used an attention-based model combining both the image and the text on the image to make predictions.
  3. But here is what other papers that use this dataset have mentioned:

The pre-trained AlexNet with transfer learning resulted in a test set Top 1 classification accuracy of 24.7%, 33.1% for Top 2, and 40.3% for Top 3, which are 7.4, 5.0, and 4.0 times better than random chance respectively. As comparison, using the modified LeNet, we had a Top 1 accuracy of 13.5%, Top 2 accuracy of 21.4%, and Top 3 accuracy of 27.8%. The AlexNet performed much better on this dataset than the LeNet.

  1. I am not using train_test_split or stratify, but simply:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255.0,
    rotation_range=20,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.15,
    zoom_range=0.15,
    brightness_range=[0.8, 1.2],
    horizontal_flip=True,
    validation_split=0.08  # reserve 8% of the data for validation
)

It’s impossible for me to help without knowing details about the dataset you’re using. Besides, if there’s an implementation that uses an attention-based mechanism (i.e. a transformer architecture) and a multimodal (text and image) approach to achieve good results, please explain what you’re trying to do with ResNet and AlexNet.

Without callbacks, why are the epochs ending around 70 instead of 100?

ImageDataGenerator is deprecated in favor of tf.keras.preprocessing.image_dataset_from_directory. Remember to provide the seed to get reliable non-overlapping splits between training and validation data, and subset to be explicit about the split you’re interested in.
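A minimal sketch of that approach (the directory path, image size, and seed are assumptions, not from the original post):

import tensorflow as tf

# The same seed and validation_split on both calls give non-overlapping splits
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/images",  # hypothetical dataset directory, one subfolder per class
    validation_split=0.08,
    subset="training",
    seed=42,
    image_size=(224, 224),
    batch_size=64,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/images",
    validation_split=0.08,
    subset="validation",
    seed=42,
    image_size=(224, 224),
    batch_size=64,
)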

This is the dataset:

Because I don't know much about them (attention and multimodal).

Please start here to get an idea of what multimodal means:

  1. Wikipedia
  2. On GCP

As far as the general concept of attention is concerned, transformers were covered in Course 5 of DLS. Was that information insufficient, or did you forget the lectures? Transformers in the vision field weren't covered in the course, so depending on the underlying model, you might want to explore further.

There are a few notebooks that try to classify an image of the book cover into the category. This one might be useful since there isn’t much of a logical structure in images of a book IMO. Describing the book cover and using text features (maybe along with image features) could produce better results.

Does multimodal mean that there is one model which processes everything or are there multiple models that process each type, like text, video and so on?

Multiple types of input are handled by the final model. How different inputs are processed & combined is your call (see this as well).
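To make that concrete, here is a minimal late-fusion sketch in Keras (all layer sizes, the vocabulary size, and the token length are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

image_in = tf.keras.Input(shape=(224, 224, 3))         # image branch
text_in = tf.keras.Input(shape=(50,), dtype="int32")   # hypothetical token-id sequence

img_feat = layers.GlobalAveragePooling2D()(layers.Conv2D(32, 3, activation="relu")(image_in))
txt_feat = layers.GlobalAveragePooling1D()(layers.Embedding(10000, 64)(text_in))

# Late fusion: concatenate both modalities, then classify into 33 classes
merged = layers.concatenate([img_feat, txt_feat])
outputs = layers.Dense(33, activation="softmax")(merged)

model = tf.keras.Model([image_in, text_in], outputs)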

Here’s the image from the GCP page:

Hello,
You can use early stopping.
I think it can help you avoid overfitting in your model, because your validation loss and training loss are roughly equal up to a certain point, after which the validation loss increases.
You can use it to handle this problem.

I split the data using train_test_split with stratify set to the labels. Using ResNet50.

Each class contains almost 900 images for training and 50 for testing.

Data augmentation for test set = False
Data augmentation for train set = True

While training:
241/241 ━━━━━━━━━━━━━━━━━━━━ 693s 2s/step - accuracy: 0.0430 - loss: 4.0123 - val_accuracy: 0.0495 - val_loss: 3.4664
Epoch 2/80
241/241 ━━━━━━━━━━━━━━━━━━━━ 419s 2s/step - accuracy: 0.0587 - loss: 3.4702 - val_accuracy: 0.0632 - val_loss: 3.4085
Epoch 3/80

Here is the result:



The validation error remains the same.

CASE 2
Data augmentation for test set = True
Data augmentation for train set = True

While training:

483/483 ━━━━━━━━━━━━━━━━━━━━ 651s 1s/step - accuracy: 0.0449 - loss: 3.5944 - val_accuracy: 0.0487 - val_loss: 5.2786
Epoch 2/50
483/483 ━━━━━━━━━━━━━━━━━━━━ 455s 928ms/step - accuracy: 0.0598 - loss: 3.4806 - val_accuracy: 0.0644 - val_loss: 3.4798
Epoch 3/50

Here is the result:

NOTE: I want to explore different approaches to working with image datasets.
Since I am a beginner, I do not want to use attention-based models or large language models (LLMs). I prefer to start with simpler models.