Do you mean that only 900 of the 10,000 images contain certain classes?
The answer again depends on how the classes are distributed in your data.
For example, if you are saying that there are only one or two such classes and that they appear in just 900 images, then my next question is: how are the true values and label values distributed in your dataset?
Can I ask whether the multiclass labels are one-hot encoded? Or are the predicted values classified again into different classes?
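To make the encoding question concrete, here is a minimal sketch in plain Python. The labels, the prediction vector, and the helper names `one_hot` and `argmax` are made up for illustration and are not taken from your code. In Keras, `categorical_crossentropy` expects one-hot targets like these, while `sparse_categorical_crossentropy` expects the integer class ids directly.

```python
def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return [[1 if i == label else 0 for i in range(num_classes)]
            for label in labels]


def argmax(row):
    """Index of the largest value, i.e. the predicted class id."""
    return max(range(len(row)), key=lambda i: row[i])


# Hypothetical integer labels for three images and three classes.
labels = [0, 2, 1]
encoded = one_hot(labels, num_classes=3)
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]

# A hypothetical prediction vector mapped back to a single class id.
pred = [0.1, 0.7, 0.2]
print(argmax(pred))  # 1
```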
A training accuracy of 98% against a validation accuracy of 92% points to a variance issue: the model fits the training set better than it generalizes.
See the pinned comment, which explains the bias and variance issue.
Also, as you stated, some classes appearing in only 900 images of course points to an imbalanced dataset, but it again depends on how your data is split between training and validation.
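One quick way to check the split is to compare the class fractions in the training and validation sets. This is only a sketch: the label lists are invented and `class_fractions` is a hypothetical helper, so substitute your own labels.

```python
from collections import Counter


def class_fractions(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Made-up labels; replace these with the labels of your own splits.
train_labels = ["cat"] * 90 + ["dog"] * 10
val_labels = ["cat"] * 19 + ["dog"] * 1

print(class_fractions(train_labels))  # {'cat': 0.9, 'dog': 0.1}
print(class_fractions(val_labels))    # {'cat': 0.95, 'dog': 0.05}
```

If the fractions differ a lot between the two sets, a stratified split is worth considering.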
Do channel and spatial attention mean that your classification is multidimensional? If so, categorical_crossentropy wouldn’t fit.
Also, regarding the attention layers where you used ReLU activation: do the last layers include dense layers, and if so, how many units does the final dense layer have?
Also, since you are working on a multiclass model, why didn’t you opt for the softmax activation function, which would have been a better choice?
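To illustrate why softmax suits a multiclass output, here is a small sketch in plain Python. The logits are invented, and the two functions are simplified stand-ins for illustration, not the Keras implementations. Softmax turns the raw outputs into probabilities that sum to 1, which is what categorical cross-entropy assumes, whereas ReLU only clips negative values.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probs) if t > 0)


# Hypothetical raw outputs of the last dense layer for one image.
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
relu_out = [max(0.0, x) for x in logits]

print(sum(probs))     # approximately 1.0
print(sum(relu_out))  # approximately 3.1, not a probability distribution

# Loss when the true class is class 0.
loss = categorical_crossentropy([1, 0, 0], probs)
print(loss)
```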
Regards
DP