I tried to fit the CoAtNet model, which combines convolution (MBConv blocks) and relative self-attention for image classification, to the CIFAR-100 dataset, following the documentation and sample code. But no matter how much I played with the hyperparameters, I could not get the validation accuracy to exceed 50-60%. From the fact that the model reached 90.8% accuracy when trained on ImageNet-21k + ImageNet-1k, I estimate these values should be above 90% for such a model (even if the same hyperparameter values do not carry over). Since the dataset and the number of classes to predict are quite small, I think there is hope for the model to approach 100% validation accuracy, but as I mentioned, validation accuracy starts to fluctuate or stays constant above 50%. Meanwhile, training accuracy usually keeps increasing, but it never reaches 100% either. I thought this might be due to overfitting, and I tried the following four things one should do when encountering overfitting:
 Reduce model complexity.
I tried this by reducing the number of blocks, the number of block repetitions, and the number of units per block, but this is still the item I am least sure I implemented correctly.
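To make the first item concrete, this is roughly what I mean by reducing complexity. The stage settings below ([2, 3, 5, 2] block repetitions, [96, 192, 384, 768] channels) are the CoAtNet-0 values from the paper; `reduce_config` is a hypothetical helper of mine, not part of any library:

```python
def reduce_config(num_blocks, out_channels, depth_scale=0.5, width_scale=0.5):
    """Shrink block repetitions and channel widths per stage.

    Keeps at least 1 block and 8 channels per stage, and rounds channel
    counts down to a multiple of 8.
    """
    new_blocks = [max(1, int(n * depth_scale)) for n in num_blocks]
    new_channels = [max(8, int(c * width_scale) // 8 * 8) for c in out_channels]
    return new_blocks, new_channels

# CoAtNet-0 stage settings (S1..S4) from the paper:
coatnet0_blocks = [2, 3, 5, 2]
coatnet0_channels = [96, 192, 384, 768]

small = reduce_config(coatnet0_blocks, coatnet0_channels)
print(small)  # ([1, 1, 2, 1], [48, 96, 192, 384])
```

The reduced lists would then be passed wherever the implementation takes its per-stage `num_blocks` / `out_channels` arguments.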

Experiment with the drop rate hyperparameter for the last layer and the drop connect rate hyperparameter for each block's output. (This improved validation accuracy, but not nearly enough.)
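For reference, here is a minimal NumPy sketch of what I understand drop connect (stochastic depth) to do at the level of a single residual block; this is just the idea, not the Keras implementation I am actually using:

```python
import numpy as np

def drop_connect(residual, shortcut, drop_rate, training, rng=None):
    """Stochastic depth for one residual block.

    During training, the entire residual branch is dropped with probability
    `drop_rate`; when kept, it is scaled by 1/(1 - drop_rate) so the
    expected output matches inference behavior.
    """
    if not training or drop_rate == 0.0:
        return shortcut + residual
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - drop_rate
    if rng.random() >= keep_prob:
        return shortcut                    # branch dropped entirely
    return shortcut + residual / keep_prob
```

In deeper networks the per-block rate is often increased linearly with depth, so early blocks are almost always kept and later blocks are dropped more aggressively.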

Apply more than one hyperparameter combination.
I also tried a Bayesian optimization algorithm for hyperparameter selection, but I would appreciate advice on applying such algorithms.
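By "more than one combination" I mean an exhaustive sweep like the sketch below; `train_and_evaluate` is a hypothetical placeholder for the real training run, which would build the CoAtNet model with the given hyperparameters, train on CIFAR-100, and return the best validation accuracy. A Bayesian tuner (e.g. keras-tuner's `BayesianOptimization`) would replace the exhaustive loop with guided sampling:

```python
import itertools

def train_and_evaluate(params):
    # Placeholder objective so the sketch is runnable; the real version
    # would train the model and return validation accuracy.
    return 1.0 - abs(params["drop_rate"] - 0.2) - abs(params["lr"] - 1e-3) * 100

search_space = {
    "lr": [1e-2, 1e-3, 1e-4],
    "drop_rate": [0.0, 0.2, 0.4],
    "drop_connect_rate": [0.0, 0.1, 0.2],
}

def grid_search(space, objective):
    """Evaluate every combination in `space`, returning the best one."""
    keys = list(space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```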
Add a data augmentation layer.
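By a data augmentation layer I mean the standard CIFAR-style pad / random-crop / horizontal-flip pipeline; here is a NumPy sketch of it (inside a Keras model, layers such as `RandomFlip` and `RandomCrop` would play the same role):

```python
import numpy as np

def augment(image, rng, pad=4):
    """Standard CIFAR-style augmentation for one HxWxC image:
    reflect-pad, random crop back to the original size, and a
    random horizontal flip."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    cropped = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        cropped = cropped[:, ::-1]  # horizontal flip
    return cropped
```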
I have trained the model many times with many combinations, but I could not find an effective way to increase the validation accuracy.
CoAtNet Architecture:
Hyperparameters:
Our model was trained without stochastic depth, cosine learning-rate decay with warmup, gradient clipping, or EMA.
Model Evaluation in Paper:
Testing CoAtNet0
> Note that increasing the drop rate and drop connect rate raises the model's validation accuracy to at most 60-65%, after which it fluctuates or stays constant, much like the validation accuracy in this graph.
The Keras implementation I use for all testing:
The CoAtNet Paper:
[2106.04803] CoAtNet: Marrying Convolution and Attention for All Data Sizes (arxiv.org)