Top layers - transfer learning

Model setup: EfficientNetB4 (include_top = False).
The summary of the last few layers of this model shows that the final output is an 8 x 8 x 1792 feature map.
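For reference, a rough sketch of this setup in Keras (the 256 x 256 input size is an assumption; it is what yields the 8 x 8 x 1792 output):

```python
import tensorflow as tf

# Load EfficientNetB4 without its classification head.  With a 256 x 256 input
# the final feature map has shape (8, 8, 1792).
base_model = tf.keras.applications.EfficientNetB4(
    include_top=False,
    weights="imagenet",
    input_shape=(256, 256, 3),  # assumed input size
)
base_model.trainable = False  # freeze the pretrained weights to start

base_model.summary()  # prints the layer summary
```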

Now I need to add layers for classification on top of this. There are multiple possibilities. For example, I can add a global average pooling layer followed by dense layers, or I can first add a Conv2D layer to extract features further, followed by pooling and dense layers, etc. Whether I need to add a dropout layer or not is also an open question. I know that I would have to carry out some iterations with these settings on my dataset to decide what I should actually do, but I can’t carry out N iterations due to lack of time and resources. How do I develop the intuition to arrive at a fair enough setting of layers in, let’s say, N/4 iterations?
The problem statement is that of multi-label multi-class classification on medical images.

I don’t know of any rule of thumb or formula for this. Each case is different, and each case requires experimenting. With images, and especially with medical images, I think it is even more crucial to experiment.

Having said that, your N iterations (I guess you are referring to epochs on your training set) should include metrics. Your metrics can be a guide to determine how the fine-tuning is going. Also, you can add callbacks, for instance for early stopping, so that you don’t have to wait for the entire run. Those are two tools I would use.
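For example, a minimal early-stopping callback might look like this (the monitored metric name and the patience value are just placeholders):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop a trial early when the validation metric stops improving, so you
# don't have to wait for the entire run.
early_stop = EarlyStopping(
    monitor="val_auc",          # assumed metric name; use whatever you track
    mode="max",
    patience=2,                 # assumed patience
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=5, callbacks=[early_stop])
```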

Let’s see if other mentors chime in with other ideas.

Thanks @Juan_Olano !
By N iterations, I actually meant “trials”. So for example, in the first trial, I add a global average pooling layer followed by 2 dense layers. I run the training for 5 epochs with some callbacks and see what the result is on some metrics. In the second trial, instead of the global average pooling layer, I would first add a Conv2D layer, followed by global average pooling and dense layers, run the training for 5 epochs, and see the result on the same metrics. In this way, I would conduct N trials and see which combination of layers yields the best result.
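For illustration, the first two trial heads might look roughly like this in Keras (the dense widths, the Conv2D filter count, and the output activation are assumptions, not settled choices):

```python
from tensorflow.keras import layers, models

# Trial 1: global average pooling followed by two dense layers.
def build_gap_head(base_model, num_classes):
    x = layers.GlobalAveragePooling2D()(base_model.output)
    x = layers.Dense(256, activation="relu")(x)       # assumed width
    # softmax as discussed below; a multi-label setup may use sigmoid instead
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base_model.input, out)

# Trial 2: a 1x1 Conv2D first to condense the feature maps, then pooling and dense.
def build_conv_head(base_model, num_classes):
    x = layers.Conv2D(448, kernel_size=1, activation="relu")(base_model.output)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base_model.input, out)

# Each trial: compile, train for ~5 epochs with the same metrics and callbacks,
# then compare validation results across trials.
```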

Since you’re building a multiclass classifier, you’ll need to add more layers that terminate with a softmax layer at the output, right? What you’ve shown above ends in an 8 x 8 x 1792 3D tensor. So the simplest way to get to softmax would be to do Flatten, followed by a couple of Fully Connected layers, followed by softmax. 8 x 8 x 1792 = 114688 elements, so you could add FC layers with decreasing numbers of neurons:

114688 → 2500 → 500 → 100 → C

where C is the number of label classes. Then just try training that network with your dataset and see how well it does. Of course if you are doing Transfer Learning here, there are other knobs to turn w.r.t. where you “unfreeze” any of the pretrained layers. But you could start by only training your added layers just to get a starting point to get a sense for where you are. As Juan says, all this is going to require experimentation. You can try to be clever about subdividing the various search spaces to limit the number of experiments you have to do, but the search has to start somewhere.
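In Keras that head could be sketched roughly like this (the relu activations and the NUM_CLASSES placeholder are assumptions; base_model is the EfficientNetB4 backbone set up earlier):

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 4  # placeholder for C, the number of label classes

# Flatten the 8 x 8 x 1792 output (114688 values) and funnel it down through
# fully connected layers to the softmax output.
x = layers.Flatten()(base_model.output)   # base_model: EfficientNetB4(include_top=False)
x = layers.Dense(2500, activation="relu")(x)
x = layers.Dense(500, activation="relu")(x)
x = layers.Dense(100, activation="relu")(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs=base_model.input, outputs=outputs)
base_model.trainable = False  # start by training only the added layers
```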

Whether you’ll need dropout layers will depend on whether you end up having an overfitting problem, but you won’t know that until you get a complete network that is trainable and usable to make predictions. So I would save that question for the next phase of the investigation.

Thanks for the inputs @paulinpaloalto ! The point you made about training only the added layers first is a good one; it should give me some useful intuition.

I’m glad that the ideas sound relevant to your issue. Of course there’s no guarantee that the starting point will be a good enough solution for you, but you have to start somewhere and then figure out what direction to go from there.

@Juan_Olano , @paulinpaloalto after some iterations, below are my findings. There are also some additional questions I’d like your opinions on. Please have a look.

Findings:

  1. The following combination of layers on top of the base model gave better results than the other combinations (a Keras sketch is given after this list):
base_model output (shape=(None, 8, 8, 1792)) --> Conv2D(filters=448, kernel_size=1, strides=1) --> MaxPooling2D(pool_size=(2,2), strides=2) --> GlobalAveragePooling2D --> Flatten --> Dense(units=112) --> Dense(units=4)
  2. Another combination of layers (in which the MaxPooling2D layer is replaced by a Conv2D layer with 224 filters, kernel_size=2, and strides=2) gave similar results.
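A rough Keras sketch of the first combination (the activations are placeholders that the experiments above don’t pin down; the point-2 variant is noted in a comment):

```python
from tensorflow.keras import layers, models

# base_model: EfficientNetB4(include_top=False), output shape (None, 8, 8, 1792)
x = layers.Conv2D(448, kernel_size=1, strides=1, activation="relu")(base_model.output)
x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
# Variant from point 2: replace the MaxPooling2D line above with
# x = layers.Conv2D(224, kernel_size=2, strides=2, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Flatten()(x)                  # no-op here, since GAP already gives a 1-D vector
x = layers.Dense(112, activation="relu")(x)
outputs = layers.Dense(4)(x)             # output activation depends on the multi-label setup

model = models.Model(base_model.input, outputs)
```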

Questions:

  1. The above experimentation was performed on a sample of the original dataset, so I’m skeptical about whether these findings will carry over when training on the complete dataset.
  2. Talking about overfitting: while performing the experiments there were clear signs of it (training AUC and loss kept improving with each epoch while validation AUC and loss kept getting worse). But since the complete dataset is huge (around 500 thousand images), overfitting might not be a problem when training on it. So I’m unsure whether I should apply techniques to reduce overfitting based on the results of these experiments.
  3. For the Conv2D layers, the number of filters, the stride, and the kernel_size I’ve chosen are based purely on the objective of reducing the number of feature maps while keeping the number of parameters reasonable. How should I actually decide these values?

Hi, Harshit.

Nice work and great follow-up questions! It was a great idea to subset your data for the purposes of running your initial experiments. The point is that in the early stages you need to run lots of experiments, and you want them to be meaningful but not take too long. Just curious, what sizes did you choose out of the 500k samples for your training and validation sets? Also, I assume you randomly shuffled the data before subsetting, to make sure your subsets are statistically representative of the larger set. The evidence does look like overfitting, so adding more data is exactly one of the first things to try. You could just (say) double or quadruple the sizes of your subsets and try again to see if that has any effect on the overfitting, rather than going to the full 500k, which would make the training take a very long time. It makes more sense to try something like that first, before going to regularization.
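For instance, shuffling before subsetting can be as simple as this (the array names and subset size are just placeholders):

```python
import numpy as np

# Placeholders: in practice these would index the full ~500k samples.
image_paths = np.array(["img_000.png", "img_001.png"])   # assumed file list
labels = np.array([[0, 1, 0, 0], [1, 0, 0, 1]])          # assumed multi-label targets

rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(image_paths))   # shuffle once before taking a subset
subset_idx = perm[:50_000]                 # assumed subset size; double it to probe overfitting
subset_paths = image_paths[subset_idx]
subset_labels = labels[subset_idx]
```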

But the high order question to answer first is whether your network architecture seems like it’s powerful enough to get good accuracy. So what are the accuracy numbers you are seeing on your smaller subsets for training and validation sets?

For question 3), the goal is to get the accuracy you need at the minimum expense in terms of the resource costs (CPU, memory, and wall-clock time) of the training. So it looks like you’ve got a good starting point, and now the question is whether the accuracy looks as good as you need to satisfy whatever your goals are for the system.

Thanks for the valuable points @paulinpaloalto ! I will work on these and update the progress.