I wanted to follow up on this topic as I had similar trouble understanding the term “hidden unit” in this context, and I actually ended up asking ChatGPT for clarification.
I’ve tried to capture my understanding below and would appreciate any confirmation or corrections; I realise some of the following may be incorrect.
In my understanding, each layer of a CNN has a set of filters, each of which has size f x f x c (where c is the number of input channels) and its own learned weights.
The layer 1 volume shape (110 x 110 x 96) corresponds to the activations of layer 1 after the convolution has been applied, and each of the 96 channels corresponds to the activation map of one trained filter. I don’t think we’re told the size of the filters that were trained, but for the purposes of this example let’s assume 8x8. The change in size from 224 to 110 is, I think, the result of a stride of 2.
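For what it’s worth, here’s the arithmetic I used to sanity-check the 224 → 110 step. It uses the standard output-size formula out = floor((n + 2p − f) / s) + 1; the filter sizes and padding below are my assumptions (we only know the output shape), but it shows which combinations land on 110:

```python
# Standard conv output-size formula: out = floor((n + 2*p - f) / s) + 1
# The filter sizes and padding values below are my guesses, not from the lecture.
def conv_output_size(n, f, s, p=0):
    return (n + 2 * p - f) // s + 1

print(conv_output_size(224, f=8, s=2, p=0))  # 109 -- 8x8 filter, stride 2, no padding
print(conv_output_size(224, f=8, s=2, p=1))  # 110 -- 8x8 needs a pixel of padding to land on 110
print(conv_output_size(224, f=6, s=2, p=0))  # 110 -- or a 6x6 filter with no padding
```

So stride 2 roughly halves the spatial size, and the exact filter size / padding just nudges where it lands.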
So layer one has 96 filters of 8x8x3, each with its own trained weights.
Applying these filters to an input image produces the 96 channels of the output volume, one channel per filter, where each channel contains the features that filter has detected in the input image.
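As a quick sanity check of the shapes, here’s a minimal sketch in PyTorch (purely for illustration; the 8x8 / stride 2 / padding 1 hyperparameters are my assumptions, not necessarily what the lecture’s network uses):

```python
import torch
import torch.nn as nn

# Hypothetical layer-1 convolution: 96 filters of 8x8x3, stride 2, padding 1.
# Only the output shape (110 x 110 x 96) is known; the rest is assumed.
conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=8, stride=2, padding=1)

image = torch.randn(1, 3, 224, 224)   # one 224x224 RGB input image
activations = conv1(image)            # layer-1 activation volume

print(conv1.weight.shape)   # torch.Size([96, 3, 8, 8])    -- one 8x8x3 set of weights per filter
print(activations.shape)    # torch.Size([1, 96, 110, 110]) -- one 110x110 channel per filter
```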
If the above is correct, then I think that “hidden unit” refers to one filter and its weights, which is essentially a single neuron that is activated by particular arrangements of pixels across the three colour channels.
What’s unclear to me is how that intuition translates to the later layers in terms of the visualisation process shown. I suspect there is some clever “unwinding” going on to recover the image patches shown for the later layers, because the filters in those layers detect more complex structures built from features found in earlier layers, rather than direct arrangements of pixels.
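For layer 1 at least, I think the patch recovery could just be receptive-field arithmetic: find where one filter’s channel fires most strongly and cut out the corresponding window of the input. A rough sketch (same assumed hyperparameters as above, and again only illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical layer-1 conv with the same assumed hyperparameters as before.
stride, pad, fsize = 2, 1, 8
conv1 = nn.Conv2d(3, 96, kernel_size=fsize, stride=stride, padding=pad)

image = torch.randn(1, 3, 224, 224)
activations = conv1(image)              # shape (1, 96, 110, 110)

unit = 42                               # one hidden unit = one filter = one channel
channel = activations[0, unit]          # that unit's 110x110 activation map
row, col = divmod(channel.argmax().item(), channel.shape[1])

# Map the strongest activation back to the input window that produced it.
top = max(row * stride - pad, 0)
left = max(col * stride - pad, 0)
patch = image[0, :, top:top + fsize, left:left + fsize]
print(patch.shape)   # torch.Size([3, 8, 8]) for interior locations (edge hits can be slightly smaller)
```

For the deeper layers that simple mapping clearly isn’t enough on its own, hence my suspicion about the unwinding through the earlier conv / pooling stages.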
I hope that a) makes sense and b) helps. Also, as mentioned above, I’d appreciate corrections / clarifications from the course tutors.
Thanks for such an awesome course BTW!