I think that, in contrast to the dense neural network, the convolutional autoencoder is not really limiting the number of parameters, but actually increasing them.
We have an input of 28x28x1, and in the next layer the dimensions are already 28x28x64 (the graphic is a bit misleading).
So it is not a bottleneck? It is more like a feature extraction? Or is this only the case in this example?
Maybe the better performance is due to some form of overfitting and memorizing the images?
That 28x28x64 doesn't mean all of those values are trainable parameters; that is the size of the output activations. Only the kernels of the convolutions have trainable parameters.
Yes, you are right. But when I look at the model summary, the bottleneck has by far the most trainable parameters of all the layers. The logic seems to be different from the fully connected network.
The principle is different. One could say that in the downsampling path you extract features continuously (with more and more filters), so the number of parameters increases.
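To make that concrete, here is a quick sketch of where the trainable parameters actually live. The 3x3 kernel size is my assumption, not necessarily what the course model uses:

```python
from tensorflow.keras import layers, models

# A single conv layer on a 28x28x1 input, like the first encoder stage
m = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
])
m.summary()
# Trainable params: 3*3*1*64 kernel weights + 64 biases = 640.
# The 28x28x64 output is 50,176 activation values, but none of
# those are trainable parameters; only the kernels are.
```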
Just to jump in here, since I have the same question…
As was mentioned, in the dense autoencoder the bottleneck was a dense layer with 32 neurons. That means if I feed a 28x28x1 (784-pixel) image into the encoder, it produces a 1x32 vector as the latent representation, which is then fed into the decoder network to reproduce the image:
(28x28) → encoder → (1x32) → decoder → (28x28)
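In code, I picture the dense version roughly like this (the 32-unit bottleneck is from the lesson; the hidden layer sizes are just my guess):

```python
from tensorflow.keras import layers, models

dense_ae = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Flatten(),                         # 784 values
    layers.Dense(128, activation="relu"),     # assumed hidden layer
    layers.Dense(32, activation="relu"),      # the 32-d latent vector
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
    layers.Reshape((28, 28, 1)),              # back to 28x28
])
dense_ae.summary()  # the bottleneck output shape is (None, 32)
```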
Now, with the CNN version, the bottleneck layer of the encoder network is a Conv2D layer with 256 filters that takes in the 7x7 output of the previous MaxPool2D layer. What is the dimensionality of the latent representation in that case? Wouldn’t it be 7x7x256, or 12,544?
(28x28) → encoder → (7x7x256) → decoder → (28x28)
For visualization purposes that 7x7x256 is then collapsed using an additional Conv2D(1) layer to produce a 7x7 image. But in contrast to the 32 dimensional vector in the dense network, this 7x7 = 49 image is not the actual latent representation, is it?
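Here is how I read the encoder shapes; the filter counts before the final 256 are my assumptions, not necessarily the course's exact model:

```python
from tensorflow.keras import layers, models

encoder = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),    # 14x14x64
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),    # 7x7x128
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
])
encoder.summary()
# Final output shape: (None, 7, 7, 256), i.e. 12,544 values.
# A Conv2D(1, ...) on top would collapse this to 7x7 purely for
# visualization; the actual latent tensor is the full 7x7x256.
```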
In terms of bits of information, the input has 28x28x16 = 12,544 bits, while the latent tensor (assuming 16-bit floats) has on the order of 7x7x256x16 = 200,704 bits. That is more than the input by a factor of 16, not less; not exactly an information bottleneck.
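A quick back-of-the-envelope check of those numbers (assuming 16-bit values for both the input pixels and the latent floats):

```python
input_bits  = 28 * 28 * 1 * 16     # 12,544 bits
latent_bits = 7 * 7 * 256 * 16     # 200,704 bits
print(latent_bits / input_bits)    # 16.0, so the latent is 16x larger
```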