Difference between upsampling feature dimensions in a transformer vs. increasing the number of features using a linear layer

I recently stumbled upon an article on TransformerFTC, where they discuss downsampling features and upsampling them again using a funnel-like architecture.

My question is: what is the difference between changing the number of embedding dimensions between blocks vs. changing the number of dimensions using methods such as nn.Linear? Which approach is better, and why? I tried asking ChatGPT, but I couldn't get an intuitive answer.

hi @nikhilsos

are you talking about the bottleneck in the autoencoder visualisation, where we reduce the dimension and then increase it again to see whether the model learned anything about the features at varied dimensions?

Can I know if this query is related to any course, just to be specific?

Regards
DP

Yes, it is about reducing the dimension and then increasing it again to see whether the model learned anything about the features at varied dimensions.

But after studying the paper in more detail, my understanding is that the reduction in feature dimension is done not by the feedforward networks but by max/mean pooling. For the decoder part, nn.Upsample is used in 'nearest' mode.

I fail to understand the difference. What is the benefit of changing the dimension of the features using pooling and nn.Upsample versus using an FFN?
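To make the comparison concrete, here is a minimal sketch (not the paper's actual code; shapes and layer sizes are my own assumptions) of the two options in PyTorch. Pooling and nearest-neighbour upsampling are fixed, parameter-free operations, while nn.Linear learns a weighted mapping between dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical hidden states: (batch, seq_len, d_model)
x = torch.randn(2, 8, 16)

# Option 1: parameter-free. AvgPool1d halves the last dimension by
# averaging neighbouring values; Upsample in 'nearest' mode restores
# the size by duplicating values. Nothing is learned here.
pool = nn.AvgPool1d(kernel_size=2)
up = nn.Upsample(scale_factor=2, mode="nearest")
restored = up(pool(x))            # back to (2, 8, 16)

# Option 2: learned. Two linear layers shrink and re-expand the
# feature dimension; every output is a trained mix of all inputs.
lin_down = nn.Linear(16, 8)
lin_up = nn.Linear(8, 16)
learned = lin_up(lin_down(x))     # also (2, 8, 16)

n_params = sum(p.numel() for p in [*lin_down.parameters(),
                                   *lin_up.parameters()])
print(restored.shape, learned.shape, n_params)
```

Both paths produce the same output shape, but the pooling/upsample path adds zero parameters and zero compute for the resizing itself, whereas the linear path adds weights that must be trained. That trade-off (cheap fixed resizing vs. a learned but costlier projection) is, as far as I can tell, the practical difference being asked about.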

I am fairly new and have a limited understanding, so sorry if I failed to articulate it better. Here's the paper if anyone is interested.

55820361.pdf (205.2 KB)