Hello, I’m having some difficulty understanding the detailed description of the U-Net encoder stage (the contracting path):
“The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 unpadded convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.”
Conceptually, I believed that the pooling layers alone were responsible for downsampling the volumes, but the text above refers to unpadded 3 x 3 conv layers, which would also shrink the spatial dimensions (each one trims one pixel from every border, so height and width each drop by 2). On top of that, the exercise goes on to use “same” padding, which seems to contradict the quoted statement.
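To make the size arithmetic concrete, here is a minimal sketch in plain Python of how I understand the spatial size to change through one encoder block. The function names and the 572 input size are my own choices for illustration (they are not taken from the exercise), and the sketch assumes square inputs, stride-1 convolutions, and no dilation.

```python
def conv_output_size(size, kernel=3, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "valid":
        # Unpadded: the kernel must fit entirely inside the input.
        return (size - kernel) // stride + 1
    if padding == "same":
        # Zero-padded so that, with stride 1, output size equals input size.
        return -(-size // stride)  # ceil(size / stride)
    raise ValueError(f"unknown padding: {padding!r}")


def pool_output_size(size, window=2, stride=2):
    """Spatial output size of an unpadded max pooling along one dimension."""
    return (size - window) // stride + 1


def encoder_block(size, padding):
    """Two 3 x 3 convolutions followed by a 2 x 2 max pool with stride 2."""
    size = conv_output_size(size, padding=padding)
    size = conv_output_size(size, padding=padding)
    return pool_output_size(size)


valid_size = encoder_block(572, "valid")  # 572 -> 570 -> 568 -> 284
same_size = encoder_block(572, "same")    # 572 -> 572 -> 572 -> 286
print(valid_size, same_size)
```

If this is right, both variants roughly halve the size in each block through pooling, but only the unpadded version also loses border pixels in the convolutions.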
Is this possibly a typo, or am I misunderstanding something crucial?
Thanks,
Dylan