Question about auto-encoder visualization

The instructor's code for the auto-encoder visualization:

import tensorflow as tf

def bottle_neck(inputs):
    '''Defines the bottleneck.'''
    # 256-channel bottleneck feature map
    bottle_neck = tf.keras.layers.Conv2D(filters=256, kernel_size=(3,3), activation='relu', padding='same')(inputs)
    # 1-channel side output, used only to visualize the bottleneck as an image
    encoder_visualization = tf.keras.layers.Conv2D(filters=1, kernel_size=(3,3), activation='sigmoid', padding='same')(bottle_neck)

    return bottle_neck, encoder_visualization
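
For reference, a quick way to check the shapes these two layers produce (a sketch assuming the bottleneck receives a 7x7 feature map from earlier downsampling; the 128-channel depth here is only an assumption, not the course's exact value):

# Call bottle_neck() on a dummy 7x7 feature map to see what comes out.
dummy = tf.keras.Input(shape=(7, 7, 128))   # assumed pre-bottleneck shape
bn, vis = bottle_neck(dummy)
print(bn.shape)    # (None, 7, 7, 256) -- fed to the decoder
print(vis.shape)   # (None, 7, 7, 1)   -- what gets plotted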

My question is: why use filter=(3,3) in the visualization layer? This does some pixel averaging. If we used filter=(1,1), there would be no averaging at all, and it would display the exact picture…
Thank you

1 Like

I am very sorry, but thinking about those figures I have a more important question; maybe I do not understand the dimension concepts of CNNs and encoders well. This is the code for the bottleneck and the visualization picture:
bottle_neck = tf.keras.layers.Conv2D(filters=256, kernel_size=(3,3), activation='relu', padding='same')(inputs)
encoder_visualization = tf.keras.layers.Conv2D(filters=1, kernel_size=(3,3), activation='sigmoid', padding='same')(bottle_neck)

So the bottleneck output is actually (7, 7, 256), but we are plotting (7, 7, 1). What we are plotting is not really the encoder content but an averaged version of it, condensing 256 numbers into 1 number. Then why are we plotting it?

Also, intuitively, an encoder's output should be smaller than its input. The input is (28, 28, 1), which is 784 numbers. The bottleneck output is 7x7x256 = 12544 numbers. So the encoder output >> input, which disagrees with my intuition of what encoding means. I would appreciate a comment on where my understanding goes wrong.
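
In code, the counts I am comparing are simply (just restating the numbers above):

input_values = 28 * 28 * 1        # 784 values in the input image
bottleneck_values = 7 * 7 * 256   # 12544 values in the bottleneck output
print(input_values, bottleneck_values)   # 784 12544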
Thank you very much.

1 Like

@Dennis_Sinitsky

It would be more helpful if you included the timestamp of the video whenever your query refers to something the instructor said, so mentors can look directly at the part you are mentioning.

regards
dp

1 Like

Hi Deepti,
thanks for your reply.


Question 1 is about why filter=(3,3) is used and not (1,1). And, in general, does encoder_visualization really represent bottle_neck? Because bottle_neck has 256 filters while the visualization has only 1 filter.

Question 2 is about the number of values representing the initial image.


The input number of pixels is 28x28 = 784. The number of "pixels" out of the bottleneck is 7x7x256 = 12544. But an "encoder" should zip the data, i.e. it should have fewer "pixels" than the input; an encoder is a device which "shrinks" data. Where is my logic going wrong?
Thank you!

1 Like

Hello @Dennis_Sinitsky

Filters represent the number of output channels after the convolution has been performed, while the kernel represents the size of the convolution filter used to perform the convolution on the image.
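
A minimal sketch of that distinction (not the course code; the shapes are only illustrative):

import tensorflow as tf

x = tf.random.normal((1, 7, 7, 256))   # one 7x7 feature map with 256 channels
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3), padding='same')
print(conv(x).shape)      # (1, 7, 7, 1): filters sets the number of output channels
print(conv.kernel.shape)  # (3, 3, 256, 1): kernel_size sets the spatial extent of each filter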

So the reason a filter count of 1 has been used for the encoder visualisation is that it is decoding the encoded layer back to the same number of channels as the input, which has 1 channel.

No, it does not represent the bottleneck exactly; it is a compressed representation of the bottleneck (a compressed form of the image).

For question 2

So if you notice, the input image representing 8 is actually 255 pixel values, and with noise it is 256. The input layer, where it mentions (28, 28, 1), is basically the input for the encoder, so the bottleneck comes to the original shape of 256, where it is again decoded to get the reconstructed input with 1 filter.

Please feel free to ask if you have any doubts.

Regards
DP

1 Like

Hi Deepti,
I am very sorry, I mistyped. I meant to ask why kernel=(3,3) is used in the visualization layer (not filter). Why wouldn't the kernel be (1,1)?

1 Like


In this case, the visualization does not make that much sense to me, because it does not really represent the bottleneck output; it represents some kind of compression of the bottleneck output. We recreate (decode) the image not from the visualization data, but from the bottleneck output data…

1 Like


And finally, I am sorry but I still do not understand the issue of the amount of data along the encoder pipeline. You say that there are 255+1 = 256 pixels. I thought there are 28x28 = 784 pixels, each pixel represented by a number from 0 to 255 (that is why we divide by 255, to make sure each pixel intensity is between 0 and 1 in a normalized image). So I still do not understand how 28x28x1 numbers between 0 and 1 get "shrunk" (or "encoded") into the 7x7x256 numbers of the bottleneck output. Also, as you can see in the diagram from the course which you also pasted, the 28x28x1 data is passed to 28x28x64 data. I understand 64 > 1, so it is elongated along the horizontal direction. But 28x28 remains the same, so why is the shape shrunken along the two other dimensions, as shown by the red arrows?

1 Like

Hello @Dennis_Sinitsky

Dennis, if you are asking why a kernel size of (1,1) was not used in the convolution layer, the only logic I can apply here is that the size of the convolutional kernel affects the performance of the convolutional neural network. Kernel size is a hyperparameter, and therefore by changing it we can increase or decrease performance.

Another reason for using a different kernel size in the convolution layer than in the input layer is to allow the architecture to learn from different patterns of filter selection.

Laurence Moroney explains why the number of units in the encoder needs to differ from the input to the output: otherwise we could have a straight pass-through from the inputs to the outputs, and no pattern to represent them would be learned.

If you ask me a question like "can I use a kernel size of (1,1)?", then surely you can, but it will have an effect on the performance of the model. The choice of kernel size is based on getting better model performance by trying different layer shapes.
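
For example, a rough comparison of the two choices (a sketch, not the course code): both kernel sizes give a (7, 7, 1) visualization, they just differ in how much spatial context each output pixel sees.

import tensorflow as tf

feat = tf.random.normal((1, 7, 7, 256))   # stand-in for the bottleneck output

vis_3x3 = tf.keras.layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(feat)
vis_1x1 = tf.keras.layers.Conv2D(1, (1, 1), activation='sigmoid', padding='same')(feat)

print(vis_3x3.shape, vis_1x1.shape)   # both (1, 7, 7, 1)
# A (1,1) kernel only mixes the 256 channels at each pixel;
# a (3,3) kernel also blends each pixel with its 3x3 spatial neighbourhood.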

Regards
DP

1 Like

How can you say that it does not make much sense? If you go by the understanding of what encoder visualisation is in relation to the bottleneck, visualisation cannot be accomplished directly from the encoder layers to a latent representation of the image; it has to follow a pattern of being able to learn patterns beyond the original image while also maintaining the features of the original image. Remember, if we compressed an image directly, there would be a chance of losing some of its features, but this way of pattern selection from input to encoder to decoder helps the model find different patterns in the structure of the image as well as maintain the features of the original image, so each layer holds important significance individually as well as holistically when encoded.

Regards
DP

1 Like

Input shape is the dimension of the input data. For example, if you have image data, a Keras model needs an input_shape of (height, width, num_channels); if you feed a model an input of (3, 1), the model will learn dependencies between three consecutive elements.

When I mentioned 256, I was referring to the pixel value of the original image, while 28 x 28 is the image height and image width.

A CNN will learn features of the images that you feed it, with increasing levels of complexity. These features are represented by the channels.

So the deeper you go into the network, the more channels you have representing these complex features. Hence you see a larger number of channels in the bottleneck compared to the input.
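
A minimal sketch of that idea (the filter counts and pooling choices here are illustrative assumptions, not the course's exact encoder): the spatial size shrinks while the channel count grows as you go deeper.

import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)   # (28, 28, 64)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)                                         # (14, 14, 64)
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)       # (14, 14, 128)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)                                         # (7, 7, 128)
x = tf.keras.layers.Conv2D(256, (3, 3), activation='relu', padding='same')(x)       # (7, 7, 256)
tf.keras.Model(inputs, x).summary()   # prints the shapes noted above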

Regards
DP

1 Like

Hi Deepti
Thanks for your reply.
Regarding the visualization question, after reading your post and thinking a bit more, I understand that the bottleneck output cannot be directly "visualized", so it has to be post-processed into some kind of image to feed to our eyes. So it is not exactly the encoder output (or bottleneck output, if the bottleneck is considered the last stage of the encoder); it is a somewhat convolved version of that output. No problem, and thank you.
I still hope to understand the "encoding" part of the question. Again, there are 7x7x256 neurons in the bottleneck, which makes its contents a 7 * 7 * 256 ≈ 12.5K-element array. But the data is 28x28x1, a 784-element array. For normalized images, the numbers in the input image are between 0 and 1; the numbers in the bottleneck are also continuous, maybe even negative or larger than 1. So it does not feel like we are squeezing (encoding) information. Yes, 28x28 goes to 7x7, so 28 shrinks to 7; but we have only one 28x28 image and 256 7x7 feature maps in the bottleneck. I mean, two hundred and fifty-six 7x7 arrays hold more information than one 28x28 array. So how can it be called "encoding"? This is my major misunderstanding, and I still do not see the answer in your explanation.
As for the picture (another small question), why is the blue dimension (28) drawn longer than the green dimension (which is also 28)?
[image: encoder architecture diagram from the course]

Thank you
Dennis

1 Like

@Dennis_Sinitsky

If you go through the Week 2 videos again, this is explained pretty straightforwardly: images are encoded in a way that maintains all the features while being compressed, so the model can detect any new patterns.

Here the image is being encoded from the perspective of maintaining all the features while scaling, first from the input 3D image to a 2D convolution layer, and then continuing to compress this into a more compressed feature image, so that the model can learn more detailed patterns and encode any new pattern in the compressed-form image.

Of course the 256 x 7 x 7 array can hold more information than the 28 x 28 array, as this is the whole point of what the encoding is trying to do.

You can ask as many questions as you like and I am happy to answer, but also make sure to go through the videos again when in doubt, as this might help you.

The 28 x 28 input is a 3-D image, whereas the Conv2D output is 2-dimensional; hence the image shown for the Conv2D layer looks compressed.

You can ask if you have more doubts.

Regards
DP

Hi DP:
Thank you for your replies. For the CNN diagram picture, I think I understand now.
28x28x1 is the input image, where every number is an actual pixel, and 28x28x64 is processed numbers, not actual pixels. So whether 28 pixels are drawn with the same length in the picture as 28 numbers is like comparing apples and oranges; they cannot be compared.
Dennis

1 Like

Dennis, even this 28x28x64 is a pixel image, but a 2D one, whereas the 28x28x1 pixel image is a 3-dimensional image, so in the diagram it has been drawn at a smaller size for presentation purposes.

1 Like

Thanks, Deepti. I see what you mean. And I think you see what I mean.
Maybe it is a convention in the ML literature to draw it like this for the image input. But to me, 28 is 28… so if I focus on this detail, it is a little confusing :smile:
Anyway, thank you very much for your comments.
DS

2 Likes