In the first assignment, there is a note that says: "A convolution extracts features from an input image by taking the dot product between the input data and a 3D array of weights (the filter)." What is your take on this? It does not make sense to me. For example, if I have input data of 4x4x4 and a filter of 3x3x4, they can't take a dot product with each other, even if I disregard the number of channels (4).
Prof Ng gives very concrete examples of how convolutions work in the lectures in Week 1. Have you been through the lectures? Remember the "vertical edge detector" example? The point is that the detection doesn't happen because of one single "step" of the filter across the image. It is the net result of all the steps that shows the edge.
Also note that your example of a 4 x 4 input is not realistic. How much information is there in a 4 x 4 image? Not much, right? Try downsizing a 64 x 64 image to even 8 x 8 and then take a look at it. Pretty hard to see a cat, right?
Thanks for answering so quickly! Yes, I have watched all the lectures in Week 1. I guess I was not clear enough with my question. In Prof Ng's lecture, he mentions that in one step of a convolution, we take the element-wise product of the filter and a slice of the image, so it should not be a dot product. That is what confused me: an element-wise product is definitely not the same as a dot product. If I got it wrong, please correct me!
Well, you're right that it's not implementable as a straight `np.dot` call, but think about what the operation does at each step or position of the input: it multiplies the corresponding elements individually and then adds them all up to produce a scalar value. Kind of sounds like what a "dot product" does in the general sense, right? That's all he means by saying that.
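To make that concrete, here is a minimal NumPy sketch (the names `a_slice` and `W` are illustrative, not from the assignment): one "step" of the convolution is an element-wise multiply followed by a sum, and that is numerically identical to a dot product of the two arrays flattened into vectors.

```python
import numpy as np

# Illustrative shapes: a 3x3x4 slice of the input and a 3x3x4 filter,
# as in the 4x4x4-input example above.
rng = np.random.default_rng(0)
a_slice = rng.standard_normal((3, 3, 4))  # one window of the input volume
W = rng.standard_normal((3, 3, 4))        # the filter

# What the lecture describes: element-wise product, then sum everything.
step_via_sum = np.sum(a_slice * W)

# The same value, expressed as a 1D dot product of the flattened arrays.
step_via_dot = np.dot(a_slice.ravel(), W.ravel())

print(np.isclose(step_via_sum, step_via_dot))  # prints True
```

So the note in the assignment is using "dot product" in this generalized sense: multiply corresponding entries, then sum to a scalar.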