I believe the answer is no, but I’d like to have someone with a better understanding than me confirm.
The order of flattening of a multi-dimensional matrix determines the linear order of the resulting 1 x n array. That is, there are multiple ways one can flatten a 2D array.
For example:
Going from left to right, then from top to bottom, then across channels.
Going from right to left, then from bottom to top, then across channels.
Going down from the top left, then back up the next column over, then across channels.
Each produces a different linear ordering of the pixels.
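For concreteness, here is a small NumPy sketch of my own (not from the course) showing three such unrolling patterns applied to the same 2D array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2],
                                 #  [3, 4, 5]]

print(a.flatten(order="C"))      # left to right, then top to bottom: [0 1 2 3 4 5]
print(a.flatten(order="F"))      # top to bottom, then left to right: [0 3 1 4 2 5]
print(a[::-1, ::-1].flatten())   # right to left, then bottom to top: [5 4 3 2 1 0]
```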
Furthermore, the flattening order determines whether spatially close information (pixels near other pixels) ends up close together or far apart in the linear order of the flattened array. Therefore the flattening operation loses this spatial context, right? (There is still spatial context encoded by the prior convolutional layers.)
This… doesn’t actually matter, right? As long as the layers are fully connected, every input connects to every neuron in the next layer, right? The weight matrices are the same, just with their entries in a different order, correct?
Am I also correct that the flattening operation loses some spatial information, but that the FC layers don’t really care about that spatial information anyway?
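As a sanity check of that intuition, here is a quick NumPy sketch (my own, with made-up sizes): permuting the flattened input and permuting the columns of a dense layer’s weight matrix in the same way gives exactly the same output, so the choice of unrolling order only reorders the weights the network has to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)          # a flattened input vector
W = rng.normal(size=(4, 16))     # weights of a dense layer with 4 units
b = rng.normal(size=4)

perm = rng.permutation(16)       # some other flattening order

y1 = W @ x + b                   # original flattening order
y2 = W[:, perm] @ x[perm] + b    # reordered input, correspondingly reordered weights

print(np.allclose(y1, y2))       # True
```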
Even though the flattening separates some adjacent pixels, the geometric relationships can still be learned. Remember that we also flatten 2D images in the plain vanilla FC network case and it can still learn to recognize the spatial relationships. The point is that it can recognize them in any of the unrolling patterns, because the behavior is learned. That’s the point. What matters is only that you are consistent in how you do the unrolling in all cases: you decide which method you are going to use and then you use that one method everywhere.

Here’s a thread which discusses this purely from the point of view of the Fully Connected networks in DLS C1. Make sure to read all the way through the thread to see the discussion about the different possible unrolling orders. I claim that the same reasoning applies to the case of FC layers at the end of a series of Conv and Pooling layers: back prop will still “connect” across the “flatten” step using the flatten method that you chose.
This made me rethink the question. As far as information loss is concerned, I think the flattening operation is a lossless transformation, because we can convert it back.
Flattened data does not carry the information about how the flattening was performed, and it is not a spatial representation.
An FC layer can learn spatial information. For example, we can do digit recognition with just FC layers.
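For instance, a minimal Keras sketch of that (TensorFlow assumed; the layer sizes and epoch count are arbitrary choices of mine) would be a plain FC network on flattened 28x28 MNIST digits:

```python
import tensorflow as tf

# Digit recognition with only Flatten and Dense (FC) layers -- no conv layers at all.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # unroll each image into a 784-vector
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```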
I am editing one of my above replies. Thanks @paulinpaloalto
Right! It’s not that the spatial information is destroyed in the flattening, as long as you remember how the flattening was done. As you say, it’s reversible. It’s just translated into a form that is not useful to us humans with our 3D vision system, but the algorithm can apparently learn to figure it out. Of course it requires that we use a consistent method for the flattening: either “F” or “C” order works, as long as you are consistent.
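To illustrate the reversibility point, a small NumPy sketch of my own (“C” and “F” are NumPy’s row-major and column-major orders): flattening and reshaping with the same order gives back exactly the original volume, while mixing the two orders is what scrambles the layout.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)             # a small h x w x c activation volume

for order in ("C", "F"):
    flat = x.flatten(order=order)              # forward flatten
    back = flat.reshape(x.shape, order=order)  # invert with the SAME order
    assert np.array_equal(back, x)             # lossless round trip

mixed = x.flatten(order="C").reshape(x.shape, order="F")  # inconsistent orders
print(np.array_equal(mixed, x))                # False
```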
To extend this discussion a bit, which will probably go partly outside of the OP’s question:
Flattening + FC is equivalent to a Conv layer that uses a kernel size equal to the size of the Flattening’s input (see the sketch below).
Comparing that Conv layer with another Conv that uses a small kernel size (e.g. 5x5), the output of the former is less localized than that of the latter.
Therefore, the output of that FC layer starts to miss local characteristics.
My points above may go outside of the OP’s question because the OP focuses on the change before and after the Flattening, whereas mine focus on the FC layer. Just trying to extend this discussion, and the OP is welcome to stick to their original question.
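Here is a Keras sketch of that equivalence (my own, with an arbitrary 7 x 7 x 64 input shape and 128 units): a Dense layer applied to the flattened volume has exactly the same number of parameters as a Conv2D whose kernel covers the whole 7 x 7 input, and the latter’s output is just a 1 x 1 x 128 volume.

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(7, 7, 64))        # e.g. the output of the last pooling layer

# Flatten + FC with 128 units
fc = tf.keras.layers.Dense(128)(tf.keras.layers.Flatten()(inp))

# Conv whose kernel spans the entire 7x7 input -> output shape is (1, 1, 128)
conv = tf.keras.layers.Conv2D(filters=128, kernel_size=(7, 7))(inp)

m1 = tf.keras.Model(inp, fc)
m2 = tf.keras.Model(inp, conv)
print(m1.count_params(), m2.count_params())   # both 7*7*64*128 + 128 = 401536
```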
That’s a good point and a nice way to see it. Then I think the higher level point is that the architecture of the network transitions to the FC mode at the point where the conv layers have distilled as much information as they can from the local relationships and now it’s time to take all that distilled information and identify what it is we are looking for. The normal pattern through the earlier Conv and Pooling layers is that we reduce the h and w dimensions of the “image” and grow progressively more channels that are encoding various “derived” characteristics. The other way to get there would be that the h and w dimensions reduce to 1 x 1 and then the channels are the equivalent of an FC layer. Maybe we can conjecture that the folks like Geoff Hinton, Yann LeCun and Andrew and their various grad students and collaborators have experimented with various architectures and decided that stopping before the 1 x 1 conv point and flattening to FC earlier is more effective in general?
Probably so. I believe they grow the networks progressively and stop where they need to stop. For example, if a cat’s face has 10 determining features and each normally takes up less than 1/5 of the size of the image, then it would be reasonable for the shrinking process to stop once it is able to pick out localized features of about 1/5 of the size of the image.
Certainly we don’t judge that number (1/5) ourselves, but it is reasonable to believe that each class has enough determining features, each taking up some fraction of the image, so that the size of any one determining feature should not be too large?
I am correlating the stopping point of that shrinking process with the typical size of the determining features.
No need; this is an interesting discussion, and the thread that @paulinpaloalto linked showed empirically that spatial context is not lost given a consistent flattening technique.