We had two doubts. Can you please help clarify?
1. How does max pooling help capture spatial invariance (for example, when the object appears in different positions in the input image, like the left corner or the bottom corner)?
2. Why does max pooling need conv output as its input? Can we apply max pooling directly to the image instead of to the output of the conv operations?
3. Why is max pooling better than average pooling?
That’s 3 doubts, but whatever.
Where does it say anything about “spatial invariants”? I’m not sure what you mean by that. Note that pooling works the same way that conv filters do: they operate on “patches” of the image, meaning that the behavior is “localized” by definition.
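To make the "patches" point concrete, here is a minimal sketch in plain Python. The helper `max_pool2d` and the toy 4x4 inputs are made up for illustration and are not from the course code. It assumes a 2x2 window with stride 2 and no padding. A bright pixel that moves within one pooling window gives the same output, but once it moves to a different window the output changes, so whatever shift tolerance a single pooling layer gives you is local.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling over a 2D list of numbers (no padding). Illustrative helper."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [max(image[r + i][c + j] for i in range(size) for j in range(size))
         for c in cols]
        for r in rows
    ]

# One bright pixel at (0, 0).
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

# Same pixel shifted to (1, 1): still inside the top-left 2x2 window.
b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

# Same pixel moved to (3, 3): now inside the bottom-right window.
c = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 9]]

pooled_a = max_pool2d(a)
pooled_b = max_pool2d(b)
pooled_c = max_pool2d(c)

print(pooled_a == pooled_b)  # small shift inside one window: same output
print(pooled_a == pooled_c)  # shift to another window: different output
```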
There is no technical reason why you can’t apply pooling directly to the input images, but what would be the point of that? It’s equivalent to just “downsampling” the input images. Why don’t you just start with smaller (lower resolution) images? That would be a one-time cost paid before training, instead of a cost you incur in every iteration of training.
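As a rough sketch of the downsampling point (again plain Python with a made-up `max_pool2d` helper and a toy dataset, assuming a 2x2 window with stride 2): pooling a raw 4x4 image just yields a 2x2 image, and since that step has no trainable parameters you could run it once over the dataset before training rather than inside the network on every iteration.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling over a 2D list of numbers (no padding). Illustrative helper."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [max(image[r + i][c + j] for i in range(size) for j in range(size))
         for c in cols]
        for r in rows
    ]

# Hypothetical toy "dataset" containing a single 4x4 grayscale image.
dataset = [
    [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]],
]

# One-time preprocessing: halve the resolution of every image up front.
small_dataset = [max_pool2d(img) for img in dataset]

print(small_dataset[0])  # a 2x2 image
```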
Where does it say that max pooling is uniformly better than average pooling? As with just about everything here, it is situational: sometimes max is better and sometimes average is better. Maybe the better way to frame the question is: “how do I know in a given situation whether max or average pooling is the best choice?” I don’t know a definitive answer to that question, but my guess is that the answer is you try both and see which works better.
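To see how the two can differ, here is a small sketch in plain Python. The `pool2d` and `mean` helpers and the toy feature map are hypothetical, and the window is assumed to be 2x2 with stride 2. Max pooling keeps the strongest activation in each patch, while average pooling summarizes the whole patch. In this toy input, a patch with one strong activation and a patch of uniformly weak activations end up with the same average but different maxima. Which summary helps more depends on the task, which is why trying both is reasonable.

```python
def pool2d(image, reduce_fn, size=2, stride=2):
    """Apply reduce_fn to each size x size patch of a 2D list (no padding)."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [reduce_fn([image[r + i][c + j]
                    for i in range(size) for j in range(size)])
         for c in cols]
        for r in rows
    ]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Toy feature map: top-left patch has a single spike (8),
# bottom-left patch is uniformly 2.
feature_map = [[0, 0, 1, 1],
               [0, 8, 1, 1],
               [2, 2, 0, 0],
               [2, 2, 0, 4]]

max_out = pool2d(feature_map, max)
avg_out = pool2d(feature_map, mean)

print(max_out)  # spike patch and uniform patch get different values
print(avg_out)  # spike patch and uniform patch get the same value
```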