Where does it say anything about “spatial invariants”? I’m not sure what you mean by that. Note that pooling layers work the same way that conv filters do: they slide a window over the image and operate on one “patch” at a time, meaning that the behavior is “localized” by definition.
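To make the “patches” point concrete, here is a minimal sketch in plain Python. The helper name `pool2d` and the toy 4x4 image are made up for illustration, and it assumes non-overlapping windows (stride equal to the window size). Changing a single pixel only changes the one output value whose patch contains it.

```python
def pool2d(image, size, reduce):
    """Apply `reduce` to each non-overlapping size x size patch of a 2D list.

    Hypothetical helper for illustration; assumes stride == size.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values of one local patch, then reduce them to one number.
            patch = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce(patch))
        out.append(out_row)
    return out


image = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
]

before = pool2d(image, 2, max)
image[0][0] = 99  # change one pixel inside the top-left patch
after = pool2d(image, 2, max)

print(before)  # [[4, 0], [0, 8]]
print(after)   # [[99, 0], [0, 8]]
```

Only the top-left output moved; the other three patches never saw the changed pixel.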
There is no technical reason why you can’t apply pooling directly to the input images, but what would be the point of that? It’s equivalent to just “downsampling” the input images. Why not just start with smaller (lower resolution) images? That would be a one-time preprocessing cost, instead of a cost you incur in every iteration of training.
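Here is a rough sketch of that cost argument, again in plain Python. The `avg_pool` helper, the two toy 4x4 “images” and the epoch count are all assumptions for illustration. It treats 2x2 average pooling with stride 2 as the downsampling step and simply counts how often it runs in each setup.

```python
def avg_pool(image, size):
    """Average each non-overlapping size x size block of a 2D list (hypothetical helper)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] for i in range(size) for j in range(size)) / (size * size)
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# Toy dataset: two 4x4 "images" (made-up values).
dataset = [
    [[0, 2, 4, 4], [2, 4, 4, 4], [8, 8, 1, 1], [8, 8, 1, 1]],
    [[1, 1, 0, 0], [1, 1, 0, 0], [2, 2, 6, 6], [2, 2, 6, 6]],
]
epochs = 3

# Option A: pooling is the first layer, so it runs on every image in every epoch.
pool_calls_in_model = 0
for _ in range(epochs):
    for img in dataset:
        small = avg_pool(img, 2)
        pool_calls_in_model += 1

# Option B: downsample once up front, then train on the smaller images.
small_dataset = [avg_pool(img, 2) for img in dataset]
pool_calls_preprocessed = len(dataset)

print(small_dataset[0])                              # [[2.0, 4.0], [8.0, 1.0]]
print(pool_calls_in_model, pool_calls_preprocessed)  # 6 2
```

Both options feed the same 2x2 images to the rest of the network; the only difference is how many times the downsampling work gets repeated.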
Where does it say that max pooling is uniformly better than average pooling? As with just about everything here, it is situational: sometimes max is better and sometimes average is better. Maybe the better way to frame the question is: “how do I know in a given situation whether max or average pooling is the better choice?” I don’t know a definitive answer to that question, but my guess is that you try both and see which works better.
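If it helps the intuition, here is a tiny sketch of what each kind of pooling keeps and what it throws away. The three patches are made-up values, and this says nothing about which choice will train better on a real problem.

```python
def average(values):
    """Arithmetic mean of a flat list of numbers."""
    return sum(values) / len(values)


# Three flattened 2x2 patches (illustrative values only).
sparse_patch = [0.0, 0.0, 0.0, 8.0]  # one strong activation
dense_patch = [2.0, 2.0, 2.0, 2.0]   # weak activation everywhere
full_patch = [8.0, 8.0, 8.0, 8.0]    # strong activation everywhere

for name, patch in [("sparse", sparse_patch), ("dense", dense_patch), ("full", full_patch)]:
    print(name, "max:", max(patch), "average:", average(patch))
```

Max pooling gives the same output for the sparse and full patches, while average pooling gives the same output for the sparse and dense patches. Each one discards different information, which is one way to see why neither is uniformly better.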