Why do Gram matrices give a good sense of style?

Each channel in a layer extracts a particular feature. So, to find correlations between features, the values in different channels have to be compared.
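As to your second question: each entry of the Gram matrix is the sum, over all spatial positions, of the products of two channels' activations. If the two channels activate in similar places, the products add up and you get a high value. If they activate differently, the products are small or more or less cancel each other out. Normalization helps here by limiting the absolute values.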
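
To make this concrete, here is a minimal NumPy sketch of a Gram matrix computed from one layer's feature maps. The function name `gram_matrix`, the toy 2x2 channels, and dividing by the number of spatial positions are my own assumptions for illustration; style-transfer implementations differ in which normalization constant they use.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations for one layer.

    features: array of shape (channels, height, width).
    Dividing by height * width is an assumed normalization choice.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)      # (c, c) matrix of dot products

# Toy feature maps (hypothetical values, not from a real network).
vertical = np.array([[1.0, -1.0], [1.0, -1.0]])     # vertical stripes
horizontal = np.array([[1.0, 1.0], [-1.0, -1.0]])   # horizontal stripes
features = np.stack([vertical, 2.0 * vertical, horizontal])

G = gram_matrix(features)
print(G)
# [[1. 2. 0.]
#  [2. 4. 0.]
#  [0. 0. 1.]]
```

Channels 0 and 1 respond to the same pattern, so their entry `G[0, 1]` is large. Channels 0 and 2 respond to different patterns, so their products cancel and `G[0, 2]` is zero. Note that the spatial positions are summed away, which is why the matrix describes which features occur together rather than where they occur.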