Hi Hon,
You are right: it is referring to an image with 512 feature dimensions (channels), not to an image with more than 3 (RGB) color dimensions. It simply assumes that the image has already passed through a convolution layer and therefore has the specified number of features.
This link explains in detail how different filters in the convolution layer can lead to different output dimensions.
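As a rough illustration of the same idea, here is a small sketch in plain Python that computes the output shape of a 2D convolution. The helper name `conv2d_output_shape` and the 224x224 input size are my own assumptions for the example, not something from your code; the point is only that the number of output features equals the number of filters, regardless of the 3 input color channels.

```python
def conv2d_output_shape(in_shape, num_filters, kernel_size, stride=1, padding=0):
    """Return (channels, height, width) after a 2D convolution.

    in_shape is (channels, height, width). The input channel count does not
    appear in the result: each filter spans all input channels and produces
    exactly one output feature map.
    """
    _, height, width = in_shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (num_filters, out_h, out_w)


# Hypothetical RGB input: 3 color channels, 224x224 pixels.
rgb_image = (3, 224, 224)

# A layer with 512 filters of size 3x3 and padding 1 keeps the spatial size
# and yields 512 feature dimensions.
features = conv2d_output_shape(rgb_image, num_filters=512, kernel_size=3, padding=1)
print(features)  # (512, 224, 224)
```

Changing `num_filters` changes the feature dimension, while `kernel_size`, `stride` and `padding` change the height and width.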
Hope this helps.