I understand the intuition behind sliding windows, where we assume each element of the final convolved output represents a portion of the original input image, offset by some stride. My question is: is there a formal mathematical proof that each output element actually corresponds to a portion of the input image, or is there some degree of assumption on our part?

No, there are no assumptions or leaps of faith required.

It's just an application of the rules of matrix algebra.
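If it helps to see this numerically, here is a minimal NumPy sketch. The tiny network (a 3x3 convolution, ReLU, 2x2 max pooling, then a 2x2 convolution standing in for the fully connected layer) and all of the names in it are my own illustrative assumptions, not anything from the course. It runs the network once over a 10x10 image, then separately over each 6x6 crop taken with stride 2, and compares the results.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' cross-correlation of a 2-D array x with a 2-D kernel k, stride 1."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (drops a trailing odd row/column, if any)."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

def tiny_net(x, k1, k2):
    """Hypothetical network: conv 3x3 -> ReLU -> max pool 2x2 -> conv 2x2."""
    hidden = np.maximum(conv2d_valid(x, k1), 0.0)
    return conv2d_valid(max_pool_2x2(hidden), k2)

rng = np.random.default_rng(0)
image = rng.standard_normal((10, 10))
k1 = rng.standard_normal((3, 3))
k2 = rng.standard_normal((2, 2))

# One pass over the whole image: a 6x6 window maps to a single output value,
# so a 10x10 image yields a 3x3 grid of outputs.
full = tiny_net(image, k1, k2)

# Explicit sliding window: crop each 6x6 window (stride 2, set by the pooling
# layer) and run the same network on the crop alone.
window, stride = 6, 2
per_window = np.empty_like(full)
for i in range(full.shape[0]):
    for j in range(full.shape[1]):
        r, c = i * stride, j * stride
        crop = image[r:r + window, c:c + window]
        per_window[i, j] = tiny_net(crop, k1, k2)[0, 0]

print(np.allclose(full, per_window))
```

The two grids agree because every operation in the network is local: output element (i, j) is computed only from the input pixels inside its 6x6 window, so nothing outside the crop can affect it. The stride of 2 comes from the pooling layer, which is why the crops have to start at even offsets for the comparison to line up.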