Yes, looking at greyscale images may help a bit, but the real problem here is that the way the print statement in python displays a multidimensional array may conflict with the most intuitive way to visualize what that array actually means in terms of its real geometry. It just “unrolls” the leading dimensions so that it can then print the last 2 explicitly as matrices. In your case the dimensions of A_prev are (2, 5, 7, 4). Take another look at your printout and notice that what you see is 2 groups, each of which consists of 5 matrices, each of which is 7 x 4, right? That’s what I mean by how it does the unrolling.
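To make the unrolling concrete, here is a small sketch. The array below is a hypothetical stand-in for A_prev (not the values from the assignment): it has the same shape, and each entry is built so that its digits spell out its own index (m, h, w, c).

```python
import numpy as np

# Hypothetical stand-in for A_prev with shape (m, h, w, c) = (2, 5, 7, 4).
# Each value encodes its own index as digits: m*1000 + h*100 + w*10 + c.
m, h, w, c = 2, 5, 7, 4
idx = np.indices((m, h, w, c))
A_prev = idx[0] * 1000 + idx[1] * 100 + idx[2] * 10 + idx[3]

print(A_prev.shape)   # (2, 5, 7, 4)

# The first matrix that print shows is A_prev[0, 0]: sample 0, height index 0.
# It is 7 x 4: each row is one w position, each column is one channel.
print(A_prev[0, 0])

# Printing the whole array shows 2 groups of 5 such 7 x 4 matrices.
print(A_prev)
```

In the printed matrices the last digit changes as you move along a row and the tens digit changes as you move down the rows, so the 7 rows are the w positions and the 4 columns are the channels.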
So that means that the “channels” dimension is enumerated across each of the rows of those 7 x 4 matrices, which is completely different from the picture you showed of a 3 channel colored image, where they did the 2 x 2 depiction on the pixel dimensions and then handled the channel dimension by stacking three squares of pixels. That’s a much better way to visualize things, but the print statement has to handle the arbitrary case, so it can’t make any assumptions about the geometric meaning of the dimensions.
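If you want the printout to look more like the “stacked squares” picture, one option is to move the channel axis to the front before printing. This is just a sketch with a made-up array of the same shape as A_prev; `np.moveaxis` is a standard numpy function, but the variable names here are only for illustration.

```python
import numpy as np

# Made-up array with the same shape as A_prev: (m, h, w, c) = (2, 5, 7, 4).
A_prev = np.arange(2 * 5 * 7 * 4).reshape(2, 5, 7, 4)

# Take one sample and move the channel axis (last) to the front.
# Shape goes from (5, 7, 4) to (4, 5, 7): 4 channels, each an h x w grid.
sample = A_prev[0]
channels_first = np.moveaxis(sample, -1, 0)

print(channels_first.shape)   # (4, 5, 7)

# Now print shows 4 matrices of size 5 x 7, one per channel, which matches
# the picture of stacked squares of pixels.
print(channels_first)

# Each of those matrices is the same data as slicing out one channel directly.
print(np.array_equal(channels_first[2], A_prev[0, :, :, 2]))   # True
```

Note that this only changes how the data is laid out for viewing. The functions in the assignment still expect the channel dimension to be last.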
One general comment here is that when you look at the input RGB image, there are 3 channels which are the color values. But once you get past the input layer of the network, the channel dimension will typically not be 3 and you can no longer think of the values as “colors”. They are just real numbers which represent some information that the previous layers have derived or distilled from the input values.
Here’s a thread which shows some experiments that are a bit easier to interpret and that show how this is all working. Note that it was triggered by a question about trying to interpret the printed output in the “padding” function section earlier in that same “Conv Step by Step” assignment.