Just from a terminology point of view, a matrix is an array with 2 dimensions. These are 4 dimensional arrays. Yes, they are hard to visualize. I think it’s helpful just to consider one “sample”, meaning that you pick a value for the first dimension. Then you’ve got 3 dimensions remaining, which are the height, width and the 3 colors. In the case of 64 x 64 x 3 images, you can think of it as a “stack” of three monochrome 64 x 64 images: one red, one blue and one green positioned behind one another. They show some illustrations in the Week 2 Logistic Regression assignment notebook that have the kind of rendering I described above.
The other point here is that in order to process a 4D array with the type of networks we are using in Course 1, we have to “flatten” each image into a 1D vector, so that the input is a matrix (only 2 dimensions). Here is a thread which explains how that flattening works and that also might be useful to help you visualize how the 4D arrays work.