Yes, it’s pretty tricky to make the jump from 2D arrays to 4D arrays. You can take it one step at a time:
A single image is a 3D array. In our case the images are 64 x 64 x 3, which means you can think of it as 64 x 64 pixel positions, and at each position you have 3 color values (R, G, B) that give you the exact color of the pixel at that location. So it's 64 x 64 positions with 3 values at each point. Or think of it as three 64 x 64 layers stacked behind each other: the red picture, the green picture and the blue picture.
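Here's a small NumPy sketch of that (the pixel values here are just random placeholders, not real image data):

```python
import numpy as np

# One 64 x 64 RGB image as a 3D array (random placeholder values)
img = np.random.rand(64, 64, 3)

print(img.shape)    # (64, 64, 3)

# The 3 color values (R, G, B) at pixel row 10, column 20:
print(img[10, 20])  # array of 3 values

# Or view it as three stacked 64 x 64 "layers":
red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
print(red.shape)    # (64, 64)
```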
Now when you take the next step up and handle multiple images in a batch, what we do is add a first dimension for the “samples”. The number of samples is m = 209 in this case, so think of it as 209 images in a list, each of which has 64 x 64 pixels with 3 color values at each pixel location. So it’s 209 x 64 x 64 x 3.
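In NumPy that stacking looks like this (again with random placeholder values):

```python
import numpy as np

# A batch of m = 209 images, stacked along a new first "samples" axis
m = 209
batch = np.random.rand(m, 64, 64, 3)

print(batch.shape)     # (209, 64, 64, 3)

# Indexing the first axis picks out one whole image:
print(batch[0].shape)  # (64, 64, 3)
```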
Now when we “unroll” or “flatten” the 4D array into a 2D array so that we can feed it to our neural network, we need to be careful how we do that. Here’s a thread which explains that process in detail.
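As a quick sketch of one common way to do the flattening (each image becomes one column, so the result has shape (64 * 64 * 3, m); check the thread for the full reasoning):

```python
import numpy as np

m = 209
batch = np.random.rand(m, 64, 64, 3)

# Flatten each image into a single column of 64*64*3 = 12288 values,
# giving a 2D array of shape (12288, 209): one column per sample.
flat = batch.reshape(m, -1).T

print(flat.shape)  # (12288, 209)

# Sanity check: the first column is the first image, unrolled
assert np.array_equal(flat[:, 0], batch[0].ravel())
```

The key point is to reshape with the sample dimension first and then transpose; reshaping directly to (12288, 209) would scramble pixels across different images.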