Yes, your observations are correct: because of the way these “feed forward” (fully connected) networks work, we need to “unroll” the 3D images (height x width x colors) into vectors. You might think that you lose the geometric information when you do that, but it turns out that the network can still learn to recognize the patterns even in the “flattened” form. You’re also right that there are several ways you could do the “unrolling”. Any of them will work as long as you are consistent and handle all the samples in the same way, just as you say.
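To make that concrete, here’s a small NumPy sketch (the shapes and data are just made up for illustration) showing two different unrolling orders. Both are valid; they just have to be applied uniformly to every sample:

```python
import numpy as np

# A tiny hypothetical batch: 2 samples, each a 4x4 "image" with 3 color channels.
images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# Method 1: unroll each image row by row, with the channels interleaved per pixel.
flat_a = images.reshape(2, -1)                         # shape (2, 48)

# Method 2: group all the red values, then all green, then all blue, per image.
flat_b = images.transpose(0, 3, 1, 2).reshape(2, -1)   # also shape (2, 48)

# The two orderings produce different vectors for the same image...
print(np.array_equal(flat_a, flat_b))                  # False
# ...but each is lossless and invertible, so no pixel information is destroyed:
print(np.array_equal(flat_a.reshape(2, 4, 4, 3), images))   # True
```

The key point is that flattening is just a fixed, reversible reordering: the geometric relationships aren’t deleted, only no longer explicit in the layout, which is why the network can still learn them.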
Here’s a thread that discusses the mechanics of flattening the images and also addresses your point about the different unrolling methods.
Later in Course 4, we will learn about Convolutional Networks, which can actually process the 3D images in their original form with more powerful results. Stay tuned for that!