How we define the data is all just choices. There is no intrinsic reason why the samples are the columns of X rather than the rows. It is just a choice the Prof Ng has made. If you took the original Stanford Machine Learning course, he did it differently there. Of course lots of consequences follow from this choice.

People often ask why we have to transpose the weight vector w in Logistic Regression:

z = w^T \cdot x + b

Whereas when we get to full Neural Networks in Week 3, we no longer need to transpose W:

z = W \cdot x + b

The answer is that these are also choices that Prof Ng has made: he uses the convention that any standalone vector is a column vector. That applies to both w the weight vector and x the sample vector, so we need to transpose w in order for the dot product to work.

But when he defines the W matrices for neural networks, he chooses to stack the weights for each neuron as a row of W and then we don’t need the transpose.