What confuses me is why we would want to transpose a matrix, rather than orienting it in the right shape from the start so that it is ready for the dot product later.

Is there some sort of convention? For example, for a training set “X”, the columns represent the input while the rows represent the different training examples, so it is 4 by m? That raises a question for me: why do the columns represent the input and not the rows? Is there an intuitive reason?

How we define the data is all just a matter of choice. There is no intrinsic reason why the samples are the columns of X rather than the rows. It is just a choice that Prof Ng has made. If you took the original Stanford Machine Learning course, he did it differently there. Of course, lots of consequences follow from this choice.
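To make that concrete, here is a minimal NumPy sketch of one such consequence: the formula changes with the layout, but the numbers do not. The sizes (4 features, 5 samples), the random data, and the names `X_cols` and `X_rows` are just assumptions for illustration, not anything from the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 4, 5                      # 4 features, 5 samples (illustrative sizes)

w = rng.standard_normal((n_x, 1))  # weight vector as a column, shape (n_x, 1)
b = 0.5

# Layout A: each sample is a column of X, so X has shape (n_x, m).
X_cols = rng.standard_normal((n_x, m))
z_cols = w.T @ X_cols + b          # shape (1, m): transpose w, not X

# Layout B: each sample is a row, so the matrix has shape (m, n_x).
X_rows = X_cols.T
z_rows = X_rows @ w + b            # shape (m, 1): no transpose of w needed

# Same values either way; only the orientation of the result differs.
same = np.allclose(z_cols, z_rows.T)
print(z_cols.shape, z_rows.shape, same)
```

Either layout works. What matters is picking one and then writing every formula to match it.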

People often ask why we have to transpose the weight vector w in Logistic Regression:

z = w^T \cdot x + b

Whereas when we get to full Neural Networks in Week 3, we no longer need to transpose W:

z = W \cdot x + b

The answer is that these are also choices that Prof Ng has made: he uses the convention that any standalone vector is a column vector. That applies to both the weight vector w and the sample vector x, so with n features both have shape n x 1. We need to transpose w so that the dot product is a 1 x n times n x 1 product, which gives a scalar.

But when he defines the W matrices for neural networks, he chooses to stack the weights for each neuron as a row of W. Each row is effectively already a transposed weight vector, so we don’t need the transpose.
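Here is a rough NumPy sketch of that stacking idea, assuming a layer with 3 neurons and 4 inputs. The sizes, the random values, and names like `ws` and `z_per_neuron` are made up for illustration. It shows that computing w^T x neuron by neuron gives the same result as a single W x once each w^T is stored as a row of W.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_h = 4, 3                     # 4 inputs, 3 neurons (illustrative sizes)

x = rng.standard_normal((n_x, 1))   # one sample as a column vector
# One column weight vector per neuron, following the column-vector convention.
ws = [rng.standard_normal((n_x, 1)) for _ in range(n_h)]
b = rng.standard_normal((n_h, 1))   # one bias per neuron

# Neuron by neuron, the logistic-regression form needs the transpose:
# w_i.T @ x is (1, n_x) @ (n_x, 1), a 1 x 1 result.
z_per_neuron = np.vstack([w_i.T @ x for w_i in ws]) + b   # shape (n_h, 1)

# Stack each transposed weight vector as a row of W.
W = np.vstack([w_i.T for w_i in ws])   # shape (n_h, n_x)
z_layer = W @ x + b                    # no transpose needed here

stacked_matches = np.allclose(z_per_neuron, z_layer)
print(W.shape, z_layer.shape, stacked_matches)
```

If the weights had instead been stacked as columns of W, the layer formula would need W^T, which is the same kind of choice again.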