What I’m confused about is why we would want to transpose a matrix at all, rather than orienting it in the right shape from the start so that it is ready for the dot product later.
Is there some sort of standard structure here? For example, for a training set “X”, do the columns represent the inputs while the rows represent the different training examples, giving a 4 by m shape? That raises another question for me: why do the columns represent the inputs and not the rows? Is there some intuitive reasoning behind it?
How we define the data is all just a matter of choice. There is no intrinsic reason why the samples are the columns of X rather than the rows. It is just a choice that Prof Ng has made. If you took the original Stanford Machine Learning course, he did it differently there. Of course lots of consequences follow from this choice.
People often ask why we have to transpose the weight vector w in Logistic Regression:
z = w^T \cdot x + b
Whereas when we get to full Neural Networks in Week 3, we no longer need to transpose W:
z = W \cdot x + b
The answer is that these are also choices that Prof Ng has made: he uses the convention that any standalone vector is a column vector. That applies both to w, the weight vector, and to x, the sample vector, so we need to transpose w in order for the dot product to work.
But when he defines the W matrices for neural networks, he chooses to stack the weights for each neuron as a row of W, and then the transpose is no longer needed.
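In case it helps, here is a minimal NumPy sketch of those two conventions. The sizes (n_x = 4 features, n_h = 3 neurons) are just placeholders I made up for illustration, not values from the course:

```python
import numpy as np

n_x = 4                          # number of input features (illustrative)

# Logistic Regression: both w and x are column vectors of shape (n_x, 1),
# so w has to be transposed for the dot product to work.
w = np.random.randn(n_x, 1)
x = np.random.randn(n_x, 1)
b = 0.0
z = np.dot(w.T, x) + b           # shape (1, 1)

# Week 3 style: each row of W holds the weights of one neuron,
# so no transpose is needed.
n_h = 3                          # number of neurons in the layer (illustrative)
W = np.random.randn(n_h, n_x)
b_vec = np.zeros((n_h, 1))
z_layer = np.dot(W, x) + b_vec   # shape (n_h, 1)
```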
The transpose is just there to make the two matrices compatible for multiplication. Suppose we have m records, each with n features. The data matrix A will then be of shape m \times n.
Now, we decide to have p nodes in the first hidden layer. To fully connect them, each node needs one weight per feature, i.e. n weights per node. The first layer can therefore be represented as a matrix H of shape p \times n.
As you can see, multiplying these matrices is not possible because of their shapes, so the only way is to transpose one of them. Which one you transpose is a matter of choice; generally we transpose the weight matrix and keep the input matrix unchanged.
The shape of H^T is n \times p, so we can now perform the multiplication A \cdot H^T, which yields a result of shape m \times p. The batch size is unchanged, so the next hidden layer receives exactly the same number of records, but the number of features has changed to match the number of nodes p in the first hidden layer.
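Here is a quick NumPy sketch of that shape bookkeeping; the sizes m, n, p below are arbitrary values I picked for the example:

```python
import numpy as np

m, n, p = 5, 4, 3            # records, features, hidden nodes (made-up sizes)

A = np.random.randn(m, n)    # input batch: one record per row
H = np.random.randn(p, n)    # first layer: one row of n weights per node

out = A @ H.T                # (m, n) @ (n, p) -> (m, p)
print(out.shape)             # (5, 3): batch size unchanged, features now p
```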
This is how I learnt the rationale. It still doesn’t completely make sense to me, and I am still working toward a more robust explanation.
That is one way you could choose to arrange the matrix. But note that the point of my previous reply is that this is not how Prof Ng arranges the data in DLS C1 and DLS C2, and this question was asked in the DLS C1 category. There he uses the arrangement in which the columns of A are the individual sample vectors, so in Prof Ng’s scheme A would be n x m.
Yes, but at least in pandas each row is a distinct observation and the columns are the features of each record, giving m x n; Prof Ng’s n x m layout is just the transpose of that. I didn’t want to include that in the explanation, because you can’t use the term you are explaining inside the explanation itself.
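For what it’s worth, here is a tiny sketch (with made-up data and column names) of how the usual pandas layout relates to the layout used in DLS C1:

```python
import numpy as np
import pandas as pd

# Typical pandas layout: each row is one observation, each column a feature,
# so the frame is m x n (here 5 x 4, purely illustrative).
df = pd.DataFrame(np.random.randn(5, 4), columns=["f1", "f2", "f3", "f4"])
print(df.shape)          # (5, 4) -> m x n

# Prof Ng's DLS C1 convention stacks the samples as columns, so X is n x m.
X = df.to_numpy().T
print(X.shape)           # (4, 5) -> n x m
```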