Hi @Michal_Kurkowski ,
I’d like to attempt an explanation to your question with some hypothetical exercise to try to help build intuition on this matter.
Disclaimer: I will skip and oversimplify many steps here.
You already learned about Forward Propagation in week 1 of this course. This is one of the 2 fundamental components of the learning process in Neural Networks. You saw that, for each layer you have to follow 2 steps:
- Calculate a linear function Z = w x A.T + b
- Calculate a non-linear function A’ = f(Z)
So the linear function w x A.T + b will happen over and over again, from layer 1 to layer L (last layer).
Another concept that is important to keep in mind: In the forward prop, in the very first layer when we feed the data to layer 1, our goal is to feed all features of the data sample to all units of this first layer, so that each unit learns and decides what to represent of all the features. And this goal of “feeding all features to all units” is repeated layer after layer.
Ok, lets get started with this hypothetical exercise:
In layer 1 you will do, as described above, Z = w x A.T + b. And in this case of layer 1, A is actually the training dataset X, where we have all the samples, each with its features. We will say that X = A0.
Ok? so, let move on:
Ignoring the convention and definitions on how X should be formatted, lets say we do as you say: Start X “already transposed”… that is, with the features as rows and the samples as columns.
In this hypothetical case, we could go straight to Z1 = w1 x A0 + b1 avoiding the transpose operation of A0, right? and now we have Z1 and we apply the non-linear function to get A1. By the way, in this first step we reached our goal of hitting all units with all features of X.
Next we move to layer 2. The parameter w2 of layer 2 is of shape (layer2.number_of_units, layer1.number_of_units).
Now we feed A1, which we calculated in the previous step, to layer 2, which happens to have a different number of units than layer 1 (although it could have the same number of units too).
On layer 2 we have to do again a linear operation Z2 = w2 x A1 + b (note that I omitted the transposition) using the parameter w2 of layer 2. But wait… now A1 is not in a shape that can be multiplied with W2 because layer 2 had more units so w2 has a different shape! what can we do? we’ll need to do a transpose to do Z2 = w2 x A1 + b2, so we go back to Z2 = w2 x A1.T + b2
If layer 2 had the same number of units of layer 1, then the MatMul may have worked BUT we would have failed on our goal of hitting all layer 2 units with all layer 1 “features”, and instead we would be feeding all layer 2 units with all layer 1 “samples” and the result would be caotic.
Conclusion:
You can see how starting with an X “already transposed” will only avoid the very first transposition, as when you advance through the neural network, and actually in the very next layer, you’ll need to use the transposition anyways, not only to make possible the MatMul but also to achieve the goal of bringing “all features to all units”.
Going back to “Ignoring the convention and definitions on how X should be formatted”. It is established (and in my head, logically established), that X, which is the data, should have each sample as a row, and each row should contain the features as columns. I guess everything could have been defined differently from the very beginning of AI times, but it seems logical to have samples as rows and features as columns.
What do you think of all this?
Juan