Having finished the Machine Learning specialisation, I have embarked on the Deep Learning specialisation and I am noticing differences in how the material is presented, which are causing some confusion for me.
In the ML specialisation, my impression was that the matrix X always contained the training examples as rows and the features as columns. In the DL specialisation, I am now in week 3 and have already seen the matrix X defined several times with the features as rows and the training examples as columns. Why is that?
There is no universal rule for this. Both ways are correct. The only caveat is that we have to make later calculations compatible with our data.
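To make that concrete, here is a small NumPy sketch (not course code, just an illustration) showing that either orientation of X gives the same linear-layer outputs, as long as the weight shapes and the order of the matrix product are chosen to match:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3  # m training examples, n features

# "Samples first" orientation: X is (m, n), one example per row.
X_rows = rng.standard_normal((m, n))
w = rng.standard_normal((n, 1))
b = 0.5
z_rows = X_rows @ w + b          # shape (m, 1): one output per example

# "Samples last" orientation: X is (n, m), one example per column.
X_cols = X_rows.T                # shape (n, m)
W = w.T                          # shape (1, n)
z_cols = W @ X_cols + b          # shape (1, m)

# Same numbers either way, once the surrounding algebra matches the layout.
assert np.allclose(z_rows.T, z_cols)
```

The choice only changes which side of the product the weights sit on and whether the samples dimension runs down or across; the computed values are identical.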
Yes, as Saif says, this is just a choice that Prof Ng gets to make in each case. He created DLS first and decided to keep the "samples" dimension as the second dimension for the feed-forward networks. The implications of that choice carry through all the following calculations, and it just makes the math expressions a bit cleaner and simpler. But this is a matter of taste, basically. Even in DLS, when we get to Course 4, which deals with ConvNets where the inputs are typically 3D images, he goes back to the orientation where "samples" is the first dimension, so a batch of images is a 4D array or tensor with dimensions:
samples x height x width x channels
In that case, he is constrained by the fact that we are using TensorFlow, and the TF APIs are all defined to use the "samples first" orientation.
So the bottom line is that Prof Ng is the boss and we just have to understand and follow how he is defining things in each case.
I thought the MLS specialization derived from the Stanford CS and early Coursera material, which significantly predates the DLS course material. My impression is that in general, the pedagogy for MLS has much older roots, going back decades even, so that may have some influence. Regardless, I agree with the observations above that it's a style or practice, which may sometimes map to the underlying domain (like image processing on NVIDIA GPUs working better one way than the other) and sometimes just is.
It is a good point that MLS is the modern adaptation of the original Stanford Machine Learning course, first published in 2011 or thereabouts, I think. And in the original Stanford course, he also used the same "samples first" orientation of the data that he uses in MLS. So the chronology was "samples first" from 2011 to 2017, then when DLS was published in 2017, he switched to the "samples last" orientation, and then MLS was published (I forget exactly when, but 2020 or later) and returned to the earlier "samples first" orientation.
The other formulation in DLS that differs from Stanford ML is that he represents the bias term separately, so that you end up with an affine transformation instead of a pure linear transformation. In Stanford ML, he treated the bias term as a feature with a constant value of 1. I have not taken MLS, so I don't know how the bias term is handled there.
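The two formulations compute the same thing, which a short NumPy sketch (illustrative only, not course code) makes clear: folding the bias in as an always-1 feature just moves b into the first slot of the parameter vector.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 2
X = rng.standard_normal((m, n))
w = rng.standard_normal(n)
b = 1.5

# Affine form (DLS style): keep the bias as a separate term.
z_affine = X @ w + b

# Stanford ML form: fold the bias in as a "feature" that is always 1.
X_aug = np.hstack([np.ones((m, 1)), X])   # prepend the constant-1 column
theta = np.concatenate([[b], w])          # theta_0 plays the role of b
z_folded = X_aug @ theta

# Both formulations give identical outputs.
assert np.allclose(z_affine, z_folded)
```

The folded version keeps the math a single matrix product, while the separate-bias version avoids augmenting the data and maps more directly onto how frameworks store weights and biases as distinct parameters.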
Of course we’re all agreeing that the high level point remains that these are all matters of taste or style and subject to change.
MLS uses a separate bias term.