The discuss link is broken on the Vectorization Part 1 page. I have a question about shape (m, ) of ndarray vs shape (m, 1) of ndarray, and how to think about scalars vs. vectors for each observation.

Thank you!

Hello @pritamdodeja,

We will use here for all discussions.

Would you mind sharing one or two examples to elaborate on your thoughts or on what you are unsure about?

I watched this video about tensors, and the way he explained the rank of scalars, vectors, and matrices was very elegant. An unfortunate side effect is that I go to great lengths to avoid scalars unless I’m 100% sure a scalar is appropriate (e.g. strings, intercepts, etc.).

For example, take a feature that represents the square footage of a house. I imagine a basis vector and an associated index, making it a rank 1 tensor. So I would think the shape of X would be (m, n, 1), because one of the n’s represents the square footage. Then I think: shouldn’t everything in tabular data have a last axis of size 1? The next thought that comes to mind is whether there is any computational difference between shape (m, n, ) and (m, n, 1). For example, I worked on a vectorized version of the multiple linear regression lab where I vectorized w, but not the loop over the number of iterations. The loss function decreases every iteration as expected, but it does not go as low as in the non-vectorized version. However, if I normalize the data, I’m able to get it to converge nicely (the vectorized version doesn’t diverge, it just doesn’t get as low as the non-vectorized version in the same number of iterations).

So my big picture question is, is there a difference here I should be thinking about? Does using (m, n, ) push the problem into higher dimensions than is necessary?

Our convention is that a dataset has shape (m, n), where m is the number of samples and n is the number of features. Can you explain why you think it should be (m, n, 1) instead? What’s the purpose of that last 1? Based on your understanding, when would it be larger than 1 in a machine learning dataset?

I can assure you that the difference between looping through samples and vectorization is just a matter of computational efficiency, not of training performance. However, you also pointed out that normalizing the data helped convergence, and that does matter in gradient descent. We need to normalize our features so that gradient descent works well, whether you loop through samples or use vectorization.
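As a minimal sketch of the point about looping versus vectorization (not taken from the lab; the sizes, seed, and variable names are made up for illustration), the two ways of computing predictions for an `(m, n)` dataset should agree up to floating-point rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                      # hypothetical sizes: 5 samples, 3 features
X = rng.normal(size=(m, n))      # dataset in the (m, n) convention
w = rng.normal(size=n)           # weights, shape (n,)
b = 0.5                          # intercept (a scalar)

# Looping over samples: one dot product per row of X.
f_loop = np.array([np.dot(X[i], w) + b for i in range(m)])

# Vectorized: a single matrix-vector product over all samples.
f_vec = X @ w + b

same = np.allclose(f_loop, f_vec)
print(f_loop.shape, f_vec.shape, same)
```

If the two disagree by more than rounding error in your own code, that points to a bug rather than to an effect of vectorization itself.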

My rationale for `(m, n, 1)` is that the `n` features do not have the same units, so they don’t have the same basis vector, so putting them next to each other (in my mind) violates the meaning of the vector, since they have become “unit-less” now. As for the rightmost axis being larger than one, an image, for example, would have a last axis that’s larger than one. Each pixel observation can be thought of as being mapped to a cube. I am glad you brought this up, as I was wondering how one deals with “tuples”, and the image situation is the perfect example.

The reason I got different values for the loss function across the vectorized and non-vectorized versions might be that vectorization wasn’t the only change: I also made the shapes `(m, n, 1)`. In the vectorized + `(m, n, 1)` case, I’m using `f_wb = np.dot(w.T, X) + b` and reshaping it to match y. I wonder if the actual multiplications and additions are the same as in the non-vectorized `(m, n, )` case. My gut tells me no.

Update: I just found a bug in my cost function: I was using 1m vs. 2m in the denominator. I am going to recreate the same conditions as the lab notebook and report back on the vectorization thread that already has a lot of activity.
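For anyone following along, here is a rough sketch comparing the two layouts. It assumes `w` has shape `(n, 1)` in the `(m, n, 1)` case and that the cost is the squared-error cost with a `1 / (2m)` factor; neither detail is confirmed in this thread, and the data is random.

```python
import numpy as np

def cost(f_wb, y):
    """Squared-error cost, assuming the 1 / (2m) convention."""
    m = y.shape[0]
    return np.sum((f_wb - y) ** 2) / (2 * m)

rng = np.random.default_rng(1)
m, n = 6, 4                                  # hypothetical sizes
X2 = rng.normal(size=(m, n))                 # (m, n) layout
w1 = rng.normal(size=n)                      # (n,) weights
y = rng.normal(size=m)                       # (m,) targets
b = 1.0

f_2d = X2 @ w1 + b                           # predictions, shape (m,)

X3 = X2[:, :, np.newaxis]                    # same data as (m, n, 1)
w2 = w1[:, np.newaxis]                       # same weights as (n, 1)
raw = np.dot(w2.T, X3)                       # np.dot on N-D input gives (1, m, 1)
f_3d = (raw + b).reshape(m)                  # reshape to match y

cost_2d = cost(f_2d, y)
cost_3d = cost(f_3d, y)
cost_1m = np.sum((f_2d - y) ** 2) / m        # dividing by m instead of 2m doubles the cost
print(raw.shape, np.isclose(cost_2d, cost_3d), np.isclose(cost_1m, 2 * cost_2d))
```

Under these assumptions the same products and sums are computed in both layouts; only the shape of the intermediate result differs.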

Hi @pritamdodeja, thank you for sharing your thoughts, now I can understand.

First, it is certainly not a problem to use (m, n, 1), but you will need to adjust every other convention accordingly in order to compute the correct result.

Second, (m, n) is the convention adopted by all the ML packages I have seen, so in the end, if you want to use those packages (e.g. sklearn), you will need to change it back to (m, n).

Third, there is no problem at all with having different units in a vector, either computationally or in terms of meaning. There is a problem if you add two things with different units, but storing them in a vector doesn’t violate any rules.

Fourth, we don’t really need to care about units when training a machine learning model. Additionally, when you normalize your data, for example by $\frac{x - x_{mean}}{x_{std}}$, you remove any unit x carries.
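A small sketch of the normalization point, using made-up house areas and a hypothetical helper name `zscore`: the same feature expressed in square feet and in square metres gives identical z-scores, because the unit cancels between numerator and denominator.

```python
import numpy as np

def zscore(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()

# Made-up areas for four houses, first in square feet ...
area_sqft = np.array([952.0, 1244.0, 1947.0, 2104.0])
# ... then the same houses in square metres (1 sqft is about 0.092903 sq m).
area_sqm = area_sqft * 0.092903

z_sqft = zscore(area_sqft)
z_sqm = zscore(area_sqm)

units_cancel = np.allclose(z_sqft, z_sqm)
print(units_cancel)
```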

Fifth, a feature vector of n features is a vector in the corresponding n-dimensional feature space, representable by n basis vectors corresponding to its features, each carrying whatever unit that feature has.

Lastly, I am trying to persuade you to stick with the (m, n) convention because there is nothing wrong with it, and it is the convention. I am not saying your way is wrong, but the convention is right too.

OK. They should produce the same result. If not, there is a bug somewhere.
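One shape-related pitfall that can cause this kind of silent mismatch (a common NumPy gotcha, offered as a guess and not as a diagnosis of the code in this thread) is subtracting a `(m, 1)` array from a `(m,)` array. Broadcasting then produces an `(m, m)` matrix instead of `m` errors. The toy values below are made up:

```python
import numpy as np

m = 4
f_wb = np.arange(m, dtype=float)                  # predictions, shape (m,)
y_col = np.arange(m, dtype=float).reshape(m, 1)   # targets stored as (m, 1)

# Mixed shapes: broadcasting silently builds an (m, m) matrix.
err_bad = f_wb - y_col

# Matching shapes: flatten y first so both sides are (m,).
err_ok = f_wb - y_col.ravel()

print(err_bad.shape, err_ok.shape)
print(np.sum(err_bad ** 2), np.sum(err_ok ** 2))
```

Here the predictions equal the targets, so the correct squared error is zero, while the broadcast version reports a nonzero value. Checking `.shape` on both operands before subtracting is a cheap way to catch this.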

Let me give you an example of people putting different units inside a vector.

Polar coordinates. We can convert our usual Cartesian coordinates (our 3D world) to polar coordinates, right? Remember that in Cartesian coordinates every basis vector has the unit of meter, but in polar coordinates one basis vector has the unit of meter while all the others have the unit of radian (an angle). Clearly, meters and angles are different units.

Thank you for all the thought you put into framing your response. I agree it doesn’t make sense to think in terms of `(m, n, 1)` when doing anything with `pandas` and `sklearn`. If I had to deal with higher dimensional data there, I would make a column multi-index and gain the benefits of vectorization without losing the dimensionality of that feature. I faced a situation like this when dealing with an outlier detection problem in the drug discovery area, where each observation contained another level of depth in the data. About your third point, I always thought of units as being something the basis vector brings in. Since a vector is a tensor of rank 1, there can only be one basis vector and one index. In the case of a tensor of higher rank, I agree that the units can be mixed. I kind of think of `(m, n, )` and `(m, n, 1)` as treble clef and bass clef. I want to be a piano player while playing nice with the (mostly) guitar players and the odd bass player.

Thank you for your patience with me!

Now I understand even more of your rationale. Indeed, depth is a real thing, so your consideration is more general, but whether you make it $(m, n, k)$ or $(m, n \times k)$ then depends on your modeling assumptions or your model architecture. If it is a photo, we use a CNN that expects a depth of 3; otherwise we need to think about it. Perhaps sometimes we should flatten out all or some of the depth, and perhaps sometimes we just shouldn’t.
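As a quick illustration of the two layouts mentioned here (toy sizes and placeholder values, chosen only for the example), NumPy's `reshape` converts between `(m, n, k)` and `(m, n * k)` without changing the underlying values, so the choice is about what the model expects rather than about the data itself:

```python
import numpy as np

m, n, k = 2, 3, 4                                # hypothetical sizes
X_deep = np.arange(m * n * k).reshape(m, n, k)   # each feature carries k depth values

# Flatten the depth axis into the feature axis.
X_flat = X_deep.reshape(m, n * k)

# The flattening is reversible as long as n and k are remembered.
X_back = X_flat.reshape(m, n, k)

print(X_deep.shape, X_flat.shape, np.array_equal(X_back, X_deep))
```

With the default row-major order, the first `k` columns of a flattened row hold the depth values of the first feature, the next `k` those of the second feature, and so on.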

I can understand that you want to preserve the freedom to have depth information in your dataset, and I am sure you can also see that the convention adopted in our courses is sufficient and useful.

Thank you @rmwkwok! One problem that I am fascinated by is as follows:

Imagine you are building a model which looks at a picture of a plot of y = f(x) for some f(x). The number of points in the plot is a variable and it is a scatter plot. The model should spot points which are outliers. Now imagine the same problem, but instead of a picture, you get x and y in p rows, with a total of m of these sets, where each row (in training) has as a label 0 or 1 for non-outlier or outlier.

It is this problem that I want to try and understand, which is why I am spending way more time understanding these concepts than I should :).

@rmwkwok has given elegant explanations for the points you have raised.

Allow me to bring in a slightly different angle for one of the points.

When we talk about n features being put side by side in a matrix (which is what you are not happy about, because they don’t have the same basis vector, which renders them unit-less), let us ask whether the n features are ever used together without each being multiplied by its corresponding weight. And there lies the equivalent of your basis vector!

When n1 is the number of bedrooms, w1 is the price per bedroom.

When n2 is the number of floors, w2 is the price per floor.

When n3 is the area of the house in sqft, w3 is the price per sqft.

By computing w1 × n1, w2 × n2, w3 × n3, and so on for all the features and their corresponding weights, we ensure that unit-wise they all align with the target “y”, and we do so while preserving the respective units of each of the n features.
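A tiny numeric sketch of this idea, with made-up feature values and weights (none of these numbers come from the course): each product of a weight and its feature is already in price units, so summing the products is a sum of like quantities.

```python
import numpy as np

# Made-up house: 3 bedrooms, 2 floors, 1500 sqft.
x = np.array([3.0, 2.0, 1500.0])
# Made-up weights: price per bedroom, per floor, per sqft.
w = np.array([10000.0, 5000.0, 100.0])
b = 20000.0                      # base price (intercept)

contrib = w * x                  # per-feature contributions, all in price units
price = contrib.sum() + b        # same value as np.dot(w, x) + b

print(contrib, price)
```

The elementwise product makes the per-feature contributions visible; `np.dot(w, x) + b` computes the same total in one step.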

@shanup, I think I am guilty of circular reasoning to justify my use of `(m, n, 1)` as opposed to `(m, n, )`, because I am implicitly assuming that `x_i` and `w_i` are vectors in a one-dimensional space. Scalars can have dimensions too, and all dimensions are implicit here, as you point out. The actual computation that’s happening is the same in terms of the dot product, the only difference being that in the `(m, n, 1)` scenario you end up doing `np.dot(w.T, x) + b`. The reasoning that the formulas are more consistent in the `(m, n, 1)` scenario doesn’t hold up, as you can do `np.dot(w.T, x) + b` in the `(m, n, )` case too. For me, thinking about `n` vectors in `1` dimension is easier than thinking about an `n`-dimensional vector, as I can’t interpret what’s happening to that vector even in the simplest of scenarios. I can’t wait to learn how one would encode a vector such as velocity, with both magnitude and direction, and how you deal with a tensor X whose component vectors have different shapes.
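For a single example, the observation that `np.dot(w.T, x) + b` works in both layouts can be checked directly. This is a small sketch with made-up numbers; note that `.T` is a no-op on a 1-D array, and that the column-vector version returns a `(1, 1)` array rather than a scalar.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])    # made-up weights, shape (n,)
x = np.array([4.0, 5.0, 6.0])    # made-up single example, shape (n,)
b = 0.5

# 1-D layout: w.T is the same as w, and the result is a scalar.
f_1d = np.dot(w.T, x) + b

# Column-vector layout: shapes (n, 1); the result has shape (1, 1).
w_col = w.reshape(-1, 1)
x_col = x.reshape(-1, 1)
f_col = np.dot(w_col.T, x_col) + b

print(f_1d, f_col.shape, f_col.item())
```

The value is the same either way; the column-vector layout just carries extra singleton axes that need to be squeezed or reshaped before comparing with `y`.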