For prediction, can w(j) dot x(i) + b(j) also be something like w1x1+w2x2+w3x1x2+b?

Yes, indirectly. You can compute the additional features and store them within the X matrix. They do not need to be listed separately if you use matrix notation.
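
A minimal sketch of that idea in NumPy, assuming two raw features and made-up values for `w` and `b` (every name and number here is illustrative, not from the course): the product x1*x2 is computed once and stored as a third column of X, after which the prediction is still a single matrix-vector product.

```python
import numpy as np

# Two raw features per example (illustrative values only).
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# Engineered feature x1 * x2, appended as a third column of X.
x1x2 = X_raw[:, 0] * X_raw[:, 1]
X = np.column_stack([X_raw, x1x2])

# Hypothetical parameters: w = [w1, w2, w3] and a scalar b.
w = np.array([0.5, -1.0, 2.0])
b = 0.1

# Same matrix form as before: w1*x1 + w2*x2 + w3*x1*x2 + b for every row.
y_hat = X @ w + b
```

Because the extra term lives inside X, nothing about the notation w dot x + b has to change.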

Hello @Jinyan_Liu,

Since you are focusing on prediction, I want to share that x^{(i)} and w^{(j)} are learned in the training process such that x^{(i)}\cdot w^{(j)} + b^{(j)} is close to y^{(i,j)}. If you have trained them in the above manner but predict in a different way, then the result is not guaranteed to be close to y^{(i,j)}. We train the model to be good at a particular task, and using the model on a different task (like predicting with a different equation) is risky.
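
To make the prediction form concrete, here is a small sketch with hypothetical, already-learned parameters (the array names `X`, `W`, `b` and all values are assumptions for illustration). It computes x^{(i)}\cdot w^{(j)} + b^{(j)} for every movie i and user j at once:

```python
import numpy as np

# Hypothetical learned parameters: 3 movies, 2 users, 2 latent features.
X = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])   # row i is x^(i), one vector per movie
W = np.array([[4.0, 1.0],
              [1.0, 5.0]])   # row j is w^(j), one vector per user
b = np.array([0.5, -0.5])    # b^(j), one bias per user

# Entry (i, j) is x^(i) . w^(j) + b^(j), the same form used in training.
Y_hat = X @ W.T + b
```

Predicting with the same expression that the cost function used during training is what keeps Y_hat close to the observed ratings.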

Cheers,

Raymond

Thanks!

Sorry for the confusion! I meant for also training with w1x1+w2x2+w3x1x2+b.

So basically it’s like Linear Regression, where I can create new features by raising x to higher degrees to fit the data better (while also trying to avoid overfitting).

If you were still discussing this in the context of collaborative filtering, each x^{(i)} is for one movie, and only one x^{(i)} is necessary for one movie. Having multiple x^{(i)} for the same movie is not necessary and is redundant.

Note my notation - I am talking about x^{(i)} and the lecture is using x^{(i)}. I am not sure what x1 or x2 are - are the 1 and the 2 from x1 and x2 subscripts or superscripts? If you have any followup regarding this reply, it would be great if you would ask about it using the same notation.

Also, x^{(i)} is a vector, and w^{(j)} is a vector. We dot them together to get a scalar. How are you going to have three vectors arranged together in one term - like one w and two x? Think again.

Ah yes I should make superscript and subscript more clear!

My question comes from:

x^{(i)}\cdot w^{(j)} + b^{(j)} is a “straight” line. But I think a user’s ratings shouldn’t be a “straight” line. So, to fit a “curved” line, can we also add degrees to x, like what we learned before?

By “add degrees to x”, I mean:

e.g., a movie has only 2 features x_{1} and x_{2};

x_{1}^{2}, x_{2}^{2} and x_{1}x_{2} are engineered features for the movie.

So we train and predict this:

x^{(i)}_{1}\cdot w^{(j)}_{1} + x^{(i)}_{2}\cdot w^{(j)}_{2} + \left(x^{(i)}_{1}\right)^{2}\cdot w^{(j)}_{3} + \left(x^{(i)}_{2}\right)^{2}\cdot w^{(j)}_{4} + x^{(i)}_{1} x^{(i)}_{2}\cdot w^{(j)}_{5} + b^{(j)}

Can we do this in Collaborative Filtering?

And if we do this, would it fit the data better than

x^{(i)}_{1}\cdot w^{(j)}_{1} + x^{(i)}_{2}\cdot w^{(j)}_{2} +b^{(j)} ?

Hello @Jinyan_Liu,

I think there is a misunderstanding here. Generally, when we speak about whether it is straight or not, we are discussing with respect to a certain feature space that is formed by the **GIVEN** features.

When we first learn ML with the example of housing prices, we are given some features (like area, floor, years, etc.), and we know that **relative to** those given features, the price might not be linear. Then what do we do? We use multiple **HIDDEN** layers to convert those **GIVEN**, explicit features into some implicit features, which are finally fed into the **OUTPUT** layer for a **LINEAR** judgement.

You see? The output layer is linear. We use the linear activation for the output layer, right?

Now, in collaborative filtering, we do NOT have any GIVEN features. We skip the hidden layers because there is no need to convert any given features, and instead, we have the output layer directly! What is fed into the output layer this time? w and x!

See the difference? In the housing price problem, we convert given features into implicit features (which is non-linear relative to the given features) by the hidden layer. In collaborative filtering, we learn the implicit features directly.

The implicit features in both CF and the housing price problem are LINEAR with respect to the prediction. There is no difference between them in that.

Therefore, there is no need to add those terms.

This is equivalent to x^{(i)} \cdot w^{(j)}. See if you can find out why. If you can tell me why and how, then we will move on to those “non-linear” terms.

Try it out yourself.

I couldn’t understand why x^{(i)} \cdot w^{(j)} is equivalent to x^{(i)} \cdot w^{(i)} ?

If so, x^{(i)} \cdot w^{(i)} ‘s value is static per movie?

Hey @Jinyan_Liu, I have fixed a typo. Please read my previous post again. One way to think about it is to start by writing down on a piece of paper some simple example x (for a movie) and w (for a user) for the two equivalent cases, and do the dot products. Good luck.

Yes.

x^{(i)} \cdot w^{(j)} is equal to x^{(i)}_{1}\cdot w^{(j)}_{1} + x^{(i)}_{2}\cdot w^{(j)}_{2}.

(Now I notice I used dot product for x^{(i)}_{1}\cdot w^{(j)}_{1} + x^{(i)}_{2}\cdot w^{(j)}_{2}! I actually meant x^{(i)}_{1}* w^{(j)}_{1} + x^{(i)}_{2}* w^{(j)}_{2} )

Are you sure you want to use element wise multiplication? You need to produce a scalar to represent the rating.
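
A quick sketch of the distinction, using made-up two-feature vectors (the values and variable names are assumptions for illustration only): element-wise multiplication keeps one product per feature, while the dot product sums those products into a single number.

```python
import numpy as np

x = np.array([2.0, 3.0])   # hypothetical x^(i) for one movie
w = np.array([0.5, 1.0])   # hypothetical w^(j) for one user

elementwise = x * w        # still a vector: one product per feature
dot = np.dot(x, w)         # a scalar: the products summed

# Summing the element-wise products recovers the dot product.
summed = elementwise.sum()
```

So x_1 w_1 + x_2 w_2 written out component by component is the same quantity as the dot product, while the element-wise product of the two vectors on its own is not a rating.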

To continue this discussion, please propose the whole equation again, with the correct AND your desired type of multiplication. Make sure it can produce a scalar, though.

Sorry, that was a typo. I meant to write in previous posts:

x^{(i)}_{1}* w^{(j)}_{1} + x^{(i)}_{2}* w^{(j)}_{2} + \left(x^{(i)}_{1}\right)^{2}* w^{(j)}_{3} + \left(x^{(i)}_{2}\right)^{2}* w^{(j)}_{4} + x^{(i)}_{1} x^{(i)}_{2}* w^{(j)}_{5} + b^{(j)}

And after reading your explanation post, I understand now x^{(i)}\cdot w^{(j)} + b^{(j)} is just fine!

Alright. Let me just wrap this up. First, note that element-wise multiplication won’t produce a scalar, which is what we need to predict a scalar rating. Your original dot product equation is more reasonable.

Second, if we go back to your dot product equation, after knowing that they are equivalent, we can look at those additional “non-linear” terms. Those terms don’t use any new weights but the weights from x_1^{(i)}, x_2^{(i)}. This means that they don’t offer any additional degrees of freedom to the model. In other words, with or without the non-linear terms, the models have the same number of trainable weights to tune themselves to the training dataset. Although the resulting models will be different, having the same degrees of freedom does not make a convincing case that the “non-linear” model is more promising, given what I have explained earlier.

Cheers,

Raymond

Yes. Now I understand! Thank you so much for explaining it!