Week 2: Content-based filtering

I was trying to understand how the model would learn the parameters
vu and vm.

In the context of collaborative filtering, would it be identical to how we learned w and x, except that now vu and vm are the outputs of neural networks, so the calculations to actually get all the derivatives and so on would be more complicated?
If I understand it correctly, I'd appreciate it if someone could say so, haha.
If not, I'd appreciate it if you could tell me how it would work.
To summarize, I understand it as: the gradient descent would be identical to the one from collaborative filtering, but now with the vu and vm parameters to update instead of w, x, and b.

Hello @Michal_Bober,

That’s an interesting angle!

I believe you have got the idea. Here is my version: in content-based filtering, we need a few layers to convert existing features into some useful non-linear features (vu and vm). In collaborative filtering, there are no existing features; w and x are the useful features that are learned, and since your w and x are already obtained through learning, there is no need to learn additional layers to convert them into something else.


Thanks for the response. I would like to know more about the process of how the model would learn these parameters, for example how gradient descent would work here.

Hello @Michal_Bober,

Just as a linear regression problem can use gradient descent to learn parameters that achieve the minimum mean squared cost, and a logistic regression problem can use gradient descent to learn parameters that achieve the minimum logistic cost, the content-based filtering problem in the lab uses gradient descent to learn parameters that achieve the minimum mean squared cost.

Each training step moves the trainable parameters a bit towards a minimum cost. That is what a gradient descent step does. After many steps, hopefully, we get to a set of parameters that minimizes the cost well.
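As a rough illustration of that step-by-step idea (not the lab's actual code; the toy data, parameter names, and learning rate here are all made up), a gradient descent loop on a mean squared cost can be sketched like this:

```python
import numpy as np

# Toy data: y is roughly 3*x + 1 (a made-up example, not the lab's data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 3.9, 7.2, 9.8])

w, b = 0.0, 0.0   # trainable parameters
alpha = 0.05      # learning rate

for step in range(2000):
    y_hat = w * x + b            # model prediction
    error = y_hat - y
    # Gradients of the mean squared cost J = mean(error**2) / 2
    dw = np.mean(error * x)
    db = np.mean(error)
    # One gradient descent step: move parameters against the gradient
    w -= alpha * dw
    b -= alpha * db

# w ≈ 2.94, b ≈ 1.09 after convergence (the least-squares fit of this toy data)
```

In content-based filtering the parameters being updated are the weights of the user and movie networks rather than a single w and b, and a framework like TensorFlow computes the derivatives automatically, but each training step does exactly this kind of move.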

For more details, I suggest you review Course 1 (Weeks 1 and 3) for the ideas behind gradient descent in linear regression and logistic regression. It is the same idea as what we are doing in content-based filtering.


In content-based filtering, we must compute both “w” and “b” values, similar to collaborative filtering. Should we calculate the “w” and “b” values per movie, per user, or what? I'm a bit confused about that. If I want to implement content-based filtering from scratch, do I need to calculate w and b per user or per movie?

Hi @maDan_kD,

Content-based filtering and collaborative filtering are two different approaches. I recommend you review the lectures, but I can give you a brief summary here.

Collaborative filtering assumes one set of trainable parameters per movie and one set per user, because neither of them comes with any features. In content-based filtering, however, every user and movie comes with a set of features, so we do not have one set of trainable weights for each user or movie. Instead, we have one neural network for users (NOT per user) and one network for movies (NOT per movie), and the networks convert a user's existing features into an embedding, and a movie's existing features into another embedding.

In collaborative filtering, a user and a movie are compared by applying a similarity function to the trainable weights of the user and the movie; in content-based filtering, the comparison applies a similarity function to the converted embeddings of the user and the movie.

I hope you can see the difference: in collaborative filtering, we learn the embeddings directly, while in content-based filtering, we learn neural networks that convert features into embeddings.
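To make the two-network idea concrete, here is a minimal NumPy sketch (not the lab's code; the feature sizes, single-linear-layer towers, and random weights are made up for illustration). Note there is exactly one user network and one movie network, shared across all users and movies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 5 user features, 7 movie features, 3-d embeddings
n_user_feat, n_movie_feat, emb_dim = 5, 7, 3

# ONE user network and ONE movie network (a single linear layer each here),
# shared across all users and all movies respectively
W_user = rng.normal(size=(emb_dim, n_user_feat))
W_movie = rng.normal(size=(emb_dim, n_movie_feat))

def predict_rating(x_user, x_movie):
    v_u = W_user @ x_user      # user embedding (vu)
    v_m = W_movie @ x_movie    # movie embedding (vm)
    return v_u @ v_m           # similarity = dot product of the embeddings

# Any user's and any movie's existing features go through the SAME networks
x_u = rng.normal(size=n_user_feat)
x_m = rng.normal(size=n_movie_feat)
score = predict_rating(x_u, x_m)
```

In the actual lab the towers have several non-linear layers rather than one linear layer, but the structure is the same: features in, embeddings out, dot product for the prediction.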

Please review the lectures with my answer in mind, and if you still have follow-ups, let me know here.


So if users and movies each have 100 features, we need to calculate 100 weights shared by all users and 100 weights shared by all movies, rather than 100 weights for every user or movie. Am I correct?

If you are still asking about content-based filtering, then the number of weights depends on your neural network architecture. You build a neural network that accepts the given features and outputs an embedding. The neural network contains weights, and how many depends on its architecture: you have more weights if you have more layers, and more weights if you have more nodes. The number of layers and the number of nodes are your design choices, justified by performance, so no one can tell you how many layers, nodes, or weights it is going to take to make a good neural network that does the job well.
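As an illustration of how the weight count follows from the architecture rather than from the number of users or movies, here is a small sketch that counts the parameters of a stack of dense (fully connected) layers; the layer sizes below are made up, not the assignment's:

```python
def dense_param_count(layer_sizes):
    """Count weights + biases in a stack of dense layers.

    layer_sizes = [n_inputs, n_hidden_1, ..., n_output].
    A dense layer with n_in inputs and n_out units has
    n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Made-up example tower: 100 input features -> 64 hidden -> 32-d embedding
print(dense_param_count([100, 64, 32]))  # (100*64 + 64) + (64*32 + 32) = 8544
```

So with 100 input features you would not have "100 weights"; the count is set by the layer sizes you choose, and changing the architecture changes it.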

I also suggest you go through the assignment for content-based filtering, because you will be asked to build some neural networks, and you can count the number of weights there. Keep in mind that you are free to change the architecture whenever you want to try a different one with the aim of improving performance, so no matter how many weights you see in that assignment, don't take that as a hard rule. It is just one of many possible choices.

Oh thanks, now I understand it.


If it is OK, I am going to use the same topic because I have a related question. How do we know that Vu and Vm are actually correlated? Isn't that a wild guess? We are basically relying on the fact that the initial 128 user features considered will somehow condense into 32 features, and that the initial 256 movie features considered will also condense into 32 features that correlate with the 32 user features. Is that correct, or is there more to it?

Hello @Thierry_Kouthon,

We will know they are unhelpful if the performance turns out to be super poor. That's the idea: you let the model performance speak for itself.

Of course, we humans can do as much as we can to supply Vu and Vm with features that we believe to be useful in predicting the rating. Note that we don't need to make sure the user features are correlated with the movie features in any way, though we don't need to exclude such features either. The idea is: we supply the features, and gradient descent will decide how to convert them into the same 32-dimensional space so as to yield good predictions.

So, to answer your question: we don't know for sure at first, but we will find out as we monitor how the performance improves over the training process.
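As a toy illustration of "letting the performance speak for itself" (entirely made-up sizes and synthetic data, not the lab's setup), the sketch below trains two tiny linear towers by gradient descent and watches the mean squared cost drop over the training steps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up sizes: 8 user features, 6 movie features, 4-d embeddings
n_user_feat, n_movie_feat, emb_dim, n_samples = 8, 6, 4, 100

# Synthetic training data: features, plus ratings generated from hidden
# "true" towers Tu/Tm, just so there is something learnable
Xu = rng.normal(size=(n_samples, n_user_feat))
Xm = rng.normal(size=(n_samples, n_movie_feat))
Tu = rng.normal(size=(n_user_feat, emb_dim))
Tm = rng.normal(size=(n_movie_feat, emb_dim))
y = np.sum((Xu @ Tu) * (Xm @ Tm), axis=1)

# Trainable towers: a single linear layer each, small random init
Wu = rng.normal(size=(n_user_feat, emb_dim)) * 0.1
Wm = rng.normal(size=(n_movie_feat, emb_dim)) * 0.1

def cost(Wu, Wm):
    pred = np.sum((Xu @ Wu) * (Xm @ Wm), axis=1)
    return np.mean((pred - y) ** 2)

alpha = 0.001
costs = [cost(Wu, Wm)]
for _ in range(100):
    Vu, Vm = Xu @ Wu, Xm @ Wm
    err = np.sum(Vu * Vm, axis=1) - y
    # Gradients of the mean squared cost w.r.t. each tower's weights
    gWu = 2 * Xu.T @ (err[:, None] * Vm) / n_samples
    gWm = 2 * Xm.T @ (err[:, None] * Vu) / n_samples
    Wu -= alpha * gWu
    Wm -= alpha * gWm
    costs.append(cost(Wu, Wm))
```

If the supplied features carried no useful signal, the cost curve in `costs` would stay flat; watching it fall (or not) is exactly the "monitor the performance" idea.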


Hi Raymond,

Sorry for getting to this so late. Your explanation makes sense. It is basically trial and error.



No worries, Thierry. We all visit here at the right time :wink:

And I agree with you.