I am currently going through the videos and assignments in Week 2. In the video “Deep learning for content-based filtering”, it is mentioned that we could precompute Vm to quickly identify similar movies. How do we precompute Vm? From the videos, I can see that we use a squared error cost function that needs Vu as well. So how can we precompute Vm without the user network?
My understanding is that we need a cost function to train a neural network and get its output. So what cost function should we use if we are working with the movie network alone?
Hello Krishna, welcome to our community. If you check the content of the assignment on content-based filtering, you will see, first, that we use MeanSquaredError as the cost function. Second, we cannot train just the movie network or just the user network alone; we need to train both at the same time, because the cost is calculated from the outputs of both networks. Once both are trained, we can convert the movies into their movie vectors using the movie network on its own.
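In case it helps, here is a minimal sketch of that setup, assuming hypothetical feature counts (the assignment defines its own). The point is that the Dot layer ties the two towers' outputs together, so the MeanSquaredError loss can only be computed with both networks present:

```python
import tensorflow as tf

# Hypothetical feature sizes; the assignment uses its own values.
num_user_features = 14
num_item_features = 16

# Two towers; each ends in a linear 32-unit layer whose output is the vector.
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32),   # linear output: this is v_u
])
item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32),   # linear output: this is v_m
])

input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = tf.linalg.l2_normalize(user_NN(input_user), axis=1)

input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = tf.linalg.l2_normalize(item_NN(input_item), axis=1)

# The prediction is the inner product of v_u and v_m, so the squared
# error depends on the outputs of BOTH towers at once.
output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss=tf.keras.losses.MeanSquaredError())
```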
Thanks for your explanation! My question is more about the precomputing part discussed in the video. If I have understood your explanation correctly, we train both networks together (say, with user j for all movies). We then have Vu for user j and Vm for all movies. So if a new user shows up and watches a movie i, we can suggest related movies just by using the already-computed Vm (without using the new user’s information). Is this the correct understanding?
We need both a user vector and a movie vector to make a prediction, so as long as you have the vectors for your user and your set of movies, then yes, you can make recommendations for that user from those movies. As for whether you want to retrain the model, that depends on the performance of the model’s predictions: if the performance degrades over time, you may want to retrain it.
“…one additional optimization is that if you have computed Vm for all the movies in advance, then all you need to do is to do inference on this part (Vu) of the neural network a single time to compute Vu. Then take that Vu they just computed for the user on your website right now and take the inner product between Vu and Vm.”
If we can’t train just the movie network or just the user network alone, how should we understand this optimization mentioned in the video?
Actually, the algorithm here trains both networks at the same time, so we either have both or neither. Training them together is what makes it possible to place a user and a movie in the same vector space, so training only the movie network or only the user network is not an option. The model’s performance is also evaluated using both networks, so we won’t end up with a good user network but a bad movie network, or anything like that; in other words, we either have good user and movie networks, or bad ones. We can’t separate them.
Thanks for your very prompt reply! Yes, while I understand that both networks are trained at the same time, I still have some trouble wrapping my mind around this part of the video.
The question being:
Since they are trained, and the vectors retrieved, together, if we have computed Vm, doesn’t that mean that we have also computed Vu? Then what does ‘do inference on this part of the neural network a single time to compute Vu’, as mentioned in the video, mean?
I see. I certainly can’t guess the considerations behind the scenes of this video, but one possibility is that the set of users is more dynamic than the set of movies: for example, we might get a new movie every day or two, but 200 new users per hour. In this case, we are motivated to precompute all the Vms and store them in a database, so that when a new user arrives, we use the user network to compute the vector for the new user and compare this vector against all the precomputed movie vectors.
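To make that concrete, here is a rough sketch, assuming `user_NN` and `item_NN` are the trained towers from the sketch earlier in the thread, and `all_movie_features` / `new_user_features` are hypothetical, already-scaled feature arrays:

```python
import numpy as np

# Offline, once: run every movie through the trained item network and
# store the normalized vectors (e.g. in a database).
# all_movie_features: hypothetical (num_movies, num_item_features) array.
vms = item_NN.predict(all_movie_features)               # (num_movies, 32)
vms = vms / np.linalg.norm(vms, axis=1, keepdims=True)

# Online, per user: a SINGLE forward pass through the user network...
# new_user_features: hypothetical (num_user_features,) array.
vu = user_NN.predict(new_user_features.reshape(1, -1))  # (1, 32)
vu = vu / np.linalg.norm(vu)

# ...then the predictions are just inner products against the stored table.
scores = vms @ vu.ravel()           # predicted rating for every movie
top_10 = np.argsort(-scores)[:10]   # indices of the 10 best matches
```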
Certainly we could also precompute all existing user vectors, but we might not be very motivated to do so, because (1) it can’t cover new users anyway; (2) not every existing user is going to use the service today, so possibly 95% of the precomputed user vectors would never be used before being recomputed, whereas this is not a problem for the movie vectors if we always recommend from among all existing movies; and (3) we usually have far, far more users than movies, so precomputing for all existing users can be quite costly.
However, I think you can always devise a strategy to balance online inference latency against precomputation cost, so it is not an all-or-nothing choice.
There are some ambiguities in the lectures on content-based filtering. I will try to list them as follows:
1. Do we have to have the data of user and movie features, i.e. x_u and x_m, as well as the users’ ratings for each movie, to start the process, i.e. to find the v_u and v_m vectors?
2. What about the sizes of x_u and x_m? Is size(x_u) = (number_users, number_features) and size(x_m) = (number_movies, number_features)? P.S. I know that number_features can be different for the two. Or, to be more precise:
3. Now, we train x_u in the user network and x_m in the movie network to find v_u and v_m. So far so good. But do we train users and movies one by one, or as matrices of size, let’s say, number_users × number_features and number_movies × number_features? I mean, if we want a vector as the output of a network, then we have to use one single user and one single movie as input; otherwise the outputs would be matrices of size len(v_u) × number_users and len(v_m) × number_movies (taking v_u as a 1-D array), where, of course, len(v_u) = len(v_m).
4. In the lecture video "Tensorflow implementation of content-based filtering"
In the video entitled "Collaborative filtering vs Content-based filtering", at 0:59, Prof Andrew mentions:
“In other words, it requires having some features of each user, as well as some features of each item, and it uses those features to try to decide which items and users might be a good match for each other. With a content-based filtering algorithm, you still have data where users have rated some items. Well, content-based filtering will continue to use r(i,j) to denote whether or not user j has rated item i, and will continue to use y(i,j) to denote the rating that user j has given item i, if it’s defined.”
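(As an aside, those are exactly the quantities that show up in the content-based cost function; as I recall from the "Deep learning for content-based filtering" video, it sums the squared errors over only the pairs that actually have a rating, plus a regularization term on the network weights:

$$J = \sum_{(i,j):\, r(i,j)=1} \left( v_u^{(j)} \cdot v_m^{(i)} - y^{(i,j)} \right)^2 + \text{NN regularization term}$$

which also makes it clear why the two networks can only be trained together.)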
Another reference for the same can be found in the video entitled "Deep learning for content-based filtering" at 4:09:
“What we’re going to do is construct a cost function J, which is going to be very similar to the cost function that you saw in collaborative filtering, which is assuming that you do have some data of some users having rated some movies…”
I suppose this solves the first ambiguity. Now, let’s come to the second one. I believe that in the lecture videos, x_u and x_m always denote single user and movie feature vectors respectively, not matrices. Please do correct me if I am wrong. And if your ambiguity is about whether x_u and x_m have the same number of features or not, then Prof Andrew addresses this in the video entitled "Collaborative filtering vs Content-based filtering" at 5:27:
“Notice that the user features and movie features can be very different in size. For example, maybe the user features could be 1500 numbers and the movie features could be just 50 numbers. That’s okay too.”
I hope this solves the second ambiguity.
Now, coming to the third one. I guess this is not much of an ambiguity, since it works just like any other neural network. We can treat each <user, movie> rating pair as a single training example. Then you simply decide the batch size for the inputs; say we agree on 32. So we feed 32 user feature vectors into the user network, and the 32 corresponding movie feature vectors into the movie network, to get 32 <v_u, v_m> pairs, for which we have 32 ratings as the true labels. I believe this resolves the third one.
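A minimal sketch of that, assuming the compiled two-tower `model` and the feature counts from the sketch earlier in the thread, with toy random arrays standing in for real data:

```python
import numpy as np

# Toy arrays just to show the shapes: 1000 <user, movie> rating pairs,
# one row per pair (random placeholders, not real data).
num_pairs = 1000
user_train = np.random.rand(num_pairs, num_user_features)
item_train = np.random.rand(num_pairs, num_item_features)
y_train = np.random.rand(num_pairs)   # the true ratings

# Each batch of 32 pairs flows through both towers together, producing
# 32 <v_u, v_m> pairs and hence 32 predicted ratings to compare with y.
model.fit([user_train, item_train], y_train, epochs=30, batch_size=32)
```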
Coming to the last one: I am a little unsure whether you are referring to the activation functions for the last layers of the user_NN and item_NN networks. If so, they use no activation function, or in other words the “linear” activation function, which is why no activation is mentioned for those layers. I hope this resolves the last ambiguity.
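To make the “linear” point concrete, here is a tiny numerical illustration with hypothetical shapes: the output layer still applies its weights and bias, it just skips any nonlinearity afterwards, so its raw output is the vector itself:

```python
import numpy as np

# "Linear" activation means g(z) = z: the layer still computes W @ a + b.
a_prev = np.random.rand(128)   # hypothetical output of the last hidden layer
W = np.random.rand(32, 128)    # hypothetical output-layer weights
b = np.random.rand(32)

v_u = W @ a_prev + b           # no nonlinearity applied; this IS v_u
print(v_u.shape)               # (32,)
```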
I also have a question on this slide, and I am adding it here as it is on the same topic:
Why does the output layer not have any activation function in user_NN or item_NN? And in that case, how does the output layer take the output from the last hidden layer and produce v_m or v_u?