" The reduced dataset has 𝑛𝑢=397 users, 𝑛𝑚=847 movies". Later on we have this:
I thought “item_train” was an array of movies where each row is a separate movie (with one column per movie feature), of which we have 847. Where does the 50884 come from, and what are these items?
This line of code produces 5 identical rows:
What does each row represent? User features repeated for each movie (which would all be identical)? Why do we need an array like this instead of just using a 1-dimensional vector per user?
I am just confused about the data we are using for training and how we arrive at it from the original catalog of 397 users, 847 movies and 25521 ratings.
Thank you, here is a quote from that thread:
“If you de-duplicate the data, you get 25521 unique rows.” - Halving 50884 comes to 25442, not 25521, which is 79 less. Does that mean that 25442 movie ratings were duplicated and 79 were not?
From the second thread: “I think we can attack a data problem in many different ways, and the way this assignment adopts is to “expand” user-movie ratings into several rows” - so the user vector gets replicated across several rows, for ease of programming/calculations/understanding?
Some ratings are repeated to boost the number of training examples for underrepresented genres.
First, it is “some ratings” get repeated, not “all ratings”. Also, getting repeated doesn't necessarily mean getting repeated exactly once; a row may appear more than twice.
“De-duplicate” means undoing that repetition. The totals alone only tell you that there are 50884 - 25521 = 25363 extra copies. If every repeated rating appeared exactly twice, that would mean 25363 ratings were repeated and the other 158 appeared once, but higher repetition counts would change those numbers.
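To see why halving the total does not recover the unique count, here is a minimal sketch with made-up rows (not the lab's data) that counts each row's multiplicity with NumPy's np.unique:

```python
import numpy as np

# Hypothetical training table where some rating rows are repeated,
# mimicking the lab's boosting of under-represented genres.
# Columns: [user_id, movie_id, rating].
rows = np.array([
    [1, 10, 4.0],
    [1, 10, 4.0],  # repeated once  -> 2 copies in total
    [2, 10, 3.5],
    [2, 10, 3.5],
    [2, 10, 3.5],  # repeated twice -> 3 copies in total
    [3, 20, 5.0],  # never repeated -> 1 copy
    [3, 30, 2.0],  # never repeated -> 1 copy
])

unique_rows, counts = np.unique(rows, axis=0, return_counts=True)

print(len(rows))         # 7 total rows (cf. 50884 in the lab)
print(len(unique_rows))  # 4 unique rows (cf. 25521 in the lab)
print(counts)            # [2 3 1 1] -> multiplicities vary, so
                         # total / 2 = 3.5 is not the unique count
```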
Is this clear?
We can move on to your next question after this one is cleared. Let me know.
It is for preparing the dataset to fit the requirements of the model.
The model requires us to give it a PAIR of x_u and x_m each time, so we pair them up. If a user has rated 5 movies, then there will be 5 pairs, and among those pairs we will see identical information about that user; see the sketch below.
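A minimal sketch of that pairing, with made-up feature values (the lab's actual feature columns differ):

```python
import numpy as np

# Hypothetical feature vectors, for illustration only.
x_u = np.array([22.0, 4.1, 3.9])          # one user's feature vector
rated_movies = np.array([                  # features of the 5 movies this user rated
    [2003, 3.9, 1.0, 0.0],
    [1999, 4.1, 0.0, 1.0],
    [2010, 3.5, 1.0, 1.0],
    [1995, 4.4, 0.0, 0.0],
    [2007, 3.8, 1.0, 0.0],
])

# Replicate the user vector once per rated movie, so that row i of
# user_train pairs with row i of item_train.
user_train = np.tile(x_u, (len(rated_movies), 1))
item_train = rated_movies

print(user_train.shape)  # (5, 3) -> 5 identical rows, one per pair
print(item_train.shape)  # (5, 4)
```

That is why the line of code earlier in the thread shows 5 identical rows: each row is the same user, paired with a different movie that user rated.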