" The reduced dataset has đť‘›đť‘˘=397 users, đť‘›đť‘š=847 movies". Later on we have this:

I thought "item_train" was an array of movies where each row is a separate movie (with the movie features as columns), of which we have 847? Where does the 50884 come from, and what are these items?

This line of code produces 5 identical rows:

What does each row represent? User features repeated for each movie (which would all be identical)? Why do we need an array like this instead of just using a 1-dimensional vector per user?
I am just confused about the data we are using for training and how we arrive at it from the original catalog of 397 users, 847 movies and 25521 ratings.

Thank you, here is a quote from that thread:
"If you de-duplicate the data, you get 25521 unique rows." - De-duplicating 50884 comes to 25442, not 25521, which is 79 less. Does that mean that 25442 movie ratings were duplicated and 79 were not?
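To make my confusion concrete, here is the arithmetic as a sketch. The counts are from the assignment and the thread; the assumption that repeated ratings appear at most twice is mine, just for illustration:

```python
# Counts from the assignment / thread
total_rows = 50884    # rows in item_train before de-duplication
unique_rows = 25521   # unique rows reported in the thread

# If EVERY rating appeared exactly twice, de-duplicating would halve it:
halved = total_rows // 2
print(halved)          # 25442, which is 79 short of 25521

# If instead some ratings appear twice and the rest appear once
# (x + y = unique_rows, 2x + y = total_rows), then:
x = total_rows - unique_rows   # ratings that appear twice
y = unique_rows - x            # ratings that appear only once
print(x, y)                    # 25363 158
```

So the "halve it" reasoning only works if every single rating were duplicated exactly once, which is what I am asking about.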

From the second thread: "I think we can attack a data problem in many different ways, and the way this assignment adopts is to 'expand' user-movie ratings into several rows" - so the user vector gets replicated n_m times - for ease of programming/calculations/understanding?

Some ratings are repeated to boost the number of training examples of underrepresented genres.

First, "some ratings" get repeated, not "all ratings". Also, getting repeated doesn't necessarily mean getting repeated only once.

"de-duplicate" means undoing that repetition.
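A minimal sketch of what I mean, on hypothetical toy rows (not the assignment's actual data) using pandas - some rows are repeated twice, some three times, some not at all, so the unique count is not simply half the total:

```python
import pandas as pd

# Toy "expanded" training rows with varying multiplicities
rows = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 2, 3],
    "movie_id": [10, 10, 20, 10, 10, 10, 30],
    "rating":   [4.0, 4.0, 3.5, 5.0, 5.0, 5.0, 2.0],
})

unique = rows.drop_duplicates()
print(len(rows), len(unique))  # 7 4
```

Here 7 rows de-duplicate to 4, not 3.5, precisely because the repetition counts vary per row.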

Clear about this?

We can move on to your next question after this one is cleared. Let me know.

For preparing the dataset to fit the requirements of the model.

The model requires us to give a PAIR of x_m and x_u each time, so we pair them up. If a user has rated 5 movies, then there will be 5 pairs, and among those pairs we will see identical information about that user.
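A quick sketch of that pairing in numpy. The array names and feature sizes here are illustrative, not the assignment's actual arrays:

```python
import numpy as np

# Hypothetical features: one user, and the 5 movies that user rated
user_vec = np.array([0.1, 0.9, 0.3])   # a single user's feature vector
movie_feats = np.random.rand(5, 4)     # one row of features per rated movie

# Replicate the user vector once per rating, so each row of user_train
# lines up with the corresponding row of item_train
user_train = np.tile(user_vec, (len(movie_feats), 1))
item_train = movie_feats

print(user_train.shape)   # (5, 3) -- 5 identical rows
assert (user_train == user_vec).all()
```

That is why you see identical user rows: each row is one (user, movie) training pair, and the user side repeats while the movie side changes.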