C3_W2_RecSysNN_Assignment dataset questions

I have a couple of questions on this assignment:

  1. "The reduced dataset has 𝑛𝑢=397 users, 𝑛𝑚=847 movies". Later on we have this:

image

I thought “item_train” was an array of movies where each row is a separate movie (with one column per movie feature), of which we have 847. Where does the 50884 come from, and what are these items?

  2. This line of code produces 5 identical rows:
    image
    What does each row represent? The user features for each movie (which would all be identical)? Why do we need an array like this instead of just using a 1-dimensional vector per user?
    I am just confused about the data we are using for training and how we arrive at it from the original catalog of 397 users, 847 movies and 25521 ratings.

Thank you!

Hello @Svetlana_Verthein

Please check out this thread, which explains how we went from 25521 ratings to 50884 records.

For the 5 identical rows, please check out this post.

If you have other questions, let me know here.

Cheers,
Raymond

  1. Thank you, here is from that thread:
    “If you de-duplicate the data, you get 25521 unique rows.” - De-duplicating 50884 comes to 25442, not 25521, which is 79 fewer. Does that mean that 25442 movie ratings were duplicated and 79 were not?

  2. From second thread: “I think we can attack a data problem in many different ways, and the way this assignment adopt is to “expand” user-movie ratings into several rows” - so the user vector gets replicated nm times - for the ease of programming/calculations/understanding?

thank you

@Svetlana_Verthein

Where do you get this number: 25442?

because 50884 / 2 = 25442

Why do you divide it by 2?

Isn’t that what “de-duplicate” means?

No.

If you read that thread again, it said

Some ratings are repeated to boost the number of training examples of underrepresented genre’s.

First, it says “some ratings” get repeated, not “all ratings”. Also, being repeated doesn’t necessarily mean being repeated only once.

“De-duplicate” means undoing those repeats, so that each distinct rating appears exactly once.
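To make the point concrete, here is a minimal sketch with made-up numbers (not the assignment's actual data or code). The `ratings` array and the `repeats` counts are hypothetical; the only thing it is meant to show is that when rows are repeated a varying number of times, de-duplicating is not the same as dividing the row count by 2.

```python
import numpy as np

# Toy (user_id, movie_id, rating) rows; the values are invented for illustration.
ratings = np.array([
    [1, 10, 4.0],
    [1, 11, 3.5],
    [2, 10, 5.0],
    [2, 12, 2.0],
])

# Hypothetical repeat counts: some rows appear once, others 2 or 3 times.
repeats = np.array([1, 3, 1, 2])
expanded = np.repeat(ratings, repeats, axis=0)

# De-duplicating keeps one copy of each distinct row.
deduplicated = np.unique(expanded, axis=0)

print(len(expanded))       # 7 rows after repeating
print(len(deduplicated))   # 4 unique rows, the original count
print(len(expanded) / 2)   # 3.5, which is not the number of unique rows
```

In the same way, 50884 / 2 would only recover the number of unique ratings if every rating had been repeated exactly once.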

Clear about this?

We can move on to your next question after this one is cleared. Let me know.

I see, yes, clear now

On your second question: the user vector is repeated to prepare the dataset in the form the model requires.

image

The model requires us to give it a PAIR of x_m and x_u each time, so we pair them up. If a user has rated 5 movies, there will be 5 pairs, and across those pairs we will see identical information about that user.
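As a rough sketch of that pairing (an assumption about the idea, not the assignment's actual preprocessing code), the snippet below builds one row per rating. The feature values, the dictionaries, and the names `user_train` and `y_train` are hypothetical; only `item_train` appears in the question above.

```python
import numpy as np

# Hypothetical feature vectors, invented for illustration.
user_features = {1: np.array([0.5, 4.0, 3.0])}
movie_features = {
    10: np.array([2001.0, 1.0]),
    11: np.array([1995.0, 0.0]),
    12: np.array([2010.0, 1.0]),
    13: np.array([1988.0, 0.0]),
    14: np.array([2004.0, 1.0]),
}

# One (user_id, movie_id, rating) tuple per rating: user 1 rated 5 movies.
ratings = [(1, 10, 4.0), (1, 11, 3.5), (1, 12, 5.0), (1, 13, 2.0), (1, 14, 4.5)]

# Row i of each array describes the same rating, so the arrays stay aligned.
user_train = np.array([user_features[u] for u, m, y in ratings])
item_train = np.array([movie_features[m] for u, m, y in ratings])
y_train = np.array([y for u, m, y in ratings])

print(user_train.shape)  # (5, 3): five identical rows for user 1
print(item_train.shape)  # (5, 2): one row per rated movie
```

Here the number of rows follows the number of ratings rather than the number of movies in the catalog, which is why `item_train` can have many more rows than there are movies.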