# C3_W2_RecSysNN_Assignment dataset questions

I have a couple of questions on this assignment:

1. "The reduced dataset has n_u = 397 users, n_m = 847 movies". Later on we have this:

I thought "item_train" was an array of movies where each row is a separate movie (with the movie features as columns), of which we have 847? Where does the 50884 come from, and what are these items?

2. This line of code produces 5 identical rows:

What does each row represent? User features per each movie (which would be all identical)? Why do we need an array like this instead of just using a 1-dimensional vector per user?
I am just confused about the data we are using for training and how we arrive at it from the original catalog of 397 users, 847 movies and 25521 ratings.

Hello @Svetlana_Verthein

Please check out this thread, which explains how we went from 25521 ratings to 50884 records.

For the 5 identical rows, please check this post out.

If you have other questions, let me know here.

Raymond

1. Thank you. Here is a quote from that thread:
"If you de-duplicate the data, you get 25521 unique rows." - De-duplicating 50884 comes to 25442, not 25521, which is 79 fewer. Does that mean that 25442 movie ratings were duplicated and 79 were not?

2. From the second thread: "I think we can attack a data problem in many different ways, and the way this assignment adopt is to 'expand' user-movie ratings into several rows" - so the user vector gets replicated n_m times, for ease of programming, calculation, and understanding?
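To make the "expand" idea concrete, here is a minimal sketch with made-up toy values (3 user features, 4 movies with 2 features each). The names `user_vec`, `item_vecs`, and `user_vecs` are illustrative assumptions, not necessarily the assignment's own variables or data. It shows one user's row repeated so that each copy lines up with one movie row, which is what lets a two-input model consume the pairs row by row.

```python
import numpy as np

# Hypothetical toy data, not taken from the assignment.
user_vec = np.array([[5.0, 3.5, 1.0]])                # one user, shape (1, 3)
item_vecs = np.arange(8, dtype=float).reshape(4, 2)   # four movies, shape (4, 2)

# Repeat the single user row once per movie, so that row i of user_vecs
# pairs with row i of item_vecs.
user_vecs = np.tile(user_vec, (len(item_vecs), 1))    # shape (4, 3)

print(user_vecs.shape)   # (4, 3)
print(user_vecs)         # four identical rows
```

The rows are identical because the user is the same in every pair; only the movie side changes from row to row.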

@Svetlana_Verthein

Where do you get this number: 25442?

because 50884 / 2 = 25442

Why do you divide it by 2?

Isn't that what "de-duplicate" means?

No.

Some ratings are repeated to boost the number of training examples of underrepresented genres.

First, only "some ratings" get repeated, not "all ratings". Also, getting repeated doesn't necessarily mean "getting repeated once".

"De-duplicate" means undoing those repeats.
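A small sketch may help show why de-duplicating is not the same as dividing by 2. The ratings and the repeat counts below are invented for illustration; the actual oversampling scheme used to build the 50884 rows is not shown in this thread.

```python
# Hypothetical toy ratings as (user_id, movie_id, rating) tuples.
unique_ratings = [(1, 10, 4.0), (1, 11, 3.5), (2, 10, 5.0), (2, 12, 2.0)]

# Assumed oversampling: each row appears a different number of times
# (1 means the row is not repeated at all).
copies = {(1, 10, 4.0): 1, (1, 11, 3.5): 3, (2, 10, 5.0): 1, (2, 12, 2.0): 2}
expanded = [row for row in unique_ratings for _ in range(copies[row])]

# De-duplicate: keep one copy of each distinct row, in first-seen order.
deduplicated = list(dict.fromkeys(expanded))

print(len(expanded))       # 7
print(len(deduplicated))   # 4
print(len(expanded) // 2)  # 3, which is not the de-duplicated count
```

In the same way, 50884 rows collapsing to 25521 unique rows just means the repeats add up to 50884 - 25521 = 25363 extra rows, spread unevenly over the ratings.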