C3_W2_RecSysNN_Assignment - pprint_train() returns duplicates for userid

Hello,

I noticed that the code snippet “pprint_train(user_train, user_features, uvs, u_s, maxcount=5)” returns the same user 5 times.

I assume this is an error?

Also, I don’t understand why movies with multiple genres have a training vector per genre.

I can see that these are not duplicates, but why not have a single line for each movie? Movie ID 6874 would then have one row with a value of 1 for each of ‘Action’, ‘Crime’ and ‘Thriller’.

I see later on that this results in multiple (possibly different) predictions for the same movie.

So which one would be considered the final prediction?

Thanks,
Stephen


Note: the assignment has been updated so that the following answer is no longer relevant. Please jump to this post for the latest explanation for why userid is duplicated in the user data table.

Hello Stephen, I think we can attack a data problem in many different ways, and the approach this assignment adopts is to “expand” the user-movie ratings into several rows. For example, if a user rated 3 movies and each movie has 3 genres associated, then you will end up seeing 1 (user) x 3 (movies) x 3 (genres) = 9 rows of data in both the user and the item data tables. Because all 9 rows belong to the same user, you will see 9 identical user rows in the user table. In the corresponding item table rows, you will see 3 movie ids repeated 3 times each, with a different genre selected in each of a movie’s rows.
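The row-count arithmetic above can be sketched in a few lines of Python. This is only an illustration of the older one-row-per-genre layout, not the assignment's actual preprocessing code. The user id, the ratings, and every movie id and genre other than 6874 (Action, Crime, Thriller, as mentioned in the question) are made up.

```python
# Hypothetical toy data: one user who rated 3 movies, each with 3 genres.
user_id = 2  # assumed user id, for illustration only
ratings = {6874: 4.0, 1001: 3.5, 1002: 5.0}  # movie_id -> made-up rating
movie_genres = {
    6874: ["Action", "Crime", "Thriller"],
    1001: ["Comedy", "Drama", "Romance"],       # made-up movie
    1002: ["Action", "Adventure", "Sci-Fi"],    # made-up movie
}

user_rows = []  # one entry per training example in the user table
item_rows = []  # the matching entry in the item table

# Old layout: one row per (user, movie, genre) combination.
for movie_id in ratings:
    for genre in movie_genres[movie_id]:
        user_rows.append((user_id,))
        item_rows.append((movie_id, genre))

print(len(user_rows), len(item_rows))  # 9 9
print(len(set(user_rows)))             # 1 -> all 9 user rows are identical
```

Each movie id shows up 3 times in `item_rows`, once per genre, which is why the same movie could receive several predictions under this layout.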

The assignment should have used the trained model for making suggestions, after the line in your last screenshot, so you can check out how the assignment uses the result :wink:


One note: The assignment has now been updated to include all of a movie’s genres in the same row.


Thank you Wendy!

Raymond

Hi Raymond, I have a clarification on this statement. In item_train, I see movie ids repeat but the genre selection is the same. I would think the movie id would repeat only with applicable genres selected.

Here is an example from the raw data:
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0

The copy-paste above is from the data as of Dec 2022, which is when I did the course. Perhaps it has changed since then.

Can you please clarify if possible? Thank you

Hello @Martin_Thomas_Mathew,

You are right that the data has been changed some time ago. Now we have something like this:

Consequently, the maths in my earlier post has to change to this:

For example, if a user rated 3 movies and each movie has 3 genres associated, then you will end up seeing 1 (user) x 3 (movies) = 3 rows of data in both the user and the item data tables, and because all these 3 rows belong to the same user, you will see 3 identical user rows in the user table, and in the corresponding item table rows, you will see 3 movie ids.
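As a rough sketch of the updated layout, the same toy example can be written with one row per rated movie and all of a movie's genres marked in that row. The genre list here is shortened and hypothetical, as are the user id and the movie ids other than 6874. The real item table also carries columns such as year and average rating that are left out here.

```python
# Hypothetical, shortened genre vocabulary (the real table has more columns).
GENRES = ["Action", "Comedy", "Crime", "Drama", "Thriller"]

user_id = 2  # assumed user id, for illustration only
movie_genres = {
    6874: ["Action", "Crime", "Thriller"],
    1001: ["Comedy", "Drama"],   # made-up movie
    1002: ["Action", "Drama"],   # made-up movie
}

def multi_hot(genres):
    """Return a 0/1 vector with a 1 for every genre the movie belongs to."""
    return [1 if g in genres else 0 for g in GENRES]

user_rows = []
item_rows = []

# New layout: one row per (user, movie) pair, genres combined in that row.
for movie_id, genres in movie_genres.items():
    user_rows.append([user_id])
    item_rows.append([movie_id] + multi_hot(genres))

print(len(user_rows), len(item_rows))  # 3 3
print(item_rows[0])                    # [6874, 1, 0, 1, 0, 1]
```

With this layout each rated movie contributes a single training example, so the user row repeats once per rated movie rather than once per genre.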

Thank you, @Martin_Thomas_Mathew, for bringing this up so that I can clarify it.

Cheers,
Raymond

Hello Raymond,

Thank you so much for your response. I appreciate it. I apologize for asking for further clarification.
So in the Dec 2022 data, the first 5 rows are the same as in your screenshot. But if I expand to, say, the first 15 rows, I see duplicates. Does that mean that in the latest version of the class and data, these duplicates are not there?
My screenshot below.

Similarly, in user_train, many users repeat more times than the number of movies they rated. For example, userid 2 rated 22 movies, so I would expect only 22 identical rows in user_train. But in user_train, userid 2 repeats more than 22 times, 69 times in total I think. Why would this be the case?

Your advice and clarification will be greatly appreciated.

Thank you,
Martin

Hello Martin,

The rows in the user table and the rows in the movie table correspond one-to-one. A user row repeats as many times as there are movies rated by that user. Under this rule, movie rows shouldn’t be repeated under the same user, right? However, according to the lab’s text in Section 3.1,

We therefore will see some movie (& its user) rows get duplicated.
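To see how much duplication a given copy of the data contains, one option is to count how often each (user, movie) pair occurs. The following is a minimal sketch on made-up rows. It assumes the user id is the first column of each user row and the movie id is the first column of each item row, which may not match your copy of the tables.

```python
from collections import Counter

# Made-up, row-aligned tables: row i of user_train pairs with row i of item_train.
user_train = [[2], [2], [2], [2], [3]]
item_train = [
    [74458, 2010],
    [74458, 2010],   # same (user, movie) pair again -> a duplicate
    [77455, 2010],
    [77455, 2010],   # duplicate
    [77455, 2010],   # a different user, so not a duplicate
]

# Count occurrences of each (user_id, movie_id) pair.
pair_counts = Counter((u[0], m[0]) for u, m in zip(user_train, item_train))

# Pairs that occur more than once are the duplicated training rows.
duplicated = {pair: n for pair, n in pair_counts.items() if n > 1}

# Compare total rows per user with distinct movies per user.
rows_per_user = Counter(u[0] for u in user_train)
movies_per_user = Counter(user for user, _ in pair_counts)

print(duplicated)                             # {(2, 74458): 2, (2, 77455): 2}
print(rows_per_user[2], movies_per_user[2])   # 4 2
```

If `rows_per_user` exceeds `movies_per_user` for a user, the extra rows are duplicates rather than additional rated movies.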

Cheers,
Raymond

Thank you so much, Raymond, for the clarification. It is much appreciated. Sincerely.