C3_W2_RecSysNN_Assignment - pprint_train() returns duplicates for userid

stesoye · September 2, 2022, 12:27pm

Hello,

I noticed that the code snippet “pprint_train(user_train, user_features, uvs, u_s, maxcount=5)” returns the same user 5 times.

I assume this is an error?

Also, I don’t understand why movies with multiple genres have one training vector per genre.

I can see that these are not duplicates, but why not have a single line for each movie? So movie ID 6874 would have one row with a value of 1 for each of ‘Action’, ‘Crime’ and ‘Thriller’

I see later on that this results in multiple (possibly different) predictions for the same movie.

So what would be considered to be the final prediction?

Thanks,
Stephen

rmwkwok · September 3, 2022, 2:23am

Note: the assignment has been updated so that the following answer is no longer relevant. Please jump to this post for the latest explanation for why userid is duplicated in the user data table.

Hello Stephen, I think we can attack a data problem in many different ways, and the way this assignment adopts is to “expand” the user-movie ratings into several rows. For example, if a user rated 3 movies and each movie has 3 genres associated with it, then you will end up with 1 (user) x 3 (movies) x 3 (genres) = 9 rows of data in both the user and the item data tables. Because all 9 rows belong to the same user, you will see 9 identical user rows in the user table. In the corresponding item table rows, you will see 3 movie ids repeated 3 times each, with a different genre selected in each of the rows for a given movie id.
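A rough sketch of that expansion, using made-up toy data rather than the assignment's actual loading code (movie 6874 and its genres are taken from the question; the other movie ids, their genres, and the user id are hypothetical):

```python
# Toy data. Movie 6874 and its genres come from the question above;
# the user id, the other two movie ids and their genres are made up.
user_id = 2
rated_movies = {
    6874: ["Action", "Crime", "Thriller"],
    1001: ["Comedy", "Drama", "Romance"],
    1002: ["Action", "Adventure", "Sci-Fi"],
}

user_rows = []  # one entry per (movie, genre) pair, always the same user
item_rows = []  # the matching (movie id, selected genre) pair

for movie_id, genres in rated_movies.items():
    for genre in genres:
        user_rows.append(user_id)
        item_rows.append((movie_id, genre))

print(len(user_rows), len(item_rows))  # both tables have the same number of rows
print(set(user_rows))                  # only one distinct user appears
```

Printing the first few rows of `user_rows` would therefore show the same user again and again, which is the behaviour asked about.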

The assignment should use the trained model to make suggestions after the line in your last screenshot, so you can check there how it uses the predictions.

Wendy · September 27, 2022, 9:23pm

One note: The assignment has now been updated to include all of a movie’s genres in the same row.

rmwkwok · September 27, 2022, 10:58pm

Thank you Wendy!

Raymond

Martin_Thomas_Mathew · March 16, 2023, 9:23pm

Hi Raymond, I have a clarification on this statement. In item_train, I see movie ids repeat but the genre selection is the same. I would think the movie id would repeat only with applicable genres selected.

Here is an example from the raw data:
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
74458,2010,4.022388059701493,0,0,0,0,0,0,0,1,0,0,1,0,0,1
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0
77455,2010,4.038461538461538,0,0,0,0,1,0,1,0,0,0,0,0,0,0

The copy-paste above is from the data as of Dec 2022, which is when I did the course. Perhaps it’s changed since then.

Can you please clarify if possible? Thank you

rmwkwok · March 16, 2023, 9:33pm

Hello @Martin_Thomas_Mathew,

You are right that the data has been changed some time ago. Now we have something like this:

Consequently, the maths in my earlier post has to change to this:

For example, if a user rated 3 movies and each movie has 3 genres associated with it, then you will end up with 1 (user) x 3 (movies) = 3 rows of data in both the user and the item data tables. Because all 3 rows belong to the same user, you will see 3 identical user rows in the user table, and in the corresponding item table rows you will see 3 movie ids, one per row.
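As a minimal sketch of the updated layout, assuming each movie's genres are encoded as a single multi-hot vector (the genre list, the user id, and two of the movie ids below are hypothetical and not the assignment's real columns):

```python
# Hypothetical genre columns; the real assignment uses its own, longer list.
GENRES = ["Action", "Comedy", "Crime", "Drama", "Thriller"]

user_id = 2
rated_movies = {
    6874: ["Action", "Crime", "Thriller"],  # genres taken from the question
    1001: ["Comedy", "Drama"],              # made-up movie
    1002: ["Action", "Drama", "Thriller"],  # made-up movie
}

def multi_hot(genres):
    """Return a 0/1 vector with a 1 for every genre the movie has."""
    return [1 if g in genres else 0 for g in GENRES]

user_rows = []
item_rows = []
for movie_id, genres in rated_movies.items():
    user_rows.append(user_id)                         # same user on every row
    item_rows.append([movie_id] + multi_hot(genres))  # all genres in one row

for row in item_rows:
    print(row)
```

With this layout the row count depends only on how many movies the user rated, not on how many genres each movie has.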

Thank you, @Martin_Thomas_Mathew, for bringing this up so that I can clarify it.

Cheers,
Raymond

Martin_Thomas_Mathew · March 17, 2023, 7:09pm

Hello Raymond,

Thank you so much for your response. I appreciate it. I apologize for the further questions.
So in the Dec 2022 data, the first 5 rows are the same as in your screenshot. But if I expand to, say, the first 15 rows, I see duplicates. Does that mean that in the latest version of the class and data, these duplicates are not there?
My screenshot is below.

Similarly, in user_train, many users repeat more times than the number of movies they rated. For example, userid 2 rated 22 movies, so I would expect only 22 identical rows in user_train. But userid 2 repeats more than 22 times in user_train; 69 times in total, I think. Why would this be the case?

Your advice and clarification will be greatly appreciated.

Thank you,
Martin

rmwkwok · March 18, 2023, 2:08am

Hello Martin,

The rows in the user table and the rows in the movie table correspond one-to-one. A user row repeats as many times as there are movies rated by that user. Under this rule, movie rows shouldn’t be repeated for the same user, right? However, according to the lab’s text in Section 3.1,

We therefore will see some movie (& its user) rows get duplicated.
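One way to check this on your own copy of the data is to compare, per user, the number of rows with the number of distinct movies. The sketch below uses two made-up, aligned id columns rather than the real user_train and item_train arrays, so the column names and values are assumptions:

```python
from collections import Counter

# Toy, hypothetical columns: entry i of user_col pairs with entry i of movie_col.
user_col = [2, 2, 2, 2, 2, 2, 3, 3]
movie_col = [74458, 74458, 77455, 77455, 77455, 6874, 6874, 6874]

# How many rows each user occupies in the table.
rows_per_user = Counter(user_col)

# Which distinct movies appear on each user's rows.
movies_per_user = {}
for user_id, movie_id in zip(user_col, movie_col):
    movies_per_user.setdefault(user_id, set()).add(movie_id)

for user_id, n_rows in rows_per_user.items():
    n_movies = len(movies_per_user[user_id])
    print(f"user {user_id}: {n_rows} rows, {n_movies} distinct movies, "
          f"{n_rows - n_movies} duplicated rows")
```

If the row count for a user exceeds their distinct-movie count, the difference is the number of duplicated (user, movie) rows.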

Cheers,
Raymond

Martin_Thomas_Mathew · March 18, 2023, 12:13pm

Thank you so much, Raymond, for the clarification. It is much appreciated. Sincerely.
