" The reduced dataset has 𝑛𝑢=397 users, 𝑛𝑚=847 movies". Later on we have this:
I thought “item_train” was an array of movies where each row is a separate movie (with one column per movie feature), of which we have 847. Where does the 50884 come from, and what are these items?
This line of code produces 5 identical rows:
What does each row represent? User features repeated for each movie (which would all be identical)? Why do we need an array like this instead of just using a 1-dimensional vector per user?
I am just confused about the data we are using for training and how we arrive at it from the original catalog of 397 users, 847 movies and 25521 ratings.
Thank you, here is a quote from that thread:
“If you de-duplicate the data, you get 25521 unique rows.” - Halving 50884 comes to 25442, not 25521, which is 79 less. Does that mean that 25442 movie ratings were duplicated and 79 were not?
From the second thread: “I think we can attack a data problem in many different ways, and the way this assignment adopts is to “expand” user-movie ratings into several rows” - so the user vector gets replicated across several rows, for ease of programming/calculations/understanding?
Some ratings are repeated to boost the number of training examples for underrepresented genres.
First, it is “some ratings” get repeated, not “all ratings”. Also, getting repeated doesn't necessarily mean getting repeated exactly once; a row may appear more than twice.
“De-duplicate” means undoing that repetition. The totals alone only tell you that there are 50884 - 25521 = 25363 extra copies. If every repeated rating appeared exactly twice, that would mean 25363 ratings were repeated and the other 158 appeared once, but higher repetition counts would change those numbers.
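To see why halving the total does not recover the unique count, here is a minimal sketch with made-up rows (not the lab's data) that counts each row's multiplicity with NumPy's np.unique:

```python
import numpy as np

# Hypothetical training table where some rating rows are repeated,
# mimicking the lab's boosting of under-represented genres.
# Columns: [user_id, movie_id, rating].
rows = np.array([
    [1, 10, 4.0],
    [1, 10, 4.0],  # repeated once  -> 2 copies in total
    [2, 10, 3.5],
    [2, 10, 3.5],
    [2, 10, 3.5],  # repeated twice -> 3 copies in total
    [3, 20, 5.0],  # never repeated -> 1 copy
    [3, 30, 2.0],  # never repeated -> 1 copy
])

unique_rows, counts = np.unique(rows, axis=0, return_counts=True)

print(len(rows))         # 7 total rows (cf. 50884 in the lab)
print(len(unique_rows))  # 4 unique rows (cf. 25521 in the lab)
print(counts)            # [2 3 1 1] -> multiplicities vary, so
                         # total / 2 = 3.5 is not the unique count
```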
Is this clear?
We can move on to your next question after this one is cleared. Let me know.
It is for preparing the dataset to fit the requirements of the model.
The model requires us to give it a PAIR of x_u and x_m each time, so we pair them up. If a user has rated 5 movies, then there will be 5 pairs, and among those pairs we will see identical information about that user; see the sketch below.
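A minimal sketch of that pairing, with made-up feature values (the lab's actual feature columns differ):

```python
import numpy as np

# Hypothetical feature vectors, for illustration only.
x_u = np.array([22.0, 4.1, 3.9])          # one user's feature vector
rated_movies = np.array([                  # features of the 5 movies this user rated
    [2003, 3.9, 1.0, 0.0],
    [1999, 4.1, 0.0, 1.0],
    [2010, 3.5, 1.0, 1.0],
    [1995, 4.4, 0.0, 0.0],
    [2007, 3.8, 1.0, 0.0],
])

# Replicate the user vector once per rated movie, so that row i of
# user_train pairs with row i of item_train.
user_train = np.tile(x_u, (len(rated_movies), 1))
item_train = rated_movies

print(user_train.shape)  # (5, 3) -> 5 identical rows, one per pair
print(item_train.shape)  # (5, 4)
```

That is why the line of code earlier in the thread shows 5 identical rows: each row is the same user, paired with a different movie that user rated.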