Hello, could someone please clarify why redundant user data is used (i.e., the same user id with the same data repeated across duplicate rows) in the content-based filtering programming assignment?
Thanks
@Muhammad-Kalim-Ullah, I think this is just an implementation choice, made to simplify the assignment.
Basically, load_data()
treats item_train, user_train, and y_train as different portions of one big table where each row represents a user+movie+rating combo (e.g. the first row represents the first user, the first movie, and that user’s rating for that movie, etc).
This implementation makes it easy to match each target rating to the appropriate user+movie pair. For example, when splitting the training and test sets, as long as you apply the same split to item_train, user_train, and y_train, you are assured that each y value still goes with the corresponding user/item pair.
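To make the parallel-table idea concrete, here is a minimal sketch. The array names item_train, user_train, and y_train come from the assignment; everything else (the toy feature vectors, the ratings list, the 80/20 split) is made up for illustration and is not the assignment's actual code:

```python
import numpy as np

# Hypothetical per-user and per-movie feature vectors (invented for illustration)
user_features = {1: [4.0, 3.5], 2: [2.0, 5.0]}    # user id -> user feature vector
movie_features = {10: [1.0, 0.0], 20: [0.0, 1.0]}  # movie id -> movie feature vector

# Ratings as (user_id, movie_id, rating) combos -- one row per rating
ratings = [(1, 10, 4.5), (1, 20, 3.0), (2, 10, 2.5)]

# Build three parallel arrays: row i of each array describes the same rating.
# User 1's feature vector appears twice because user 1 rated two movies --
# this is exactly the "duplicate user rows" the question asks about.
user_train = np.array([user_features[u] for u, m, r in ratings])
item_train = np.array([movie_features[m] for u, m, r in ratings])
y_train = np.array([r for u, m, r in ratings])

# Splitting: apply the SAME permutation to all three arrays, so row i of
# the split user, item, and y arrays still refers to the same rating.
rng = np.random.default_rng(1)
perm = rng.permutation(len(y_train))
split = int(0.8 * len(y_train))
train_idx, test_idx = perm[:split], perm[split:]
user_tr, user_te = user_train[train_idx], user_train[test_idx]
item_tr, item_te = item_train[train_idx], item_train[test_idx]
y_tr, y_te = y_train[train_idx], y_train[test_idx]
```

The duplication is just denormalization: rather than looking a user's features up by id at training time, the feature vector is copied into every row where that user appears, so the three arrays can be indexed and split in lockstep.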
But the duplicate rows are not strictly required, are they? I think they are just there so that user_train and item_train have an equal number of rows, since completely duplicated rows make no sense except to keep the user and movie arrays aligned. Am I right?
Hello @Muhammad-Kalim-Ullah, please check out this post for how we come up with the apparently duplicated user rows.