Content-Based Filtering lab question

Hi, I have a question about the implementation in the lab.

Why does each movie have multiple entries in the training set? I get that they are represented as one-hot vectors, but (using the lab’s example) is there any reason movie ID 6874 cannot have 1s for Action, Crime and Thriller in a single row instead of in separate rows?

Also, wouldn’t this decrease the effectiveness of the algorithm, given that we are losing the information that the same movie has multiple genres? What am I missing?


Hi @7arunb

maybe this will be helpful:

Thanks @Lukasz_S

I have a similar question to the second comment on the StackExchange thread you linked to: if a movie has multiple genres, why not represent it as [1,0,0,1] instead of as two rows, [1,0,0,0] and [0,0,0,1]?
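
To make the two options concrete, here is a minimal sketch in plain Python. The four-genre vocabulary and the helper names `one_hot_rows` and `multi_hot_row` are made up for illustration and are not taken from the lab; only the movie ID and its three genres come from the example above.

```python
# Hypothetical genre vocabulary; the lab's real genre list may differ.
GENRES = ["Action", "Comedy", "Crime", "Thriller"]


def one_hot_rows(movie_id, genres):
    """One row per (movie, genre) pair, each vector containing a single 1."""
    return [
        (movie_id, [1 if g == genre else 0 for g in GENRES])
        for genre in genres
    ]


def multi_hot_row(movie_id, genres):
    """A single row per movie, with a 1 for every genre the movie has."""
    return (movie_id, [1 if g in genres else 0 for g in GENRES])


movie_genres = ["Action", "Crime", "Thriller"]  # movie ID 6874 from the question

split_rows = one_hot_rows(6874, movie_genres)     # three rows, one genre each
combined_row = multi_hot_row(6874, movie_genres)  # one row, three genres

# Summing the split rows column by column gives back the combined vector.
recovered = [sum(column) for column in zip(*(vec for _, vec in split_rows))]

for row in split_rows:
    print("split:   ", row)
print("combined:", combined_row)
print("recovered from split rows:", recovered)
```

Within this sketch, the column-wise sum of the split rows equals the combined vector, so the set of genres is still recoverable from the table as a whole. What differs is that no single split row shows that the genres belong to the same movie, which is the concern raised in the question.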