Favoring recently watched movies in recommender systems

To make the performance of content-based filtering acceptable, the course describes a mechanism for retrieving movies close to what the user has recently watched. This has another advantage: recently watched movies are more suggestive of the user’s recent “mood”, and therefore point to movies that the user will likely want to pick next.

Is there another, better way of separating the recently watched movies from the ones watched earlier? Is it possible to weight these differently? After all, the algorithm is going to give equal weight to all movies that the user has watched. The information that recent movies are more indicative of the user’s next choice is simply unknown to the algorithm.

(1) Is there a way to introduce the time the movie was watched by the user into the input feature set?

(2) Even if we do manage to come up with something, it is a feature that involves both the movie and the user, so in content-based filtering it seems we can’t put it down as a feature of either the movie or the user. A related question, then: how do we fit features that result from the combination of the movie and the user (and are not features of the user alone, or the movie alone) into a content-based filtering system?

(3) On a somewhat related note - is it possible to influence the regularization parameters (or something like them) to increase the weights the system gives to some of the features? This is for exactly these cases where we know better than the algorithm. For instance, in the above case, we could dynamically set up the regularization parameter for W(k) at the input layer based on when the movie was watched - e.g., (lambda / 2) * time-factor * w(k)**2, where time-factor changes based on when the movie was watched.

Thanks in advance - excellent set of courses - thoroughly enjoyed every bit of it!

Hello @QuarkNJaguar,

Welcome to this community!

I am not going to discuss (1), (2) & (3) for now, because I want to raise a question first: if we add time as a feature, then when we make a prediction (recommendation), we will need to supply a time to the model as well. My question is, at prediction time, what should the time be? I think this question is important, because if we don’t know what it should be, then adding time as a feature is not a feasible move, and discussing (1), (2) & (3) in this context is not a good idea.

Instead, I would suggest adding a weight to each sample (note that although we have separate user and movie tables, their rows correspond one-to-one, so the first row of the user table and the first row of the movie table form a sample). Then we use these sample weights in the cost function INSTEAD of as a feature to the model. The idea is that “recent” samples weigh more in the cost function, so the trained algorithm will be biased more towards the recent samples.

Looking at the notebook again, it uses tf.keras.losses.MeanSquaredError() as the cost function, and the good news is that it accepts a sample_weight argument (Refer to this documentation). Therefore, when calling .fit(...) to train the model, we can supply an additional argument called sample_weight (see this), and give it the sample weights just like we give .fit(...) our samples.
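As a minimal sketch of what such weights could look like: the exponential decay, the 30-day half-life, and the helper names `recency_weights` and `weighted_mse` are all assumptions made for illustration, not something taken from the notebook. The plain-Python `weighted_mse` only shows the effect of the weights on a squared-error cost; it is not a claim about how Keras reduces the loss internally.

```python
def recency_weights(watch_days, now_day, half_life_days=30.0):
    """Hypothetical helper: weight 1.0 for a sample watched today,
    halving for every half_life_days of age."""
    return [0.5 ** ((now_day - d) / half_life_days) for d in watch_days]


def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error, normalized by the sum of the weights
    (the normalization is a choice made for this sketch)."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


# Day on which each (user, movie) sample was watched; values are made up.
watch_days = [100, 160, 190]
weights = recency_weights(watch_days, now_day=190)

# The same rating error costs more on the most recent sample than on the oldest.
y_true = [5.0, 5.0, 5.0]
cost_old_wrong = weighted_mse(y_true, [4.0, 5.0, 5.0], weights)
cost_new_wrong = weighted_mse(y_true, [5.0, 5.0, 4.0], weights)
```

A list like `weights` (converted to an array with one entry per training row) is the kind of thing that would be passed as `sample_weight` to `.fit(...)`.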

Using sample_weight is one way to emphasize the “recent” samples. Simply removing outdated samples is another. Neither way requires additional features that are not easy to define at the time of prediction (making a recommendation).
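Removing outdated samples can be seen as the extreme case of weighting, where each sample gets a weight of either 1 or 0. A small sketch, assuming each sample is a dict with a `watch_day` field and a 90-day cutoff (both are made up for this example):

```python
def keep_recent(samples, now_day, max_age_days=90):
    """Hypothetical helper: keep only samples watched within the last max_age_days."""
    return [s for s in samples if now_day - s["watch_day"] <= max_age_days]


# Made-up (user, movie, rating) samples with the day each was watched.
samples = [
    {"user": 1, "movie": 10, "rating": 4.0, "watch_day": 50},
    {"user": 1, "movie": 11, "rating": 5.0, "watch_day": 150},
    {"user": 1, "movie": 12, "rating": 3.0, "watch_day": 190},
]

# Only the samples no older than 90 days on day 190 remain for training.
recent = keep_recent(samples, now_day=190)
```

The cutoff is a hard choice; the decaying weights are a softer version of the same idea.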

What do you think?

Cheers,
Raymond

Hi Raymond,

I see your point in the first paragraph around time being a feature, and the rest of the message that proposes adding it as a weight - thanks for helping me understand.

I now get how adding it as a weight to the MeanSquaredError cost evaluation would work - like you mention, the prediction Y(i,j) is after all for a user + movie combination, and we can simply find the time when the user watched the movie and create an appropriate weight to pass to that function. That sounds like a simple solution to the problem I had.

Getting back to your question in the first paragraph - I suppose the only real advantage (of adding time as a feature) over the above simpler solution (of adding it as a weight) would really be around the training itself. There’s an opportunity to “learn” the effect of recently watched movies instead of trying to arbitrarily come up with a generalized metric around the “mood” of the user.

So during the training, the time could be when (in say, Julian days) a specific user watched a movie. During the prediction, the time used would be the current time (in say, Julian days) when the user is about to watch the recommendation. The problem then would be to somehow introduce it as a feature so that the algorithm can automatically determine the weight it needs. It cannot be added to Ym or Yu since they are a contrived set of values that bear no resemblance to anything that’s a recognizable feature of a user or a movie.

In any case, that gets me back to the problem of how to add such a thing as a feature. It’s a feature of the combination of the user and movie - i.e., it has the same dimensions as the rating matrix (nu, nm), but it is not the training target, just another feature that is to be given an appropriate weight.

Hmm… Perhaps we can start with the pre-learnt movie model, and introduce as a new input feature the time (again, in Julian days) the user has seen each of the specific movies? In the interests of performance, it can perhaps start with a zero weight for this new feature, the previously learnt weights for the other movie features, and come up with a new set of weights solving for the entire network.

  • When a prediction of a rating needs to be made, the input could be the current time the user is about to see the new movie.
  • When the closest movies are to be found based on a selection, ||vm(k) - vm(i)||^2 would inconsistently include/exclude the time component for movies that were (just) watched. Not sure what to do about that.

Not sure if I just confused myself there :slight_smile: - please let me know if you had additional thoughts.

Thanks again!

Hello @QuarkNJaguar,

Here are my follow-up concerns about the idea itself. I will make it plain.

My follow-up # 1:

In this case, the value of the time feature in the prediction input is always larger than the largest value seen in the training set. Do you agree with this?

If you agree, you will be asking the model to extrapolate in the domain of the time feature. Do you agree with this?

Neural networks are good at interpolation, but I am not sure they are as good at extrapolation.

My follow-up # 2

It seems to me that you are suggesting that having such a feature has the effect of favoring “recent” movies in the results. However, since a neural network is a non-linear function, how the output varies with the input time feature cannot be guaranteed. If the relation is non-linear (which is very, very likely), and if the non-linearity has, say, 10 turning points (which is one possible non-linear form), there is no guarantee that it says “the more recent the better”. To guarantee that, you probably want a linear relationship with respect to time, but how can you guarantee that will happen? Remember, you intend the time feature to carry such a characteristic in the model.

See, adding a feature is just the beginning of the process; the justification behind it is another part. I am not doubting that you can add a feature, or any feature, but what is the justification, or why would it even work?

This is not good enough to guarantee that time will carry the power you are hoping for.
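To make the “turning points” concern concrete, here is a tiny sketch of a one-hidden-layer ReLU network with a single time input. The weights are hand-picked for illustration, not trained, so this only shows what the function class allows, not what training would actually produce.

```python
def relu(z):
    return max(0.0, z)


def tiny_net(t):
    """One hidden layer of three ReLU units on a single input t.
    All weights and biases are hand-picked (hypothetical), not learned."""
    h1 = relu(t - 1.0)
    h2 = relu(t - 2.0)
    h3 = relu(t - 3.0)
    return 1.0 * h1 - 2.0 * h2 + 2.0 * h3


# Evaluate at increasing times: the output rises, falls, then rises again,
# so "later" does not consistently mean "higher".
times = [0.0, 1.0, 2.0, 3.0, 4.0]
outputs = [tiny_net(t) for t in times]
```

With only three hidden units the output already changes direction twice as time increases, so nothing in the architecture itself enforces “the more recent the better”.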

Below is not a follow-up but a response:

I think we should think about this from a different angle. The meaning of the two tables is pretty artificial and carries no importance to the algorithm itself. It is totally your decision to make when it comes to adding any feature to either of the two tables. I don’t see any programmatic problem if you add the time feature to the movie table, for example. On the contrary, I think the problem is more about the justification.

Note: even for the same movie, we have one row per user, such that if 10 users viewed the same movie, we have 10 rows for that movie in the movie table.

Raymond

Hello @QuarkNJaguar, I hope you are still thinking about this :wink:

Let me give a few more different perspectives:

  1. Our NN predicts the ratings, and it is trained to minimize the error between the predicted and the true ratings. Right? So any new input feature will only help predict the correct ratings. Agree? Without changing the meaning of the output and the cost function, an additional feature will not serve a different purpose. So adding a time won’t favor recently watched movies. Adding a time, instead, will only let the NN figure out how to use it to predict the right ratings.

  2. Using “sample weights” does not change the above behavior; it focuses the training on “recent” samples, so that we can expect the NN’s predictions to reflect those “recent” samples more.

  3. As said, adding a time will only be for the NN to use in predicting the ratings. This is not a bad idea. In fact, I would consider adding either (i) time_of_watching - time_of_movie_release or (ii) both time_of_watching and time_of_movie_release as new features, and see how relevant they are to the prediction of ratings. Certainly, it won’t achieve your goal. It’s possible that, after training, we find the time to be irrelevant, or very relevant - we can’t control it.
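Options (i) and (ii) in the last point could be computed along these lines. This is only a sketch: the function name, the returned field names, and the choice of day ordinals as the numeric encoding are assumptions made for illustration.

```python
from datetime import date


def time_features(watched_on, released_on):
    """Hypothetical helper returning both variants of the time features."""
    return {
        # Option (i): how old the movie was when the user watched it.
        "movie_age_at_watch_days": (watched_on - released_on).days,
        # Option (ii): both dates as separate numeric features.
        "watch_ordinal": watched_on.toordinal(),
        "release_ordinal": released_on.toordinal(),
    }


# Made-up dates for one (user, movie) sample.
features = time_features(date(2023, 3, 1), date(2023, 1, 1))
```

Whether either variant turns out to be relevant to the ratings is something only training and evaluation can tell.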

We sometimes call the cost function the “objective function” because it is the training process’s objective to minimize the cost. Here the objective is to minimize the error in the ratings, not to favor recently watched movies.

I hope this will give you some more ideas, and maybe you will come up with a different strategy to make your idea happen, if you think further out of the box?

Cheers,
Raymond

Hi Raymond,

First, a big thank you for those insightful comments and getting me to think through this.

It has now sunk in that an NN is really about “interpolation” & classification based on values “nearby” to the training set, and not about extrapolation. The underlying regression model leads to thinking it can extrapolate, but I suppose a chain of activations creates a boundary that prevents the extrapolation.

I suppose these boundaries are the cause of the “non-linear”-ness also. Even in a chain of “relu” activations (that is piecewise linear), though the individual continuous parts of the boundary are linear (lines in 2D, planes in 3D, hyperplanes in multi dimensional space), the collection of these boundaries is not continuous and non-linear (“turning points” in your description). Somewhat related question - would the boundary of classification by a chain of relu-activations be always “convex”?

Hope I understood that right so far, but in any case, I realize the time based classification will not work the way I thought it would. Like you mention, the cost function, and therefore the algorithm, would be incentivized to classify a specific time, and classifying “nearby” times is more of a side effect of using, for the sake of classification, a numerical range within the continuous space of numbers.

Thanks!

Hello @QuarkNJaguar,

Agree.

It is continuous but not linear.

What do you mean by a boundary in a classification problem being convex? Can you describe that in another way?

Favoring recently watched movies makes a lot of sense. In fact, I would consider running more than one recommendation system, one of which could focus on recently watched movies. Nobody said we have to use one and only one recommendation system to produce a list of recommendations.
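One possible way to combine two such systems is to interleave their ranked lists. This is only a sketch of one blending rule: the function name `blend`, the alternating order, and the assumption that the second list comes from a recommender focused on recently watched movies are all illustrative choices.

```python
from itertools import zip_longest


def blend(general, recent_based, k=5):
    """Hypothetical helper: alternate between two ranked lists (recent-focused
    list first), skipping duplicates, and return the top k movies."""
    merged, seen = [], set()
    for pair in zip_longest(recent_based, general):
        for movie in pair:
            if movie is not None and movie not in seen:
                seen.add(movie)
                merged.append(movie)
    return merged[:k]


# Made-up ranked outputs from two separate recommenders.
general_list = ["A", "B", "C"]
recent_list = ["B", "D"]
final_list = blend(general_list, recent_list, k=4)
```

How much of the final list each system contributes is a design choice; alternating one-for-one is just the simplest option.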

Cheers,
Raymond

Hi Raymond,

For just two features (predictions in the image), and using just a chain of “relu” activations, I am assuming the boundary for classification (for the red circles) will always be a “convex” polygon (all angles < 180)? i.e., anything in that polygon will be classified as a circle. And in the case of more features, these are ‘hyper planes’ that always create a convex shape? I am assuming this because the relu activation just seems to partition the space along an infinite line/plane/hyper-plane and accept one side.

[Image: the two-feature classification plot with red circles referred to above]

So how would you set up this recommender system that weights recently watched movies over others? Any pointers or references?

Thanks

I see. I am not sure people use “convex” to describe this, but I think your picture is right. Four straight lines is the simplest solution, and in a real case, the boundaries can look more complex with more layers/neurons because the training algorithm won’t guarantee the simplest solution. For a two-feature input, the outputs of the first layer should be boundary lines, and from the second layer onwards they should be boundary polygons. Great observation @QuarkNJaguar!
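The “four straight lines” picture can be sketched with a hand-built two-layer network. The lines, the unit-square region, and the function names are made up for illustration, and the weights are not trained. Each first-layer ReLU unit is zero on one side of its line, and the second layer sums them, so the accepted region is the intersection of four half-planes, which is a convex polygon in this particular construction (all second-layer weights are positive here).

```python
def relu(z):
    return max(0.0, z)


# Each first-layer unit computes relu(a*x + b*y + c); it is zero on the
# "accepted" side of the line a*x + b*y + c = 0. Values are hand-picked.
LINES = [
    (-1.0, 0.0, 0.0),   # positive only when x < 0
    (1.0, 0.0, -1.0),   # positive only when x > 1
    (0.0, -1.0, 0.0),   # positive only when y < 0
    (0.0, 1.0, -1.0),   # positive only when y > 1
]


def inside(x, y):
    """Second layer: sum the first-layer outputs. A total of zero means the
    point is on the accepted side of all four lines (the unit square)."""
    total = sum(relu(a * x + b * y + c) for a, b, c in LINES)
    return total == 0.0


center_in = inside(0.5, 0.5)
outside_right = inside(1.5, 0.5)
```

This sketch covers only this simple hand-built case; a trained network with more layers or neurons is not restricted to it.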

Cheers,
Raymond