You are correct - what is being asked for is the sum of the weights. And yes, the comment is strangely worded; I think it should have been “Calculate” instead of “Count”.
What is this y_weights in compute_accuracy? I don’t remember seeing that in the lecture so far.
Isn’t this information specific to this Trax API?
What is the motivation for it? In the past, when we calculated the accuracy, we would take the “number of correct predictions” divided by the “total number of predictions.” What is the motivation for the weight? Are we saying that some test instances matter more than others?
The most obvious and common one is <pad> tokens, which usually have a weight of 0, meaning that predictions at those positions, right or wrong, do not contribute to the loss and therefore do not influence the model’s parameters.
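To make the weighting concrete, here is a minimal sketch of a weighted accuracy in NumPy (the function and variable names are hypothetical, not the actual Trax implementation):

```python
import numpy as np

def compute_accuracy(preds, targets, weights):
    """Weighted accuracy: each position's correctness is multiplied by its
    weight, and the total is normalized by the sum of the weights.
    Positions with weight 0 (e.g. <pad>) drop out entirely."""
    correct = (preds == targets).astype(np.float32)
    return float(np.sum(correct * weights) / np.sum(weights))

preds   = np.array([3, 7, 7, 0, 0])
targets = np.array([3, 7, 2, 0, 0])
weights = np.array([1, 1, 1, 0, 0])  # last two positions are <pad>

# 2 correct predictions out of 3 weighted positions -> 2/3
print(compute_accuracy(preds, targets, weights))
```

Note that the padded positions happen to “match” (both predict 0), but because their weight is 0 they neither help nor hurt the score.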
But there are other use cases, depending on the application: old data can be weighted less than “fresh” data (stock market prediction, etc.); different data sources can be weighted differently (examples from less precise or reliable instruments might be down-weighted, or in NLP, “chat forum” data vs. Wikipedia); and subjective “expert” input can be encoded (data with higher variance might be down-weighted because it is more likely to contain outliers).
In other words, there are many cases when it’s useful to not treat every data point equally.
I’m not sure what you mean by that. These (that I mentioned) are different use cases, and the idea (different weights for different samples) is widely used. One of the earliest uses of it that I remember predates Deep Learning and was/is used in Reinforcement Learning as a discount factor (for example in Q-Learning). The discount factor represents the preference for immediate rewards over future rewards; in simpler terms, receiving $1 today is preferred over receiving the same amount two years from now.
By the way, there’s another important use case I forgot to mention - class imbalance. This occurs when one or more classes are more frequent than others. In such cases, it becomes essential to weigh the loss differently for various samples, based on whether they belong to the majority or minority classes. Essentially, we want to assign a higher weight to the loss encountered by samples associated with the minority classes.
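A common heuristic for this is inverse-frequency weighting (the same idea behind scikit-learn’s “balanced” class weights); a minimal sketch, with a hypothetical `class_weights` helper:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class).
    Rare classes get larger weights, common classes smaller ones."""
    classes, counts = np.unique(labels, return_counts=True)
    w = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

labels = np.array([0, 0, 0, 0, 1])  # class 0 is 4x more frequent
print(class_weights(labels))
# class 0 -> 5 / (2 * 4) = 0.625, class 1 -> 5 / (2 * 1) = 2.5
```

Each sample’s loss would then be multiplied by the weight of its class, so a mistake on the rare class costs the model more.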
I am talking about math or algebra. E.g., let’s take the idea of a norm. Usually, we think of it as |x dot x|. A generalization might be |Ax dot x| for a linear operator A.
It’s just a simple element-wise product (or, to use a fancier term, the Hadamard product). For example, if the output is [0.9, 0.3, -2.1, 3.5] and the mask is [1, 1, 1, 0], the result after applying the mask would be [0.9, 0.3, -2.1, 0], i.e. the masked position is zeroed out. You can easily extend this concept from vectors to matrices.
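In NumPy this is just the `*` operator between arrays of the same shape:

```python
import numpy as np

output = np.array([0.9, 0.3, -2.1, 3.5])
mask   = np.array([1, 1, 1, 0])

# Element-wise (Hadamard) product: the masked position becomes 0.
masked = output * mask
print(masked)  # [ 0.9  0.3 -2.1  0. ]
```

The same line works unchanged for 2-D arrays (e.g. a batch of sequences with a per-token mask), thanks to element-wise semantics.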