Is multi-task learning always immune to missing labels of some of the tasks?

In the course, Andrew mentioned that it’s okay if some of the tasks are missing a few labels. I took this as a general statement, but I have since seen other examples that made me wonder whether it only applies to the specific cost function he used.

It made sense with the cost function he used, which only sums the cross entropy over the available labels. But I came across an example today where I feel the statement doesn’t apply:
This multi-task model consists of a classification task for gender and a regression task for age. The two tasks had separate cost functions: sparse categorical cross entropy and MSE. The cost of the model was the weighted sum of the two cost functions. The code was written in Keras, and the relevant piece looks like this:

model.compile(optimizer='rmsprop',
              loss={'age_output': 'mse', 'gender_output': 'sparse_categorical_crossentropy'},
              loss_weights={'age_output': .001, 'gender_output': 1.})

The Keras documentation for loss_weights says:

  • loss_weights : Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs. The loss value that will be minimized by the model will then be the weighted sum of all individual losses, weighted by the loss_weights coefficients. If a list, it is expected to have a 1:1 mapping to the model’s outputs. If a dict, it is expected to map output names (strings) to scalar coefficients.

It doesn’t say anything about how the loss is handled when the label for one of the tasks is missing, for example whether that task’s term is skipped or the other task’s cost is rescaled. So it seems that in this case a missing label will affect the result.
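To make the concern concrete, here is a minimal sketch in plain Python of a weighted two-task loss that skips missing labels. It is not the Keras implementation and does not use the Keras API; the function name `multitask_loss`, the sample dictionary keys, and the use of `None` to mark a missing label are all assumptions made for illustration. Only the weights (.001 for age, 1. for gender) come from the snippet above. With the built-in string losses, this kind of masking would most likely have to be supplied by you.

```python
import math

def multitask_loss(samples, w_age=0.001, w_gender=1.0):
    """Weighted sum of per-task mean losses, skipping missing labels.

    Hypothetical helper. Each sample is a dict with:
      age_true (float or None), age_pred (float),
      gender_true (int class index or None), gender_probs (list of floats).
    """
    age_terms, gender_terms = [], []
    for s in samples:
        if s["age_true"] is not None:
            # Squared error for the regression task
            age_terms.append((s["age_pred"] - s["age_true"]) ** 2)
        if s["gender_true"] is not None:
            # Sparse categorical cross entropy: -log(prob of the true class)
            gender_terms.append(-math.log(s["gender_probs"][s["gender_true"]]))
    # Average each task only over the samples that actually carry its label.
    age_loss = sum(age_terms) / len(age_terms) if age_terms else 0.0
    gender_loss = sum(gender_terms) / len(gender_terms) if gender_terms else 0.0
    return w_age * age_loss + w_gender * gender_loss

batch = [
    # Age labelled, gender missing
    {"age_true": 30.0, "age_pred": 40.0, "gender_true": None, "gender_probs": [0.5, 0.5]},
    # Gender labelled, age missing
    {"age_true": None, "age_pred": 99.0, "gender_true": 1, "gender_probs": [0.2, 0.8]},
]
loss = multitask_loss(batch)
```

In this sketch the age prediction of the second sample can take any value without changing `loss`, because that sample has no age label to be compared against.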

Hi, thoughtful question. A missing label doesn’t affect training much because an output that isn’t labelled isn’t judged. Think about it like this: you’re showing the model an image with a cat and a dog, with only the cat labelled. This is equivalent to showing the network the image and asking “do you see a cat?” The network answers yes or no, and the cost pushes it in the right direction. Not labelling the dog is not a problem, because you’re not punishing the network even if it does predict a dog.
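The cat-and-dog case can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the course’s code: the helper name `masked_bce` and the use of `None` for an unlabelled class are hypothetical choices.

```python
import math

def masked_bce(y_true, y_pred, eps=1e-7):
    """Sum binary cross entropy over labelled entries only.

    y_true: list of 0, 1, or None (None marks a missing label).
    y_pred: list of predicted probabilities, same length.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y is None:
            # Unlabelled class: contributes nothing to the loss
            continue
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

# Image with a cat and a dog, but only "cat" was labelled.
labels = [1, None]  # [cat, dog]
loss_says_dog = masked_bce(labels, [0.9, 0.95])
loss_says_no_dog = masked_bce(labels, [0.9, 0.05])
```

Both calls return the same value, since the dog prediction is never compared against anything.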

However, you are right: with a different cost function this may not hold, i.e. the network may be penalized unnecessarily for classes that have no label. Even then, a few missing labels are usually not a big issue, because the number of such examples is usually very small compared to the whole training set. It is almost like noise and won’t affect training much, which is why Andrew states it as a general rule.

You’ll also hear Andrew mention that even mislabelled data is usually not a big problem. He only suggests fixing it if it accounts for a large share of your errors, which you can estimate by analyzing around 100 misclassified examples from the dev set.


Hi @Jaskeerat, thanks for your reply.
You mentioned:

But I thought the data for a task would only have the label for that task. So say I have n tasks, each with m samples: the total number of images would be n * m. Using multi-task learning, all n * m samples are missing n-1 labels.

Aah, so what you meant is that all of your training data has some labels missing? I assumed you meant that during data acquisition and labelling some labels were missed, so most examples would be fully labelled and only a few would have a label or two missing, which is why I called it ‘noise’.

If you know that ALL of your data is missing labels for some classes, then you need to be careful about how you train for multi-class detection, and make sure your loss function does not severely penalize extra detections made by your network.
I am not sure how to make the best use of input data where every example has a few classes unlabelled for a multi-class detection task. I would appreciate it if a mentor weighed in on this.

Where would you even find training data in which every example is missing the label for a class? Normally this problem is uncommon and amounts to noise. Correct me if I’ve misunderstood anything.

I feel like each sample would only have a label for one task (and be missing the labels for the other n-1 tasks). This is why:

Andrew says multitask learning works when

  1. training from a set of tasks that could benefit from having shared lower-level features
  2. for each task, the amount of training data from the other tasks combined is a lot larger than the amount of training data for that particular task.

So it seems multi-task learning works because we get more data overall than we have for any individual task, and the data from the other tasks helps the model learn basic shared features.

Consider the situation where I have n tasks, each with m samples, and each sample only has the label for its own task. In this situation, multi-task learning gives me n * m samples to learn the basic features shared between the tasks. So multi-task learning makes sense.

Consider another situation, where each sample has all or most of the labels for the different tasks. Then I would already have close to n * m labelled samples for any particular task, so the above reasoning for multi-task learning no longer holds.
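The counting in the first situation can be checked with a small sketch. The function name `count_labels`, the example values n = 4 and m = 100, and the placeholder label value are assumptions chosen for illustration only.

```python
def count_labels(n_tasks, m_per_task):
    """Build a label table where each sample is labelled for exactly one task.

    Returns (total number of samples, number of labelled samples per task).
    """
    rows = []
    for task in range(n_tasks):
        for _ in range(m_per_task):
            row = [None] * n_tasks  # None marks a missing label
            row[task] = 1           # placeholder label for the sample's own task
            rows.append(row)
    total_samples = len(rows)
    labels_per_task = [
        sum(r[t] is not None for r in rows) for t in range(n_tasks)
    ]
    return total_samples, labels_per_task

total, per_task = count_labels(n_tasks=4, m_per_task=100)
# All `total` samples pass through the shared layers, while each task's
# output only receives a training signal from its own m labelled samples.
```

Here every sample is missing n-1 of its n labels, so a loss that skips unlabelled outputs is essential rather than a minor convenience.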