Loss function for multilabel classification

Hi! I wanted to know what a proper loss function is for multilabel classification. As far as I understand, we can use a layer of multiple nodes and use the sigmoid function as the activation function. But what loss function should we use for training? Would it be binary cross-entropy, and should we apply it to each of the nodes?
Another question I have is whether we can convert a multilabel classification problem into a multiclass one by mapping the array of label occurrences to its corresponding binary number. For example, [1, 0, 0, 1] would be encoded as nine (binary 1001), and then we would use softmax over those classes.

Yes — for a multi-binary-label problem, binary cross-entropy applied to each of the sigmoid nodes is the way to go. You might start experimenting with that loss function by supplying it with some toy `y_true` and `y_pred` values.
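To make that concrete, here is a minimal NumPy sketch (not tied to any particular framework) of per-node binary cross-entropy on toy `y_true` and `y_pred` values — the labels and predicted probabilities below are made up for illustration:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy, averaged over all label nodes."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Toy multilabel targets and sigmoid outputs for 4 independent label nodes
y_true = np.array([[1.0, 0.0, 0.0, 1.0]])
y_pred = np.array([[0.9, 0.1, 0.2, 0.8]])

loss = binary_cross_entropy(y_true, y_pred)
```

Each node is treated as its own binary problem, and the per-node losses are simply averaged, which is exactly why independent sigmoids pair naturally with this loss.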

Softmax outputs always sum to exactly one. Think about what we can expect the model to predict in the following two cases, under that constraint:

  1. that only one label is True.
  2. that three labels are True.

Softmax adds up to one :wink:
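You can see the constraint numerically. A small NumPy sketch with hypothetical logits for case 2, where three of four labels are True:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Even if the model assigns equally high logits to all three true labels,
# softmax must split the unit probability mass among them:
logits = np.array([5.0, 5.0, 5.0, -5.0])
p = softmax(logits)
# No true label can exceed ~1/3 here, while independent sigmoids
# could give each true label a probability near 1 simultaneously.
```

This is the core reason softmax fits multiclass (exactly one label true) but not multilabel outputs.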


Thanks for the response! I think my second question was a bit vague. Consider a multilabel classification problem where we want the answer to be [1, 0, 1, 0], meaning the first and third labels have occurred.

I’m asking whether we can change our output layer to a softmax with 2^(number of labels) nodes, where each node’s index, read as a binary number, indicates which labels have occurred. For example, in this case, we would like the node with index 10 (binary 1010) to have the highest probability, representing the array [1, 0, 1, 0] and giving us the same answer as a multilabel model.

I’m asking whether this conversion from a multilabel problem to a multiclass problem is valid, and whether it is computationally feasible to use in practice.
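Assuming the binary-number mapping described above (first label as the most significant bit), the conversion itself is a few lines, and the class count it implies is easy to tabulate:

```python
def labels_to_class(labels):
    """Map a multilabel vector, e.g. [1, 0, 1, 0], to a single class index
    by reading it as a binary number (first label = most significant bit)."""
    idx = 0
    for bit in labels:
        idx = idx * 2 + bit
    return idx

class_a = labels_to_class([1, 0, 1, 0])  # binary 1010 -> 10
class_b = labels_to_class([1, 0, 0, 1])  # binary 1001 -> 9

# The softmax output layer then needs 2 ** num_labels nodes:
growth = {n: 2 ** n for n in (4, 10, 20)}
```

The mapping is lossless and invertible, so the conversion is valid in principle; the `growth` table is where the feasibility concern comes from.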

Hello @Delaram_Fartoot,

You think differently :wink:

That is an option, and I believe you are aware that the number of nodes in the output layer grows with the number of label combinations (2^L for L labels). I think that is also why you were concerned about computational feasibility, and I share that concern. The additional memory required, compared to the method in your first question, is straightforward to estimate, so we can easily check whether the approach is feasible in terms of memory.
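For a rough estimate, a dense output head with hidden width H and N output nodes costs H·N + N parameters, so the sigmoid head scales with L while the combination-softmax head scales with 2^L. A quick sketch with a hypothetical H = 256 and L = 10:

```python
def head_params(hidden_dim, out_nodes):
    """Weights plus biases of a dense output layer."""
    return hidden_dim * out_nodes + out_nodes

H, L = 256, 10  # hypothetical hidden width and label count

sigmoid_head = head_params(H, L)        # L sigmoid nodes
softmax_head = head_params(H, 2 ** L)   # one softmax node per label combination

ratio = softmax_head / sigmoid_head  # roughly 2**L / L
```

With these numbers the softmax head needs about a hundred times more output-layer parameters, and the gap widens exponentially as L grows.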

I would be more concerned about whether we could gather enough data to reliably distinguish similar samples that share the same values in many, but not all, labels. In other words, with an exponentially growing number of trainable parameters, can the number of samples keep up? We don’t just need one more sample per additional trainable parameter. We would be lucky if the deciding features of each label were fairly independent of those of the other labels; otherwise, we might need far more samples to make the decision boundaries clear. Only experiments and performance metrics can tell how many samples are sufficient for the dataset in question, but once we have an idea of the necessary sample size, we can think about feasibility in terms of processing power and time.

The approach in your second question is neither cheaper nor simpler, so it wouldn’t be my first choice.


Thanks! I completely understand the necessity of having more samples for proper distinction. Once again, thank you for your time and help. :slightly_smiling_face: :blossom: