It seems like the answer to the first quiz of week 1 of course 3 doesn’t match the hint.

Maybe the answer is wrong?

In multi-label classification, multiple classes can be identified at the output of the model, but in multi-class classification only one class is identified at the output. So when the question says “identify all”, it means multiple labels at the output.
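As a quick sketch of that distinction (the logits and class count here are made up for illustration): softmax commits to exactly one class, while independent sigmoids can fire on several classes at once.

```python
import numpy as np

# Hypothetical logits for one image over 3 classes (made-up numbers).
logits = np.array([2.0, 0.5, -1.0])

# Multi-class: softmax forces the probabilities to sum to 1,
# and exactly one class is picked at the output.
softmax = np.exp(logits) / np.exp(logits).sum()
multi_class_pred = int(np.argmax(softmax))      # a single class index

# Multi-label: one independent sigmoid per class; each can fire on its own,
# so "identify all" can yield several 1s at once.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
multi_label_pred = (sigmoid > 0.5).astype(int)

print(softmax.sum())          # 1.0 (up to float rounding)
print(multi_class_pred)       # 0
print(multi_label_pred)       # [1 1 0]
```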

I’ll just leave my two cents here as an intuitive description. Correct me if you find something wrong.

First of all, if I saw this question I would also select True. After a while I understood the question; actually, my confusion was language-related.

Suppose you have a mini-batch from the training dataset:

`X` is two rows of flattened images:

```
[[.1,.2,.3,.4,.4],
[.1,.2,.3,.4,.4]]
```

`Y` is the two labels for them; suppose we have two classes of image, so:

```
[[1],
[1]]
```

The possible `Y_hat` might look like this or similar:

```
[[0.97],
[0.88]]
```

This is multi-class. It’s the classic coin game (each example is a Bernoulli outcome).

`max{p(img_class_1=1|v) + p(img_class_2=1|v)}`

On the other side, for multi-label:

The `Y` needs to be:

```
[[0, 1],
[0, 1]]
```

The possible `Y_hat` might look like this or similar:

```
[[0.23, 0.97],
[0.11, 0.88]]
```

This is multi-label. <------ This covers the statement: identify all different items.

So `Y_hat` has two units that don’t exclude each other; they are independent events. The algorithm (sigmoid on two units, it seems) is trying to reach

`max{p(img_class_1=1|v) * p(img_class_2=1|v)}`
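The product in that objective can be checked numerically against the `Y_hat` example above (just a sketch reusing those illustrative numbers):

```python
import numpy as np

# The Y_hat from the multi-label example: two examples, two independent units.
y_hat = np.array([[0.23, 0.97],
                  [0.11, 0.88]])

# Under the independence assumption, the joint probability that both
# labels equal 1 is the product of the per-unit probabilities.
joint = y_hat.prod(axis=1)
print(joint)   # ≈ [0.2231 0.0968]
```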

The descriptions seem mostly right to me, but the equations don’t seem to be right.

For multi-class, the probabilities sum to 1 and you choose the class with the maximum probability, not the max of a sum.

For multi-label, each output can range from 0 to 1 (if, of course, you use sigmoid), and there should be no need to take any maximum, because every class is independent of the others; they don’t exclude each other. This is my understanding.

Good to know.

I will take these questions with me as I continue to read some of the documentation. I’d really appreciate some feedback from you. Right or wrong, it’s good to give advice, thanks.

In real projects, in my experience, multi-label seems more useful, especially when there are different classes on the same object, e.g.

`x: face photo, y: woman/man, with/without mask, long/short hair`
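As a sketch of how such attribute predictions might be read out (the attribute names and probabilities here are hypothetical), each sigmoid output is thresholded independently:

```python
import numpy as np

# Hypothetical per-attribute sigmoid outputs for one face photo.
attributes = ["woman", "mask", "long_hair"]
y_hat = np.array([0.91, 0.08, 0.77])

# Each attribute is decided on its own, with no competition between them.
predicted = [name for name, p in zip(attributes, y_hat) if p > 0.5]
print(predicted)   # ['woman', 'long_hair']
```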

Definitely a good way to learn. I am also learning here, and when I give my opinion it may not always be correct, but it’s a way of discussing points of view and delving deeper into the subject.

Hey, I quickly checked page 9 in http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes2.pdf ; it describes Naive Bayes, which addresses the story of multiple independent events.

I think for multi-label, the goal of the algorithm is to reach

eq := `maxOf{p(img_class_1=1|v) * p(img_class_2=1|v) * .......} `
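One small note that may help: maximizing that product is the same as minimizing the negative sum of the logs, which is the usual per-label binary cross-entropy when all targets are 1. A quick numeric check (the probabilities are made up):

```python
import numpy as np

y_hat = np.array([0.97, 0.88, 0.75])   # made-up per-label probabilities

product = y_hat.prod()                 # the objective as a product
neg_log_sum = -np.log(y_hat).sum()     # the same objective in log space

# -log turns the product into a sum, so the two objectives agree exactly.
print(np.isclose(-np.log(product), neg_log_sum))   # True
```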

According to the homework solution stanford-CS229/3_Gaussian_Discriminant_Analysis.ipynb at master · ccombier/stanford-CS229 · GitHub:

In other words, Bayes’ rule for just one event can be derived into logistic form.

If multi-label is applied, say we have 3 labels, I would put 3 sigmoid units in the last layer, or let’s say 3 logistic functions. Each unit tries to reach its max so that the **eq** reaches its max.
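A minimal sketch of such a last layer (the weights are random placeholders; a trained network would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Last layer only: 4 hidden features in, 3 independent sigmoid units out.
W = rng.normal(size=(4, 3))
b = np.zeros(3)

h = rng.normal(size=(2, 4))    # hidden activations for 2 examples
y_hat = sigmoid(h @ W + b)     # shape (2, 3): one probability per label

print(y_hat.shape)             # (2, 3)
```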

WDYT? Maybe I am wrong; I just want to exchange ideas on this.

No, their equation is definitely right; I can’t argue with that. It also makes sense because, independently, each label’s probability tends to become higher, so the product of them will be higher too.