C2_W2_Multiclass_TF - Output layer explanation

In the optional lab C2_W2_Multiclass_TF, the following explanation is given for layer 2 at the end of the notebook:

One other aspect that is not obvious from the graphs is that the values have been coordinated between the units. It is not sufficient for a unit to produce a maximum value for the class it is selecting for, it must also be the highest value of all the units for points in that class.

What exactly does “not sufficient for a unit to produce a maximum value for the class” mean? The unit outputs one value, so calling it “producing a maximum value” is a bit confusing, as you can’t choose a maximum from just one value. Does it instead refer to the fact that you don’t have a cut-off from which you decide on e.g. a binary category? Instead, you choose the category from the maximum of all unit outputs.

Hello @Oliver_Schneider,

Over a fixed set of samples, a unit produces values in a certain range, right? Let’s say:

unit 0’s range is 0 to 3
unit 1’s range is 0 to 100
unit 2’s range is 0 to 100
unit 3’s range is 0 to 100

Then,

not sufficient for a unit to produce a maximum value for the class

means that even if unit 0 gives 3 (its maximum) to a class 0 sample, that is not sufficient if another unit gives it a 5 (even though 5 is considered small within the other units’ ranges).

The idea is that unit 0’s output has to be larger than the other units’ outputs for class 0 samples.
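
To make it concrete, here is a minimal sketch with made-up numbers (not from the lab): unit 0 is at its own maximum, but another unit’s larger value still wins the argmax.

import numpy as np

z = np.array([3.0, 5.0, -10.0, -20.0])   # linear outputs of units 0..3 for a class 0 sample
print(np.argmax(z))                       # prints 1, not 0, even though 3 is unit 0's maximum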

Cheers,
Raymond


Hi Raymond,

Thanks a lot for the clarification; this coincides with my reasoning.

I think I was mainly irritated by the wording “not sufficient”. It is not only not sufficient, it has nothing to do with the classification, as only the maximum over all units counts. (For example, it is also not sufficient that the output value is divisible by 2 or is a prime number, but that has nothing to do with the classification either.)

Dear Mr Raymond,

Could you please provide an example of how to increase a unit’s range so that the unit’s output for a class is larger than the other units’ outputs for samples of that class?

*Based on the example (figure below) given in this lab, I am confused about how to set up or even increase the range of a unit.

Unit 0’s range is -120 to 0
Unit 1’s range is -40 to 20
Unit 2’s range is -40 to 0
Unit 3’s range is 0 to 15

*Please correct me if I have any misunderstanding of this concept.

Thank you

Hello @JJaassoonn,

Did you verify that your ranges are correct? Take unit 2 as an example: even though -40 and 0 are readable from the scale, does that mean -40 and 0 are the minimum and the maximum respectively? If you haven’t verified it, would you like to first think about how to verify the ranges, so that we can base our discussion on some verified numbers?

Let me know.

Cheers,
Raymond

PS: you can take the trained weights and biases out of layer 2 with model.layers[1].trainable_weights.
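
For example, a rough sketch of how you could verify the ranges (assuming the trained model is model, the inputs are X_train, and the layers are named "L1" and "L2"; adjust to the lab’s actual variable names):

import numpy as np
import tensorflow as tf

# Layer-1 activations for all training samples
l1_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer("L1").output)
a1 = l1_model(X_train).numpy()

# Linear output of layer 2: one column per unit
W2, b2 = model.get_layer("L2").get_weights()
z2 = a1 @ W2 + b2

print(z2.min(axis=0))   # observed minimum of each unit over the samples
print(z2.max(axis=0))   # observed maximum of each unit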

Dear Mr Raymond,

Sorry for my mistake. I have updated the figure as below.

Unit 0’s range is -40 to 0
Unit 1’s range is -20 to 10
Unit 2’s range is -20 to 10
Unit 3’s range is -2 to 8

These are the results from model.layers[1].trainable_weights

W2 → [ [-2.01 -3.07 1.3 0.33]
[-2.83 1.09 -1.89 0.69] ]

b2 → [ 3.18 0.21 -1.31 -2.61]

[<tf.Variable 'L2/kernel:0' shape=(2, 4) dtype=float32, numpy=
array([[-2.01, -3.07,  1.3 ,  0.33],
       [-2.83,  1.09, -1.89,  0.69]], dtype=float32)>,
 <tf.Variable 'L2/bias:0' shape=(4,) dtype=float32, numpy=array([ 3.18,  0.21, -1.31, -2.61], dtype=float32)>]

Interesting. The scales look strange to me.

So you are sure that the ranges are correct now, right? Note that I expect you to verify your numbers.

Would you mind elaborating your question with some examples that use the numbers from your ranges? For example, how do the ranges contradict your assumption (and what is your assumption)?

Thanks,
Raymond

Btw, @JJaassoonn, please take your time. I won’t be able to respond any time soon since I have a few commitments ahead. Take your time to verify the numbers and please help by elaborating your question as I have asked. I need to know where the contradiction happened.

Cheers,
Raymond

Dear Mr Raymond,

Please advise me whether my understanding is correct.

These are the results from model.layers[1].trainable_weights

W2 = [ [-2.01 -3.07 1.3 0.33]
[-2.83 1.09 -1.89 0.69] ]

b2 = [ 3.18 0.21 -1.31 -2.61]

For Linear Output Unit 0,
W2[0,0] = -2.01
W2[1,0] = -2.83
b2[0] = 3.18

The output of Unit 0 is
a0 * W2[0,0] + a1 * W2[1,0] + b2[0],
which is in the approximate range of -50 to 5 indicated by the color bar.

Unit 0 will produce its maximum value for inputs near (0, 0), where class 0 (C0, blue color) has been mapped.

Units 1, 2 and 3 will produce their maximum values at other locations because of their different W2 and b2 values (a quick numerical check follows the unit values below).

Unit 1:
W2[0,1] = -3.07
W2[1,1] = 1.09
b2[1] = 0.21

Unit 2:
W2[0,2] = 1.3
W2[1,2] = -1.89
b2[2] = -1.31

Unit 3:
W2[0,3] = 0.33
W2[1,3] = 0.69
b2[3] = -2.61
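
A quick numerical check of the formula above, using the W2 and b2 values posted earlier (rounded as in the post):

import numpy as np

W2 = np.array([[-2.01, -3.07, 1.30, 0.33],
               [-2.83, 1.09, -1.89, 0.69]])
b2 = np.array([3.18, 0.21, -1.31, -2.61])

a = np.array([0.0, 0.0])   # an example layer-1 activation (a0, a1) near (0, 0)
z = a @ W2 + b2            # linear outputs of units 0..3
print(z)                   # [ 3.18  0.21 -1.31 -2.61] -> unit 0 is the largest here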

“One other aspect that is not obvious from the graphs is that the values have been coordinated between the units. It is not sufficient for a unit to produce a maximum value for the class it is selecting for, it must also be the highest value of all the units for points in that class.”, quoted from C2_W2_Multiclass_TF lab material

From the first sentence,
“One other aspect that is not obvious from the graphs is that the values have been coordinated between the units.”

  1. May I have an example of “the values have been coordinated between the units”? I thought each unit has its own values, unrelated to the other units, because of its different W2 and b2.

From the second sentence,
“It is not sufficient for a unit to produce a maximum value for the class it is selecting for”

  1. May I know why it is not sufficient? Each unit has its own W2 and b2 to calculate its range of output. For example, Unit 0 selects class 0 by calculating a range of output from a minimum value of -50 to a maximum value of 5. ← It is sufficient

Thank you

Hello @JJaassoonn

-50 to 5 is a good approximation. To get the more precise values (which is actually not quite necessary up to this point), we can substitute a0 = a1 = 0 to get 3.18, and substitute a0 = 8 and a1 = 10 to get -41.2. Therefore, the range is -41.2 to 3.18.
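
As a quick check of those two numbers with the posted weights:

w00, w10, b0 = -2.01, -2.83, 3.18
print(0 * w00 + 0 * w10 + b0)    # 3.18         (a0 = a1 = 0)
print(8 * w00 + 10 * w10 + b0)   # about -41.2  (a0 = 8, a1 = 10)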

Again, it is not necessary to be that precise up to this point… We only need to be precise when the ranges are too close.

Your statement about unit 0 is correct and important!

We want unit 0’s output to be the largest for samples of class 0. For unit 0 to be the largest, just seeing that the blue dots are at the lower corner is not enough; instead, we also want to see that unit 1’s, 2’s and 3’s outputs are smaller in value than unit 0’s output.

For unit 3, it is quite clear that class 0 samples are located in the whitest corner, which looks to be smaller than 0, and if that is the case, then unit 0’s output for samples of class 0 is larger than unit 3’s (which I will leave to you to verify :wink: )

For units 1 and 2, it is less clear just from the graphs, but you can verify them from their values.

I have explained this in the example above. If you want unit 0 to be the largest of all units, you not only need unit 0 to be large, you also need units 1, 2 and 3 to be small.

Remember, you need unit 0 to be the largest among all units, not just large relative to its own range.
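
If you want to check this numerically rather than from the graphs, here is a possible sketch (reusing z2 from the earlier sketch, i.e. z2 = a1 @ W2 + b2, and assuming the labels are in y_train; adjust to the lab’s variable names):

import numpy as np

class0 = (y_train == 0)                  # boolean mask for class 0 samples
winners = np.argmax(z2[class0], axis=1)  # which unit is largest for each class 0 sample
print(np.mean(winners == 0))             # should be close to 1.0 if unit 0 wins for class 0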

There is no simple example for that, but there is an explanation.

The set of weights and biases that you copied into your post was the result of the model training, right? And they were already the “coordinated” result. The forces behind that coordination are our friends: the softmax and the cost function.

The softmax and the cost function together make sure that when a sample of class 0 comes in during the training process, unit 0’s output will be maximized while units 1, 2, and 3’s outputs are minimized. To see this, you can compute the gradients with respect to all 4 units’ outputs (which I have shown here).
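
Here is a small sketch of that gradient for one made-up class 0 sample (for softmax + cross-entropy, the gradient with respect to each linear output z_j is p_j - y_j):

import numpy as np

z = np.array([1.0, 2.0, 0.5, -1.0])   # made-up linear outputs for one class 0 sample
y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label for class 0
p = np.exp(z) / np.exp(z).sum()       # softmax probabilities
print(p - y)                          # negative for unit 0 (its output gets pushed up),
                                      # positive for units 1-3 (their outputs get pushed down)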

Cheers,
Raymond

Dear Mr Raymond,

Thank you so much for helping me to understand the theory behind the scenes.

You are welcome, @JJaassoonn

Cheers,
Raymond

Hi,
In the Explanation for Layer 1, why isn't b[0] added to the ReLU input?
[screenshot of the Layer 1 explanation equation from the notebook]


Hello, @nirdo,

There should be a bias term. I think it was just left out of the equation. I will share this with the course team.
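
For reference, with the bias included, layer 1’s output would be computed like this (a sketch with made-up numbers, not the lab’s trained values):

import numpy as np

X = np.array([[1.0, 2.0]])                  # one sample with 2 features
W1 = np.array([[0.5, -0.3], [0.1, 0.8]])    # weights of the 2 layer-1 units
b1 = np.array([0.2, -0.1])                  # the bias term that was missing from the equation
a1 = np.maximum(0, X @ W1 + b1)             # ReLU(X W1 + b1)
print(a1)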

Cheers,
Raymond