# Multiclass - class values

In multi-class classification (e.g., the classification problem given in the assignment: identifying the number from a given set of images):

I understand that we know there are only 10 possibilities, and hence the output layer will have 10 neurons. But where is the assignment that a[0] is class 1 (digit 0) and a[4] is class 5 (digit 5) done?

Hello @Suresh23,

Those assignments are Tensorflow's assumptions, so they cannot be changed as long as you use the loss functions that Tensorflow prepares for you.

Raymond


The labeling is performed when the dataset is created.

Hi @Suresh23 ,

Your assumption that a[0] is class 1 … a[4] is class 5, etc., is not exactly right. Maybe the examples show it this way to facilitate learning. The process goes more or less like this:

We define a 10-unit layer as the output. We expect that each one of these units will represent one of the classes (for instance, in the 0-9 numbers, each unit will represent one number). In other words, the output will be a vector of probabilities representing the likelihood that each class is present in the input data.
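To make that concrete, here is a minimal NumPy sketch with made-up logits: softmax turns them into the probability vector described above, and the predicted class is read off with argmax.

```python
import numpy as np

# Hypothetical logits from a 10-unit output layer for one image.
logits = np.array([2.1, -1.0, 0.3, 5.2, -0.5, 1.1, 0.0, -2.2, 0.7, 1.9])

# Softmax converts the logits into a probability vector over the 10 classes.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Each position holds the probability that the input belongs to that class.
predicted_class = int(np.argmax(probs))
print(predicted_class)  # 3
```
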

Which one of these will be representing number 1? Well, since the classifier is designed to classify digits, the output vector will contain 10 elements, where each element corresponds to the probability that the input image contains a particular digit. In this case, the position of the 1 in the output vector would not be fixed, but would depend on the predicted class.

I hope this expands your intuition on this matter.

Thoughts?

Juan


Thank you @Juan_Olano.
ââŚ, the position of the 1 in the output vector would not be fixed, but would depend on the predicted class.â

This may need further clarification. Asking again with code and output from C2_W2_SoftMax, for multiclass classification with 4 classes:

```python
# From the C2_W2 SoftMax lab: preferred_model and X_train are defined
# earlier in the notebook.
p_preferred = preferred_model.predict(X_train)
sm_preferred = tf.nn.softmax(p_preferred).numpy()

for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
```

Here the selected category is the index (of the neuron in the output layer) with the max value / highest probability.

`p_preferred` (logits) : probabilities, predicted category:

```
[-3.47 -2.77  3.85 -2.25 ] : [6.58e-04 1.32e-03 9.96e-01 2.24e-03], 2
[-2.28  5.07 -1.73 -1.07 ] : [6.42e-04 9.96e-01 1.11e-03 2.15e-03], 1: 1
[ 6.39 -4.34 -4.71 -14.5 ] : [1.00e+00 2.19e-05 1.52e-05 8.45e-10], 0: 0
[-1.52  4.6  -1.93 -2.25 ] : [0. 1. 0. 0.], 1: 1
[-6.85  1.95  2.84  8.58 ] : [1.98e-07 1.32e-03 3.19e-03 9.95e-01], 3: 3
[-1.7   2.32 -0.39 -0.13 ] : [0.02 0.85 0.06 0.07], 1: 1
[-4.24 -3.44  4.71 -2.78 ] : [1.29e-04 2.87e-04 9.99e-01 5.58e-04], 2: 2
```

So the question remains the same: how/where do we make the class-to-neuron assignment? If I understand your answer correctly, this assignment is not guaranteed.


I asked again with more details.

The question is about how it is that we assume the index of the neuron in the output layer represents a specific class. In the code above, it seems like neuron 1 represents the probability of 0, and so on. Of course, we select the index of the highest probability as the prediction.

The neural network was trained on the target variable y. So, whatever was the order maintained in y will become the standard for the trained model.

As an example: in the training set, y[2] represents class 1, which is digit 0. So, in simple terms, if y[2] = 1, then the input image was that of digit 0. Now, when we do the prediction, if a[2] has the highest probability, then it means the input image fed to the model was that of digit 0.

P.S. I have purposely complicated things by setting y[2] as class 1 → digit 0… just to show that we have the liberty to decide this order in the target variable y. However, once we set this order in the target variable y, from there on the same order applies while making the prediction.

I donât think this sounds right.
y and a are of different size. y is training set of m examples, where âaâ in array of size ânâ == no_of_neurons_in_output_layerâ == âno_of_target_categoriesâ.
Assuming m=100 and n=10 - then you will see a y[90] but these is no corresponding a[90].

Let me try to explain this in another way:

We have 10 classes, then:

1. itâs our freedom and responsibility to assign each class to a class label number. The numbers must start from zero, then one, until nine.

2. we need the output layer to have 10 neurons

Then, itâs the language in Tensorflow that the zeroth neuron corresponds to the zeroth class, and the n-th neuron corresponds to the n-th class. These relations canât be changed, and donât need to be configured because they are the only default.

No matter how many samples you pass to the model for predictions, if, for a sample, the n-th neuron's output probability is the largest among all 10 neurons, then the n-th class is the prediction for that sample.
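This prediction rule (the index of the largest probability is the class label) can be sketched in NumPy; the probability rows below are made up for illustration:

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes; each row sums to 1.
probs = np.array([
    [0.10, 0.05, 0.80, 0.05],  # neuron 2 has the largest probability
    [0.70, 0.10, 0.10, 0.10],  # neuron 0 has the largest probability
    [0.05, 0.05, 0.10, 0.80],  # neuron 3 has the largest probability
])

# Since neuron n corresponds to class label n, the predicted class per
# sample is simply the column index of the largest probability in its row.
predictions = np.argmax(probs, axis=1)
print(predictions.tolist())  # [2, 0, 3]
```
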

Raymond


@Suresh23

Continuing my above reply, if we are classifying cats, dogs, and tigers. Our freedom is to choose what class label 0 represents.

Class label 0 can represent cats if we like it to, or dogs if we prefer. This is a decision to make.

However, neuron 0 must correspond to class label 0. This is fixed by Tensorflow.

Tensorflow assumes the n-th neuron to be class label number n. We are just following Tensorflow's assumption.

Or do you want to know how Tensorflow wrote the code to make this assumption happen? Do you want to analyze some of Tensorflow's code?

Raymond

Thanks @rmwkwok .
If you are referring to assigning numeric values to Y and feeding it in for training, I understand. I can choose arbitrarily any numbers (but in a sequence) to represent categories such as "cat" or "shirt", etc.

I ran some experiments and found that neuron_index == class_code.

Iâm trying get under the hood and understand the math/logic behind it and develop an intuition. Andres explanation of softmax function for multi-class, briefly touched it, where he mentions, loss function is

Let's say we were to implement the NN in pure Python and NumPy only; how would I make this assignment?

I initially assumed this may be achieved through one-hot encoding of the Y; that is, [1,3,1] will be converted to [[0,1,0,0], [0,0,0,1], [0,1,0,0]] (assuming 4 classes) and this is used to calculate the loss.
I looked into the Tensorflow code; this seems to be partially true, depending on the loss function we use.
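As a minimal NumPy sketch of that one-hot conversion (assuming 4 classes, so labels run 0-3):

```python
import numpy as np

# Hypothetical integer class labels for 3 samples, with 4 classes (0-3).
y = np.array([1, 3, 1])
num_classes = 4

# One-hot encode: row i gets a 1 in column y[i], zeros elsewhere.
Y_onehot = np.zeros((y.size, num_classes))
Y_onehot[np.arange(y.size), y] = 1.0

print(Y_onehot.astype(int).tolist())  # [[0, 1, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```
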

But I get it: these details are beyond the scope of this course, and I can still work with the assumption you mentioned.

Hello @Suresh23,

Itâs alright we can go into the code, and thatâs why I asked. I didnât explain it that way because it is too much details and I want to go a smaller step at a time. Give me some time and I will try to get you something.

Raymond


Hello @Suresh23,

Here we go.

Before we start, we need to note that there are 2 ways to present `y_true` when computing the Categorical Crossentropy and they should yield the same result.

The first way is to present `y_true` as a class label number (e.g. `3`). The second way is to present `y_true` as a one-hot-encoded vector (e.g. `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`; note the 3rd (zero-based) element is 1, which represents class label 3). We will talk about the second way.

We also need to note that there are 2 ways to present `y_pred`. The first way is to present `y_pred` as a vector of logits (without softmax activation applied in the output layer). The second way is to present `y_pred` as a vector of probabilities (with softmax activation applied in the output layer). We will talk about the second way.
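As a side check (a minimal NumPy sketch with made-up probabilities), the two ways of presenting `y_true` yield the same cross-entropy loss for a given probability-vector `y_pred`:

```python
import numpy as np

# Hypothetical softmax output (probabilities) for one sample, 4 classes.
p = np.array([0.1, 0.2, 0.6, 0.1])

# Way 1: y_true given as a class label number.
label = 2
loss_from_label = -np.log(p[label])

# Way 2: y_true given as a one-hot vector with a 1 at index 2.
onehot = np.array([0.0, 0.0, 1.0, 0.0])
loss_from_onehot = -np.sum(onehot * np.log(p))

print(np.isclose(loss_from_label, loss_from_onehot))  # True
```
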

Now itâs time for the code to see exactly how Tensorflow assumes the n-th neuron to be class label number n! The tensorflow code below comes from here.

```python
return -tf.reduce_sum(target * tf.math.log(output), axis)
```

`output` means `y_pred`. `target` means `y_true`. OK?

Considering we have one sample, then:

1. `output` can be thought of as a vector of 10 values (the predicted probabilities)

2. `target` can be thought of as a vector of 10 values where we know that if the 3rd element is `1`, then the vector represents that it is class label number 3, because WE PROVIDED the `target`.

3. `target * tf.math.log(output)` is an element-wise multiplication, which means that the 3rd element in `target` is multiplied with the 3rd element in `output` (after taking log).

4. Remember our `y_true` is one-hot-encoded, so ONLY one element is a `1` and the rest are `0`; after the element-wise multiplication, only the 3rd element remains and the others are all zeros, OK?

5. Now look at this part of the slide, at 7:45 in this video:

6. The slide says the loss picks only one of the log(a_n) terms: the one where y=n. Now our y=3, so we pick log(a_3), and the above element-wise multiplication does the picking for us because it made all elements other than the 3rd zero! (Do you see this?)

7. Then we do `tf.reduce_sum`, and since only the 3rd element is non-zero, the sum equals the 3rd element, and we finish the "picking" process presented in that part of the slide.

8. Donât forget that, at the same time, we have implicitly ASSUMED that the 3rd element in `output` vector to be corresponding to y=3, otherwise, we WOULDNâT HAVE multiplied the 3rd element in the `output` vector to the 3rd element in the `target` vector to implement the part of the slide above. The ELEMENT-WISE MULTIPLICATION is the key take-away! Think again? Why do we do the element-wise multiplication? What do we imply when using it?

This is the end. Now, I said there are 2 ways to present `y_true` and 2 ways to present `y_pred`, so there are a total of 4 implementations. Again, I want to take a smaller step at a time, so we only talked about one of the four. However, the four implementations are different but similar, and the logic is the SAME.

Thoughts?
Raymond


@Suresh23, if I were you, after reading my above reply, I would want to further the discussion and ask more questions, or perhaps even try to implement something myself (in numpy, for example). But I will not assume you want to ask them and answer them before you actually ask. My above reply is long enough, isn't it? So, I rely on you to tell me what you want to know, in order for this discussion to be something you most need.

If you think there is something you can't find in my last reply, make it clear to me and then I will see what I can get you.

Raymond


Hello @Suresh23

The example that I cited above is of 1 sample, not of m samples.

If you take a "single" training sample, assuming one-hot encoding, y will be a 1-d array of 10 elements. One of those 10 elements will have a value of 1 and the remaining will be 0 (since it's one-hot encoding). It is here that we set the order, or class assignment.

The y[2] that I mentioned earlier has to be viewed in the context of a single sample, not in the context of m samples. If you have set y[2] = 1 (and the remaining 9 elements to 0) to represent digit 0, and y[4] = 1 (and the remaining 9 elements to 0) to represent digit 7, then this becomes the class assignment. Of course, we can also keep it simple: y[0] = 1 represents digit 0, y[1] = 1 represents digit 1, and so on…

Now, when you use the model to predict, we follow the same class assignment that was provided in y, which the model internalized during training.


Thank you for this explanation @rmwkwok. I'm going to try some of this logic myself before I ask more questions. Will come back after the holidays.


Sure. You may try to implement the other three versions, and we can look at your work together if there are any questions!
