In multiclass classification (e.g. the classification problem given in the assignment), we identify the number from a given set of images.
I understand that we know there are only 10 possibilities, and hence the output layer will have 10 neurons. But where is the assignment that a[0] is class 1 (digit 0) and a[4] is class 5 (digit 5) done?
Hello @Suresh23,
Those assignments are Tensorflow's assumptions, so they cannot be changed as long as you use the loss functions that Tensorflow prepares for you.
Raymond
The labeling is performed when the dataset is created.
Hi @Suresh23 ,
Your assumption that a[0] is class 1 … a[4] is class 5, etc., is not exactly right. Maybe the examples show it this way to facilitate the learning. The process goes more or less like this:
We define a 10-unit layer as the output. We expect that each one of these units will represent one of the classes (for instance, for the digits 0-9, each unit will represent one number). In other words, the output will be a vector of probabilities representing the likelihood that each class is present in the input data.
Which one of these will represent number 1? Well, since the classifier is designed to classify digits, the output vector will contain 10 elements, where each element corresponds to the probability that the input image contains a particular digit. In this case, the position of the 1 in the output vector would not be fixed, but would depend on the predicted class.
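If it helps, here is a small NumPy sketch of that idea; the logit values are made up purely for illustration:

```python
import numpy as np

# Made-up raw outputs (logits) of a 10-unit output layer for one image
logits = np.array([1.2, 0.3, 5.1, -0.7, 0.0, 2.4, -1.1, 0.8, 0.2, -0.5])

# Softmax turns the logits into a probability vector that sums to 1
probs = np.exp(logits) / np.sum(np.exp(logits))

# The index of the largest probability is taken as the predicted class
predicted_class = int(np.argmax(probs))

print(predicted_class)        # 2, because logits[2] = 5.1 is the largest
print(round(probs.sum(), 6))  # 1.0
```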
I hope this expands your intuition on this matter.
Thoughts?
Juan
Thank you @Juan_Olano.
"…, the position of the 1 in the output vector would not be fixed, but would depend on the predicted class."
This may need further clarification, so I'm asking again with code and output from C2_W2_SoftMax for multiclass classification with 4 classes:
```python
p_preferred = preferred_model.predict(X_train)
sm_preferred = tf.nn.softmax(p_preferred).numpy()
for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
```
Here the category selected is the index (of the neuron in the output layer) with the max value / highest probability.
```
p_preferred (logits)      : sm_preferred (probabilities)          : pred_class
[-3.47 -2.77  3.85 -2.25] : [6.58e-04 1.32e-03 9.96e-01 2.24e-03] : 2
[-2.28  5.07 -1.73 -1.07] : [6.42e-04 9.96e-01 1.11e-03 2.15e-03] : 1
[ 6.39 -4.34 -4.71 -14.5] : [1.00e+00 2.19e-05 1.52e-05 8.45e-10] : 0
[-1.52  4.6  -1.93 -2.25] : [0. 1. 0. 0.]                         : 1
[-6.85  1.95  2.84  8.58] : [1.98e-07 1.32e-03 3.19e-03 9.95e-01] : 3
[-1.7   2.32 -0.39 -0.13] : [0.02 0.85 0.06 0.07]                 : 1
[-4.24 -3.44  4.71 -2.78] : [1.29e-04 2.87e-04 9.99e-01 5.58e-04] : 2
```
So the question remains the same: how/where do we make the class-to-neuron assignment? If I understand your answer, it is not guaranteed that the neuron index equals the class code.
Check Juan's answer below. It seems my assumption (or Tensorflow's, according to your answer) is incorrect.
I asked again with more details.
I'm not asking about the y values in the training set.
The question is about how we assume the index of the neuron in the output layer represents a specific class. In the code below, it seems like neuron 1 represents the probability of 0, and so on. Of course, we select the index of the highest probability as the prediction.
Thanks in advance.
The neural network was trained on the target variable y. So, whatever order was maintained in y becomes the standard for the trained model.
As an example: in the training set, y[2] represents class 1, which is digit 0. So, in simple terms, if y[2] = 1, then the input image was that of digit 0. Now, when we do the prediction, if a[2] has the highest probability, then it means the input image fed to the model was that of digit 0.
P.S. I have purposely complicated things by setting y[2] as class 1 → digit 0… just to show that we have the liberty to decide this order in the target variable y. However, once we set this order in the target variable y, from there on the same order applies while making the prediction.
I donât think this sounds right.
y and a are of different sizes. y is the training set of m examples, whereas 'a' is an array of size 'n' == no_of_neurons_in_output_layer == no_of_target_categories.
Assuming m=100 and n=10, you will see a y[90], but there is no corresponding a[90].
Let me try to explain this in another way:
We have 10 classes, then:

- it's our freedom and responsibility to assign each class to a class label number. The numbers must start from zero, then one, up to nine.
- we need the output layer to have 10 neurons.

Then, it's the convention in Tensorflow that the zeroth neuron corresponds to the zeroth class, and the nth neuron corresponds to the nth class. These relations can't be changed, and don't need to be configured, because they are the one and only default.
No matter how many samples you pass to the model for predictions, if, for a sample, the nth neuron's output probability is the largest among all 10 neurons, then the nth class is the prediction for the sample.
Raymond
Continuing my reply above: if we are classifying cats, dogs, and tigers, our freedom is to choose what class label 0 represents.
Class label 0 can represent cats if we like, or dogs if we prefer. This is a decision to make.
However, neuron 0 must correspond to class label 0. This is fixed by Tensorflow.
Tensorflow assumes the nth neuron to be class label number n. We are just following Tensorflow's assumption.
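To make the two points concrete, here is a tiny NumPy sketch (the class names and the probability values are made up):

```python
import numpy as np

# Our freedom: decide which class label number represents which class
class_names = ['cat', 'dog', 'tiger']    # label 0 = cat, 1 = dog, 2 = tiger

# Fixed by the framework: neuron i always corresponds to class label i
probs = np.array([0.1, 0.7, 0.2])        # made-up softmax output of 3 neurons
label = int(np.argmax(probs))            # neuron 1 has the largest probability

print(label, class_names[label])         # 1 dog
```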
Or do you want to know how Tensorflow wrote the code to make this assumption happen? Would you like to analyze some of Tensorflow's code?
Raymond
Thanks @rmwkwok .
If you're referring to assigning numeric values to Y and feeding it for training, I understand. I can arbitrarily choose any number (but in a sequence) to represent a category such as 'cat' or 'shirt', etc.
I ran some experiments and found that neuron_index == class_code.
I'm trying to get under the hood, understand the math/logic behind it, and develop an intuition. Andrew's explanation of the softmax function for multiclass briefly touched on it, where he mentions the loss function.
Let's say we were to implement the NN in pure Python and NumPy only; how do I make this assignment?
I initially assumed this may be achieved through one-hot encoding of the Y, that is,
[1,3,1] will be converted to [[0,1,0,0], [0,0,0,1], [0,1,0,0]], and we use this to calculate the loss.
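Something like this is what I had in mind; a common NumPy idiom for the conversion (the helper name is mine, and I'm assuming 4 classes):

```python
import numpy as np

def one_hot(labels, num_classes):
    # Indexing the identity matrix by the labels yields one-hot rows
    return np.eye(num_classes)[labels]

y = np.array([1, 3, 1])
Y = one_hot(y, num_classes=4)
print(Y)
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]
```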
I looked into the Tensorflow code; this seems to be partially true. It depends on the loss function we use.
But I get it, these details are beyond the scope of this course, and I can still work with it under the assumption you mentioned.
Hello @Suresh23,
It's alright, we can go into the code, and that's why I asked. I didn't explain it that way at first because it involves too many details, and I want to go a smaller step at a time. Give me some time and I will try to get you something.
Raymond
Hello @Suresh23,
Here we go.
Before we start, we need to note that there are 2 ways to present `y_true` when computing the Categorical Crossentropy, and they should yield the same result. The first way is to present `y_true` as a class label number (e.g. `3`). The second way is to present `y_true` as a one-hot-encoded vector (e.g. `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`; note the 3rd (zero-based) element is 1, which represents class label 3). We will talk about the second way.
We also need to note that there are 2 ways to present `y_pred`. The first way is to present `y_pred` as a vector of logits (without softmax activation applied in the output layer). The second way is to present `y_pred` as a vector of probabilities (with softmax activation applied in the output layer). We will talk about the second way.
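As a side note, the two presentations of `y_pred` always agree on the predicted class, since softmax preserves the ordering of the logits. A quick NumPy check with made-up logits:

```python
import numpy as np

logits = np.array([-2.3, 5.1, -1.7, -1.1])  # made-up raw outputs, no activation

# Numerically stable softmax: subtract the max before exponentiating
z = logits - logits.max()
probs = np.exp(z) / np.exp(z).sum()

# Both presentations point to the same predicted class
print(int(np.argmax(logits)), int(np.argmax(probs)))  # 1 1
```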
Now it's time for the code, to see exactly how Tensorflow assumes the nth neuron to be class label number n! The Tensorflow code below comes from here.
```python
return -tf.reduce_sum(target * tf.math.log(output), axis)
```
`output` means `y_pred`. `target` means `y_true`. OK?
Considering we have one sample, then:

- `output` can be thought of as a vector of 10 values (the softmax probabilities).
- `target` can be thought of as a vector of 10 values, where we know that if the 3rd element is `1`, then the vector represents class label number 3, because WE PROVIDED the `target`.
- `target * tf.math.log(output)` is an element-wise multiplication, which means that the 3rd element in `target` is multiplied with the 3rd element in `output` (after taking the log).
- Remember our `y_true` is one-hot-encoded, so ONLY one element is a `1` and the rest are `0`; so after the element-wise multiplication, only the 3rd element remains and the others are all zeros, OK?
Now look at this part of the slide in this video at 7:45:

The slide says the loss only picks one of the -log(a_n) terms, namely the one where y=n. Now our y=3, so we pick -log(a_3), and the above element-wise multiplication does the picking for us, because it made all elements other than the 3rd zero! (You see this?)

Then we do `tf.reduce_sum`, and since only the 3rd element is non-zero, the sum is equal to the 3rd element, and we finish the "picking" process that is presented in the part of the slide above.
Don't forget that, at the same time, we have implicitly ASSUMED that the 3rd element in the `output` vector corresponds to y=3; otherwise, we WOULDN'T HAVE multiplied the 3rd element in the `output` vector with the 3rd element in the `target` vector to implement the part of the slide above. The ELEMENT-WISE MULTIPLICATION is the key takeaway! Think again: why do we do the element-wise multiplication? What do we imply when using it?
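If it helps, the "picking" above can be replayed in plain NumPy (the probability values below are made up):

```python
import numpy as np

# One sample, 10 classes; the true class label is 3, given as a one-hot target
target = np.zeros(10)
target[3] = 1.0

# Made-up softmax output of the network (the 10 probabilities sum to 1)
output = np.array([0.01, 0.02, 0.03, 0.80, 0.02, 0.02, 0.02, 0.02, 0.03, 0.03])

# Element-wise multiply: every term except index 3 is zeroed out,
# so the sum "picks" log(output[3]) for us
loss = -np.sum(target * np.log(output))

print(np.isclose(loss, -np.log(output[3])))  # True
```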
This is the end. Now, I said there are 2 ways to present `y_true`, and there are 2 ways to present `y_pred`, so there are a total of 4 implementations. Again, I want a smaller step at a time, so we only talked about one of the four. However, the 4 implementations are different but similar, and the logic is the SAME.
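Just to preview how the two `y_true` presentations meet, here is a NumPy sketch (not Tensorflow's actual code; the function names and probabilities are mine) showing both give the same loss for the same prediction:

```python
import numpy as np

def cce_from_onehot(y_true_onehot, y_pred_probs):
    # y_true presented as a one-hot vector: multiply-and-sum does the picking
    return -np.sum(y_true_onehot * np.log(y_pred_probs))

def cce_from_label(y_true_label, y_pred_probs):
    # y_true presented as a class label number: index the probability directly
    return -np.log(y_pred_probs[y_true_label])

probs = np.array([0.05, 0.05, 0.05, 0.70, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02])
onehot = np.eye(10)[3]

print(np.isclose(cce_from_onehot(onehot, probs),
                 cce_from_label(3, probs)))  # True
```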
Thoughts?
Raymond
@Suresh23, if I were you, after reading my reply above, I would want to further the discussion and ask more questions, or perhaps even try to implement something by myself (in NumPy, for example). But I will not assume what you want to ask and answer it before you actually ask. My reply above is long enough, isn't it? So, I rely on you to tell me what you want to know, in order for this discussion to be something you mostly need.
If you think there is something you canât find in my last reply, make it clear to me and then I will see what I can get you.
Raymond
Hello @Suresh23
The example that I cited above is of 1 sample, not of m samples.
If you take a "single" training sample, assuming one-hot encoding, y will be a 1-d array of 10 elements. One of those 10 elements will have a value of 1 and the remaining will be 0 (since it's one-hot encoding). It is here that we have set the order, or class assignment.
The y[2] that I mentioned earlier has to be viewed in the context of a single sample, not in the context of "m" samples. If you have set y[2] = 1 (and the remaining 9 elements to 0) to represent digit 0, and y[4] = 1 (and the remaining 9 elements to 0) to represent digit 7, then this becomes the class assignment. Of course, we can also keep it simple: y[0] = 1 to represent digit 0, y[1] = 1 to represent digit 1, and so on…
Now, when you are using the model to predict, we follow the same class assignment that was provided in y, which the model has internalized during training.
Thank you for this explanation @rmwkwok. I'm going to try some of this logic myself before I ask more questions. Will come back after the holidays.
Sure. You may try to implement the other three versions, and we can look at your works together if there are any questions!