Output layer `category to unit mapping` determination

As in ML Specialisation -> Course 2 -> Week 2 Lab Assignment, I’ve created a neural network to read handwritten digits from images. It has 3 layers,
L1 = 25 units
L2 = 15 units.
L3 = 10 units each representing a category (digit in this case)
How do I know which unit in the final layer represents probability for which handwritten digit? For example, how do I know that unit 1 doesn’t correspond to probability for digit 7.

As per ChatGPT:
“To know which unit represents which digit, you can refer to the order of classes when the neural network was trained. Often, the order is assigned based on numerical order, as in the example above. However, it’s crucial to verify this order and ensure it matches the expected class labels in your specific application. If the order is not as expected, you may need to adjust it accordingly.”

Is there a more definitive rule or way of determining this category to unit mapping?

Hey @p_s_rathore,

It’s by how you label them. If you label digit seven as 1, then unit 1 represents digit seven. The algorithm only looks at the labels (which have to start from 0), and you are responsible for how labels are assigned to digits.
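To make that concrete, here is a minimal sketch in plain Python. The names `label_of_digit` and `digit_of_label` and the probabilities are made up for illustration and are not from the lab; the labelling deliberately gives digit 7 the label 1 to show that the mapping is whatever the training labels say it is.

```python
# Hypothetical labelling chosen by whoever prepares the training data.
# Digit 7 gets label 1 and digit 1 gets label 7, just to show the choice is arbitrary.
label_of_digit = {0: 0, 7: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 1: 7, 8: 8, 9: 9}

# Invert the mapping so a unit index (= label) can be turned back into a digit.
digit_of_label = {label: digit for digit, label in label_of_digit.items()}

# Made-up output of the 10-unit final layer for one image (values sum to 1).
probs = [0.01, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01]

# The predicted label is the index of the unit with the largest probability.
predicted_unit = max(range(len(probs)), key=probs.__getitem__)
predicted_digit = digit_of_label[predicted_unit]

print(predicted_unit, predicted_digit)
```

With the usual labelling, where each digit is its own label, `digit_of_label` is the identity and unit k simply means digit k.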


I recommend you not use ChatGPT for programming advice, or for help in working on the assignments.

It’s very likely to contain incorrect information.


I had this exact same question and thought experiment in my head today. To take it a step further, recall in lecture:

a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} = P(y = 1 \mid \vec{x})

Why is a_1 the probability of y = 1 (the label 1) in the first place? Is there something about the formula that makes it so? Digging into it more, it seems this is more of a design convention, but please correct me if I’m wrong. I think it also comes from how we do back propagation and train the model: to calculate the loss, we need some “yardstick” that each unit measures its output against (such as y = 1, or y = 2, and so on). I have not gotten to the back propagation lectures yet, but I hope this gets covered there and that it’ll all make more sense.
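One way to check that nothing in the formula itself ties a_1 to a particular label is to compute it by hand. The sketch below is plain Python with made-up values for z (indices are zero-based, so `a[0]` plays the role of a_1); it shows that swapping two inputs just swaps the corresponding outputs, so the formula treats every unit the same way.

```python
import math

def softmax(z):
    """Return softmax activations for a list of logits z."""
    m = max(z)  # subtract the max for numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a 4-unit output layer.
z = [2.0, 1.0, 0.5, -1.0]
a = softmax(z)

# Swap the first two logits: the first two activations swap as well.
z_swapped = [1.0, 2.0, 0.5, -1.0]
a_swapped = softmax(z_swapped)

print(a)
print(a_swapped)
```

Since softmax is symmetric in this sense, the meaning of each unit has to come from somewhere else, namely from how the loss pairs each label with a unit during training.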

Yes, @trandromeda, I think it is right to say that this is a design convention, and it is the design of how to calculate loss, as you said.

Therefore, speaking of implementation design, if we trace the source code of tf.keras.losses.SparseCategoricalCrossentropy, we will get to this line:

cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))

It performs the “y-dependent selection of losses” with array_ops.gather (ref), which treats labels as array indices into log_probs: for each example, only the log-probability at index y is picked out and negated.
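As a rough illustration of that indexing, here is a plain-Python stand-in for the gather step. It is not TensorFlow's implementation, and the probabilities and labels are made up; it only mimics the idea of using each label as an index into that example's row of log-probabilities.

```python
import math

# Made-up softmax outputs for 2 examples and 3 classes (each row sums to 1).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
log_probs = [[math.log(p) for p in row] for row in probs]

# Integer labels, one per example, starting from 0.
labels = [0, 2]

# For each example, use its label as an index into its row and negate:
# this is the per-example sparse categorical cross-entropy.
cost = [-row[y] for row, y in zip(log_probs, labels)]

print(cost)
```

Because the label y is used directly as the index, the loss for an example with label y only rewards unit y, which is why unit y ends up representing label y.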


See if you can follow the above and jump to the answer of your question.