Hello, @Artem_Vashina, the idea is the same.
Let’s say we have a sample and we get a prediction for it. Since there are 10 classes, the prediction has 10 probability values, for example (illustrative numbers only):
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
One-hot label
Now, with [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
as the label, essentially, it picks only the probability value at index 2 (0-based) to compute the cost. The “picking” can be easily done with an element-wise multiplication, because after that, everything else is zero except the one at index 2.
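Here is a minimal NumPy sketch of that element-wise picking, using the illustrative numbers from above (a real softmax output would sum to 1):

```python
import numpy as np

# Illustrative numbers from above (a real softmax output would sum to 1)
prediction = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
one_hot_label = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

# Element-wise multiplication zeros out everything except index 2 ...
masked = prediction * one_hot_label   # [0. 0. 0.2 0. 0. 0. 0. 0. 0. 0.]

# ... so summing over the result recovers the single picked probability
picked = masked.sum()                 # 0.2
print(masked, picked)
```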
Digit label
With 2
as the label, the picking is even more intuitive, right? There are functions that do exactly this kind of picking.
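In plain NumPy, for instance, that picking is just indexing (a sketch with the same illustrative numbers):

```python
import numpy as np

prediction = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
digit_label = 2

# With an integer (digit) label, "picking" is just indexing
picked = prediction[digit_label]      # 0.2
print(picked)
```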
After picking.
With the picked probability, we take its log and compute the cost from it. In other words, only the probability for the true class is involved in the cost, and all the others are ignored.
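Putting it together for this one sample, a sketch of the “pick, then log” cost (still the illustrative numbers; the per-sample cross-entropy is minus the log of the picked probability):

```python
import numpy as np

prediction = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
label = 2

# Cross-entropy for this single sample: minus the log of the picked probability
cost = -np.log(prediction[label])     # -log(0.2) ≈ 1.609
print(cost)
```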
In your last message you emphasized “closer to [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]”, which is a very good and precise intuition, so I suppose you might wonder: given that we ignore the others, will the prediction still get closer to that distribution?
The answer is yes! Involving only the true class in the computation of the cost does not mean that only the predicted probability for the true class moves toward 1. At the same time, the predicted probabilities for all of the false classes also move toward 0, because each and every probability value, including the picked one, is computed from the model’s outputs for all classes. Remember our softmax formula:
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}}$$
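A quick sketch of that coupling, assuming a plain NumPy softmax over some made-up logits: raising only the logit of the true class (index 2) forces every other probability down, because they all share the same denominator.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
before = softmax(logits)

# Raise only the true class's logit (index 2); because every probability
# shares the same denominator, all the other probabilities must drop.
logits[2] += 2.0
after = softmax(logits)

print(before.round(3))
print(after.round(3))
```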
I will stop here for now. We can dig deeper if you want. Let us know.
Cheers!
Raymond