Training a Softmax Classifier: Why isn't y stored as an array of the correct class and y_hat indexed into the position of y?

When learning the loss function for softmax, which is:

L(y, y_hat) = -∑_k y_k * log(y_hat_k)

If y is only ever a vector with a single one in it, and the only thing being computed is log y_hat at the index of the true class, why is y stored as a vector of zeros and a single one when it could just be the class number?

For instance, if y was 2, then the summation collapses to the single term log(y_hat[2]), which could be obtained by indexing into y_hat at position y. It would be one calculation instead of a summation, and it seems to me it would save (albeit marginally) on the computation for the summation and on the memory used to store the zeros.

A lot of this is just me thinking that this way of doing it is pretty cool and wanting to share, but is there any reason why the industry standard (at least as of 2017) was to store y as a vector and not a single integer?

The point is that the summation is performed as a dot product between the “one hot” representation of the labels and the log of the \hat{y} value (a softmax vector output), so it is easy to vectorize. Doing it your way with a naive implementation would make it a loop over the samples, as opposed to a single vectorized operation over the whole batch. That’s really the point of “one hot” representation: it makes it very simple and efficient to express the computations in a vectorized way.
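For what it's worth, here is a minimal NumPy sketch comparing the two ways of computing the per-sample loss. The toy probabilities, the labels, and all variable names are made up for illustration and are not from the course. It also shows that NumPy's integer-array indexing is one way to do the index lookup without an explicit Python loop, so the gap between the two approaches may be smaller in practice than a naive loop suggests.

```python
import numpy as np

# Hypothetical toy batch: 4 samples, 3 classes (each row is a softmax output).
y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
# The same labels in "categorical" form: one integer class per sample.
labels = np.array([0, 1, 2, 1])

# "One hot" form: row i has a 1 in column labels[i] and 0 elsewhere.
y_one_hot = np.eye(3)[labels]

# One-hot version: elementwise product, then a sum over the class axis.
loss_one_hot = -np.sum(y_one_hot * np.log(y_hat), axis=1)

# Naive indexing version: an explicit Python loop over the samples.
loss_loop = np.array([-np.log(y_hat[i, c]) for i, c in enumerate(labels)])

# Vectorized indexing version: integer-array indexing picks one entry per row.
loss_indexed = -np.log(y_hat[np.arange(len(labels)), labels])

print(np.allclose(loss_one_hot, loss_loop))
print(np.allclose(loss_one_hot, loss_indexed))
```

All three give the same per-sample losses; they differ only in how the work is expressed.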

Of course the “one hot” representation is much more costly in terms of memory space, so what typically happens is that when you store your data in a file, you keep the “labels” in “categorical” form as you describe: just one integer class value per sample. Then it’s easy to convert them to “one hot” representation at runtime when you’re doing computations with the data (e.g. running the training). TF has a function to convert from categorical to one hot. But if you’re using TF, then you can also use the “sparse” version of the categorical cross entropy loss and it takes the labels in categorical form. I haven’t looked at the TF code but I would bet that it just does the one hot conversion internally “on the fly” and hides it from us.
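To make the "store as integers, convert at runtime" idea concrete, here is a rough NumPy sketch. The helper `to_one_hot` and the sample labels are hypothetical names I'm using for illustration, not a TF API. In TF itself, `tf.one_hot` does the conversion and `tf.keras.losses.SparseCategoricalCrossentropy` is the "sparse" loss that takes integer labels directly.

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Convert integer class labels to a (num_samples, num_classes) one-hot array.

    Hypothetical helper for illustration; assumes 0 <= label < num_classes.
    """
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set a single 1 per row, in the column given by that row's label.
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot

# Labels as they might be stored on disk: one integer per sample.
stored_labels = np.array([2, 0, 1], dtype=np.int64)

# Converted only when needed for the computation.
y = to_one_hot(stored_labels, num_classes=3)

# The one-hot array grows with the number of classes; the integer form does not.
print(stored_labels.nbytes, y.nbytes)
```

The memory difference is small here, but the one-hot array scales with the number of classes, which is why keeping the integer form on disk is usually the cheaper choice.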


Thank you so much! That makes a ton of sense.