When learning the loss function for softmax, which is:
L(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)
if y is only ever a one-hot vector (a single 1 and the rest 0s), and the only term that survives the sum is log \hat{y} at the index of the true class, why is y stored as a vector of zeros with a single one when it could just be its class number?
For instance, if y were class 2, the cost could just be -log(\hat{y}[2]), obtained by indexing into \hat{y} at position 2. That would be a single lookup instead of a summation, and it seems to me it would save (albeit marginally) on both the computation of the sum and the memory spent storing the zeros.
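For concreteness, here's a small NumPy sketch of what I mean (the numbers are made up, just to show that the full summation and the direct index lookup agree):

```python
import numpy as np

y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # softmax output over 4 classes
y = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot label for class 2

loss_sum = -np.sum(y * np.log(y_hat))   # full summation over all classes
loss_idx = -np.log(y_hat[2])            # direct lookup at the class index

assert np.isclose(loss_sum, loss_idx)   # both equal -log(0.6)
```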
A lot of this is just me thinking this way of doing it is pretty cool and wanting to share, but is there any reason why the industry standard (at least as of 2017) was to store y as a vector rather than a single integer?
The point is that the summation is performed as a dot product between the "one hot" representation of the labels and the \hat{y} values (the softmax output vector), so it actually is more efficient because it's vectorized. Doing it your way would turn it into a loop over the samples, as opposed to a single matrix multiply. That's really the point of the "one hot" representation: it makes it very simple and efficient to express the computations in a vectorized way.
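Here's a minimal NumPy sketch of that difference (made-up data, not from any course code): with a one-hot label matrix, the losses for the whole batch come out of one elementwise multiply and sum, whereas the categorical form written naively becomes a Python loop over samples:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))                                    # batch of 5, 4 classes
Y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

labels = np.array([2, 0, 3, 1, 2])   # categorical (integer) labels
Y = np.eye(4)[labels]                # one-hot label matrix, shape (5, 4)

# Vectorized: one elementwise multiply + sum for the whole batch
losses_vec = -np.sum(Y * np.log(Y_hat), axis=1)

# Naive categorical version: a Python loop over the samples
losses_loop = np.array([-np.log(Y_hat[i, labels[i]]) for i in range(5)])

assert np.allclose(losses_vec, losses_loop)
```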
Of course the "one hot" representation is much more costly in terms of memory space, so what typically happens is that when you store your data in a file, you keep the labels in "categorical" form as you describe: just one integer class value per sample. Then it's easy to convert them to "one hot" representation at runtime when you're doing computations with the data (e.g. running the training). TF has a function (tf.one_hot) to convert from categorical to one hot. But if you're using TF, you can also use the "sparse" version of the categorical cross entropy loss, which takes the labels in categorical form. I haven't looked at the TF code, but I would bet that it just does the one hot conversion internally "on the fly" and hides it from us.
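As a sketch of the two forms side by side (assuming TF 2.x; the values here are made up), tf.one_hot plus the dense loss on one side, the sparse loss taking integer labels directly on the other:

```python
import tensorflow as tf

y_hat = tf.constant([[0.1, 0.2, 0.6, 0.1],
                     [0.7, 0.1, 0.1, 0.1]])  # softmax outputs, batch of 2
labels = tf.constant([2, 0])                 # categorical (integer) labels

# Dense version: convert the integer labels to one-hot first
dense_loss = tf.keras.losses.CategoricalCrossentropy()(
    tf.one_hot(labels, depth=4), y_hat)

# Sparse version: takes the integer labels directly
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(labels, y_hat)

print(dense_loss.numpy(), sparse_loss.numpy())  # the two should match
```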
Thank you so much! That makes a ton of sense.