In the softmax regression video, we learn how to encode output data as binary vectors (a chick becomes [0 0 1 0], a cat [0 1 0 0], etc.).

Is this also what we should do when our inputs (Xs) are discrete and not continuous?

For example, suppose I’m building a fictional neural network that decides whether I should make a decision (Y) based on several Xs, one of them being the category of animal (is it a chick, cat, dog?). Then the value for that X would be one of X={cat, chick, dog, elephant}. Should I convert that X into 4 variables, X1=is it a cat {0,1}, X2=is it a chick {0,1}, etc., before training the NN?

many thanks for your help!

It’s not necessary to convert the input data from continuous to discrete, but sometimes some features may be discretized for various reasons. The neural network can learn from continuous inputs and choose from discrete categories at the output.

many thanks for your help Gent

but if the raw input data are discrete, shouldn’t they be “binarized”, i.e. converted to 0 or 1? (see my example).

if not, then I don’t see what sense it makes when we use linear functions. For example, if X=“cat”, then what sense does x1X+b make? A discrete value like a word cannot be multiplied by a number, can it?

any input on this would be very useful. many thanks

Yes, it depends on the example. Some inputs may need to be made discrete, while others that are numeric may not, though other operations such as normalization may be applied to them.
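To make the question above concrete, here is a minimal sketch in plain Python of turning the animal category into four 0/1 indicator features, so that the linear step (weights times inputs, plus bias) is well defined. The category order, the weights, and the bias are made up for illustration, and the helper names `one_hot` and `linear` are hypothetical, not from the course.

```python
# Hypothetical category list from the question; the order fixes the feature order.
CATEGORIES = ["cat", "chick", "dog", "elephant"]


def one_hot(value, categories=CATEGORIES):
    """Return 0/1 indicators with a single 1 at the category's position."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def linear(x, weights, bias):
    """Compute w . x + b for one example."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Made-up weights (one per indicator feature) and bias, purely for illustration.
weights = [0.5, -1.0, 2.0, 0.25]
bias = 0.1

x = one_hot("chick")
z = linear(x, weights, bias)

print(x)  # [0.0, 1.0, 0.0, 0.0]
print(z)  # the weight attached to the "chick" indicator, plus the bias
```

With this encoding, the word "chick" is never multiplied by anything. Only its indicator vector is, and the linear step simply picks out the weight for that category.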

thank you for your precious help Gent

The other general point here is that the activation values in the internal layers of the network are not “discrete” values. The only place that you get discrete values is at the output layer when you apply *softmax* (to get a probability distribution) and then select the highest probability as the discrete answer. At that level, you have a choice:

You can use “argmax” style conversion and leave your discrete values as numbers from 0 to K - 1, where K is the number of output classes. Then you would use a version of cross entropy loss that can handle that representation of the labels, e.g. “sparse” categorical cross entropy loss in TF.

Or you can convert the softmax output into the “one hot” representation, which is what you were showing in your earlier posts. In that case, you would then use a cross entropy loss function which takes “one hot” inputs, e.g. categorical cross entropy loss in TF.
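The two label representations described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the TF implementation: the function names are hypothetical and the output scores are made up. It applies softmax, picks the argmax as the discrete answer, and computes the cross entropy loss once from an integer label and once from a one hot label.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(probs, label_index):
    """Loss when the label is an integer in 0..K-1."""
    return -math.log(probs[label_index])


def one_hot_cross_entropy(probs, one_hot_label):
    """Loss when the label is a one hot vector of length K."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


logits = [1.0, 3.0, 0.5, -1.0]  # made-up output-layer scores, K = 4
probs = softmax(logits)

# "argmax": the index of the highest probability is the discrete prediction.
predicted = max(range(len(probs)), key=probs.__getitem__)

label_index = 1                        # label stored as a number from 0 to K - 1
one_hot_label = [0.0, 1.0, 0.0, 0.0]   # the same label in one hot form

loss_sparse = sparse_cross_entropy(probs, label_index)
loss_one_hot = one_hot_cross_entropy(probs, one_hot_label)

print(predicted)
print(loss_sparse, loss_one_hot)
```

Because every term of the one hot sum is multiplied by 0 except the one at the true class, both functions reduce to the negative log of the probability assigned to the true class.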

please ignore my previous post. I get it now :-). many thanks!!

Great! The answer is that the two methods are exactly equivalent in terms of the meaning of the results. The “sparse” style takes less memory space to represent the labels than the “one hot” style. That’s the only real difference.

To conserve memory, I think the most common method is to store the labels as numbers from 0 to K - 1. Then it’s easy to convert them on the fly to “one hot” form when you run the training. The “one hot” form is more efficient to process in terms of CPU usage. I’ve never actually looked at the TF code, but my guess is that the “sparse” version of the cross entropy loss function probably does the “one hot” conversion internally.
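As a rough sketch of the "store as numbers, convert on the fly" idea, the plain Python below keeps the labels as integers and only builds the one hot rows for one batch at a time. The names `to_one_hot` and `one_hot_batches` and the sample labels are hypothetical. In a real TF project a library utility would normally do this conversion.

```python
def to_one_hot(label, num_classes):
    """Convert one integer label in 0..num_classes-1 to a one hot row."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


def one_hot_batches(labels, num_classes, batch_size):
    """Yield one hot label batches, converting only one batch at a time."""
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        yield [to_one_hot(y, num_classes) for y in batch]


# Labels stored compactly as integers (made-up example data, K = 4).
stored_labels = [2, 1, 0, 3, 1]

batches = list(one_hot_batches(stored_labels, num_classes=4, batch_size=2))
for batch in batches:
    print(batch)
```

Only the integer list lives in memory for the whole dataset. Each one hot batch can be discarded as soon as the training step that used it has finished.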

those are great insights!! storing as numbers and then converting to “one hot” on the fly is probably what I’ll do in my project. many many thanks!