Hello everyone,

There is a point I don't understand about multiclass classification with softmax. We know mathematically that the computation steps of softmax end up assigning the highest probability to the highest z, so why don't we simply assign 1 to the highest value of z(L), i.e. apply a hard max, instead of doing all the calculations of the softmax function?

Yes, it is a good point that softmax is a monotonic function, so the maximum input will produce the maximum output. But what will you use as your loss function if you eliminate the softmax activation? There are a number of advantages that come from converting the predictions of the network into something that looks like a probability distribution. One big advantage is that you then have the cross entropy loss function as the ideal vehicle to drive the training.
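To make the monotonicity point concrete, here is a small plain-Python sketch (toy logit values chosen for illustration): the largest logit always gets the largest probability, so at *prediction* time you could just take the argmax. But a hard max output like [1, 0, 0] is flat almost everywhere, so there is no useful gradient to train on, while the smooth softmax probabilities feed directly into a differentiable loss.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits z(L) for a 3-class problem.
z = [2.0, 1.0, 0.1]
p = softmax(z)

# Monotonicity: the index of the largest logit is the index of the
# largest probability, so argmax(p) == argmax(z) always holds.
assert p.index(max(p)) == z.index(max(z))

# The outputs form a probability distribution (they sum to 1),
# which is what the cross entropy loss needs as its input.
print(p)
```

So hard max and softmax agree on *which* class wins; the difference is that softmax also tells you *how confident* the network is, in a smooth, differentiable way.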

Can't we use something like ReLU or the SVM cost function?

What is the "ReLU cost function"? ReLU is an activation function, not a cost function. You can try other cost functions, but for classification problems everyone uses cross entropy loss. There are very similar formulations for the binary and multiclass classification cases. Of course the standard is to use sigmoid as the activation in the binary case and softmax in the multiclass case. Both of those are exactly paired with cross entropy from a mathematical properties perspective. It's not an accident that they use those pairings …
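One way to see the "exact pairing" is to work through the gradient. A sketch in plain Python (toy logits and a hypothetical one-hot label): when you compose softmax with cross entropy, the gradient of the loss with respect to the logits collapses to the very simple form p − y, which is a big part of why this pair is the standard.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, y):
    # y is a one-hot label vector; the loss is -log of the
    # probability the network assigned to the true class.
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [2.0, 1.0, 0.1]   # toy logits
y = [1.0, 0.0, 0.0]   # hypothetical one-hot label: class 0 is correct

p = softmax(z)
loss = cross_entropy(p, y)

# The celebrated simplification: d(loss)/dz_i = p_i - y_i.
# No messy softmax Jacobian survives in the final expression.
grad = [pi - yi for pi, yi in zip(p, y)]
print(loss, grad)
```

The same cancellation happens in the binary case with sigmoid plus binary cross entropy, which is why those two pairings fit together so cleanly.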

Of course this is an experimental science. If you think you have a better idea for a different function to use, or perhaps just want to understand why people don't use, say, MSE as the cost function for classification, you are welcome to run the experiments. Try your alternative method and see what happens. If you find something that works better, publish the paper and tell the world your new discovery!