OK, I understand the part where we take the exponential of each element of the matrix and then divide every element by the sum. My question is: we could have done the division step without taking the exponent, right? Why do we need to exponentiate before dividing?

Hi @Kamal_Nayan thank you for your question,

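If we do not take exponents before dividing in the softmax function, the **resulting values are not guaranteed to be a valid probability distribution over the classes**. Logits can be any real number, including negative ones. Dividing them by their plain sum still gives values that add up to 1 (as long as the sum is not zero), but individual values can come out negative or greater than 1, so they would **not all lie between 0 and 1**. If the logits happen to sum to zero, the division is not even defined. The softmax function would then fail at its primary purpose of converting arbitrary real values into probabilities.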
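To make this concrete, here is a minimal sketch comparing the two approaches. The logits are made up for illustration, and the function names `divide_by_sum` and `softmax` are my own, not taken from any course code.

```python
import math

def divide_by_sum(logits):
    """Normalize by the plain sum, with no exponent (the approach asked about)."""
    total = sum(logits)
    return [x / total for x in logits]

def softmax(logits):
    """Exponentiate first, then normalize by the sum of the exponentials."""
    m = max(logits)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]  # hypothetical logits, one of them negative

naive = divide_by_sum(logits)
probs = softmax(logits)

print("divide only:", naive)  # sums to 1, but contains a negative value and a value above 1
print("softmax:    ", probs)  # every value is strictly between 0 and 1, and they sum to 1
```

With these inputs, the divide-only version still sums to 1 but produces values outside the range [0, 1], which is why it cannot be read as a set of probabilities.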

And can you explain how exactly the exponent helps us convert those numbers into probabilities?

Can you give me a statistical view of this?

$e^{x}$ has the mathematical properties needed to convert a logit into a probability. These properties are:

- Positivity: $e^{x}$ always returns a positive value, which matters because probabilities cannot be negative.
- Non-linearity: $e^{x}$ grows faster than linearly, so differences between logits are amplified. The ratio of two class probabilities is $e^{x_i - x_j}$, so the class with the largest logit gets a disproportionately large share, which makes the model more confident in its predictions.
- Normalization: each $e^{x_i}$ is divided by the sum of $e^{x_j}$ over all classes, so the resulting probabilities sum to 1.

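Positivity and normalization are what the Kolmogorov axioms of probability theory require here: every probability is non-negative and the probabilities over all classes sum to 1. Non-linearity is not one of the axioms, but it is a useful modeling property.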
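The properties listed above can be checked numerically. This is only a sketch with made-up logits; `softmax` here is a hand-written helper, not a library function.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    m = max(logits)  # shift by the max for numerical stability; the output is unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]  # hypothetical logits
probs = softmax(logits)

# Positivity: every output is strictly greater than zero.
print(all(p > 0 for p in probs))

# Normalization: the outputs sum to 1 (up to floating-point error).
print(math.isclose(sum(probs), 1.0))

# Non-linearity: a gap of 1 between two logits becomes a probability ratio of e.
ratio = probs[2] / probs[1]
print(ratio, math.e)

# Doubling the logits widens the gaps, so the top class takes a larger share.
sharper = softmax([2 * x for x in logits])
print(max(probs), max(sharper))
```

The last comparison shows the "more confident" effect: the same ordering of classes, but with larger gaps between logits, gives the top class a noticeably higher probability.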

Thanks for this