My experience is that fewer lines of programmer-written code means fewer bugs, less to document, less to test, and faster time to value. We could write our own implementations of sigmoid, too, right? Or the exponential function. Or matrix multiplication. Or write it all in assembler instead of TF or Python. But we generally use the productivity of higher-level languages and their built-in library functions when we can.

Also, to confirm one thing: unlike a CNN, softmax cannot be shallow, because the output of each activated unit depends on the z-values from the previous layer. It is as if each unit is bound to accept the values of all the inputs in the denominator.

As you may know, softmax is actually an extension of the sigmoid function.

You would use sigmoid for binary classifications, and then you would use softmax for multi-class problems.

You could use softmax for a binary classification as well, but sigmoid is the simpler choice there, since it needs only a single output.
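To make the binary case concrete, here is a minimal NumPy sketch. The logit values and the helper names `sigmoid` and `softmax` are my own, not from the course lab. It checks that a two-output softmax gives the same probability for class 1 as a sigmoid applied to the difference of the two logits.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Hypothetical logits for a two-class problem: [class 0, class 1].
z = np.array([0.3, 1.7])

p_softmax = softmax(z)[1]         # probability of class 1 from a 2-unit softmax
p_sigmoid = sigmoid(z[1] - z[0])  # sigmoid of the logit difference

print(p_softmax, p_sigmoid)
```

The two printed numbers should agree up to floating-point error, which is why a single sigmoid unit is enough for the two-class case.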

You could not use a single sigmoid for multi-class classification, as you would get only the probability of one event, while softmax provides a vector of probabilities, where each entry is the probability for one of the classes.

If we were to use only sigmoid instead of softmax in a multi-class case, how would we assign the resulting probability to a given class? Let's say we have 5 classes, and the sigmoid returns 0.4 as the result. How would we interpret this number? In softmax, however, since we have a vector and each entry learns to represent a class, the 0.4 result would be in the context of all classes.

I misunderstood the original post, thinking it was asking why use a library function when it could be written by hand. But now I think it is about n output nodes each using sigmoid to produce 1 output, the set of which could then be normalized, versus 1 output node using softmax to produce n probabilities already normalized. Conceptually a column vs. row kind of thing? If the loss function and label shape are adjusted accordingly, would the two approaches be functionally equivalent?

I would say that this looks pretty much like softmax.

Softmax is an extension of sigmoid. It is like multiple sigmoids stacked in a vector, or put in your words, "it is about n output nodes each using sigmoid to produce 1 output".
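As a rough numerical check of the two approaches being compared, here is a small NumPy sketch. The five logit values are made up for illustration, and `sigmoid` and `softmax` are hand-written helpers. It normalizes n independent sigmoid outputs and puts them next to softmax on the same logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exp_z / np.sum(exp_z)

# Hypothetical logits for 5 classes.
z = np.array([2.0, 1.0, 0.1, -1.0, 0.5])

s = sigmoid(z)
normalized_sigmoids = s / s.sum()  # n sigmoids, then normalized to sum to 1
soft = softmax(z)                  # softmax on the same logits

same_values = np.allclose(normalized_sigmoids, soft)
same_argmax = np.argmax(normalized_sigmoids) == np.argmax(soft)

print(normalized_sigmoids)
print(soft)
print("same values:", same_values, "| same argmax:", same_argmax)
```

For these logits the two vectors both sum to 1 and pick the same winning class (sigmoid is monotonic, so the ranking is preserved), but the individual probabilities differ. So the analogy holds for the ranking, while the numbers themselves are not interchangeable.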

If you're doing multi-class classification and for a prediction you only need to find the output with the highest value, then you don't need softmax().

softmax() won't change which output has the highest value.
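A quick way to see this, as a sketch with made-up logits and a hand-written `softmax` helper: the index of the largest logit and the index of the largest softmax probability are the same.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exp_z / np.sum(exp_z)

# Hypothetical raw outputs (logits) of the last layer for one sample.
z = np.array([1.2, -0.4, 3.1, 0.0])

probs = softmax(z)

best_from_logits = int(np.argmax(z))     # predicted class without softmax
best_from_probs = int(np.argmax(probs))  # predicted class with softmax

print(best_from_logits, best_from_probs)
```

Softmax is still needed when you want the probabilities themselves, for example inside the loss during training.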

I have one follow-up question: does softmax calculate a conditional probability or an independent probability? According to Wikipedia, the function is based on Luce's choice axiom, which is for independent probability, but the text in the course lab says otherwise.

The softmax spans all of the outputs. A change in z0, for example, will change the values of a0-a3. Compare this to other activations such as ReLU or sigmoid, which have a single input and a single output.
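Here is a small sketch of that coupling, assuming four made-up logits z0-z3 and hand-written `sigmoid` and `softmax` helpers. It bumps only z0 and checks which outputs move under each activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exp_z / np.sum(exp_z)

# Hypothetical logits z0..z3.
z = np.array([1.0, 2.0, 0.5, -1.0])

# Change only z0.
z_changed = z.copy()
z_changed[0] += 1.0

# True where an output moved after the change to z0.
softmax_moved = ~np.isclose(softmax(z), softmax(z_changed))
sigmoid_moved = ~np.isclose(sigmoid(z), sigmoid(z_changed))

print("softmax outputs that moved:", softmax_moved)
print("sigmoid outputs that moved:", sigmoid_moved)
```

With softmax, every one of a0-a3 moves because they share the same denominator. With an element-wise sigmoid, only the first output moves.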

The softmax calculates the probability of each class as a function of the probability among all the classes. I would say this is a conditional probability.

The reference (Luce's choice axiom) is clearly a case of independent probability: You have, say, a bag of items, and you randomly and blindly pick one. In this case, I can clearly see how this is independent.

I understand softmax differently. It is like a "thought process" where the model is considering all the alternatives to finally define which one has the highest probability.

On a parallel note, you mention something important: softmax is not only a way to find classes, but it can also act as a normalization technique.

The way I used to think about it is like giving a yes/no for each class and then normalising. For example: is it class 1 or not (this is binary, so sigmoid can be used); then, on the next neuron, is it class 2 or not (again, binary); and so on.

Lastly, we use it to normalise between 0 and 1, which sigmoid does by default because of its nature.

@Juan_Olano is this the right way of thinking for a newbie?

In your last comment you are referring to a multi-class classification, where we have a data set that represents multiple classes, like multiple fruits: apples, bananas, coconuts, etc.

In this case, each sample must be labeled with one and only one class (one sample cannot belong to more than 1 class at the same time).

If you have "n" classes, then in your model you define the last layer as a Dense layer with "n" units and activation=softmax. This last layer will assign each unit to one class. For example:

Unit_0 = apple
Unit_1 = banana
Unit_2 = coconut
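The unit-to-class mapping above can be sketched in plain NumPy. This is not the Keras Dense layer itself, just a stand-in for its forward pass, and the weights, the input sample, and the feature count are random or made up, so the predicted fruit is meaningless here.

```python
import numpy as np

# One output unit per class: Unit_0, Unit_1, Unit_2.
class_names = ["apple", "banana", "coconut"]
n_classes = len(class_names)
n_features = 4  # hypothetical number of inputs to the last layer

rng = np.random.default_rng(0)

# Stand-in for a trained last layer: random weights, zero biases.
W = rng.normal(size=(n_features, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return exp_z / np.sum(exp_z)

x = rng.normal(size=n_features)  # one hypothetical sample
z = x @ W + b                    # one logit per unit
probs = softmax(z)               # one probability per class, summing to 1

predicted = class_names[int(np.argmax(probs))]

for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")
print("predicted:", predicted)
```

The position of each probability in the output vector is what ties it to a class, which is why the labels have to use the same ordering as the units.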

Fast forward... at some point the model will tell you if a sample is a banana with a set of probabilities that may look like this: