Softmax weights redundancy

Let’s take a look at a softmax layer with n nodes, where node i has weights wi and bias bi:

  • w1, b1
  • w2, b2
  • wn, bn

If we add the same values w and b to every node’s parameters, the result does not change. That is, a softmax layer with these parameters generates the same probabilities:

  • w1+w, b1+b
  • w2+w, b2+b
  • wn+w, bn+b
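The claim above can be checked numerically. Here is a minimal sketch in plain Python, assuming each node computes the logit z_i = w_i·x + b_i; the helper names (`softmax`, `layer`), the sizes, and the random values are my own choices for illustration, not something from the course material. Adding the same (w, b) to every node shifts every logit by the same constant w·x + b, which cancels in the softmax ratio.

```python
import math
import random

def softmax(logits):
    # Subtracting the max logit is the usual numerical-stability trick;
    # it relies on the same shift invariance being tested here.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer(x, weights, biases):
    # One logit per node: z_i = w_i . x + b_i (assumed layer form).
    logits = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(logits)

random.seed(0)
n_nodes, n_features = 4, 3  # arbitrary sizes for the illustration
x = [random.gauss(0, 1) for _ in range(n_features)]
weights = [[random.gauss(0, 1) for _ in range(n_features)]
           for _ in range(n_nodes)]
biases = [random.gauss(0, 1) for _ in range(n_nodes)]

# The same shift (w, b) added to every node's parameters.
w_shift = [random.gauss(0, 1) for _ in range(n_features)]
b_shift = random.gauss(0, 1)
shifted_weights = [[wi + si for wi, si in zip(w, w_shift)] for w in weights]
shifted_biases = [b + b_shift for b in biases]

p = layer(x, weights, biases)
p_shifted = layer(x, shifted_weights, shifted_biases)

# Compare with a tolerance, since the two results can differ by rounding.
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(p, p_shifted))
print(same)
```

The comparison uses a tolerance rather than exact equality because the shifted and unshifted logits go through slightly different floating-point arithmetic.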

Is this an issue for the optimization? Can the number of parameters be reduced by 2 (one w and one b)?

I think you should be the one who tells us why it is or is not an issue, along with your reasons, and then we can discuss from there.

Please also check this reply for how you could help drive the discussion. :wink: You don’t need to do anything right now, though; you can do so whenever you are ready.

Raymond