Why "unnormalized log probabilities"?


Why are the probabilities in the rnn_step_forward function in utils.py called unnormalized log probabilities? This is from the 2nd assignment of the 1st week.
Maybe only the second comment, "probabilities for next chars", should be kept?

def rnn_step_forward(parameters, a_prev, x):

    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # hidden state
    p_t = softmax(np.dot(Wya, a_next) + by) # unnormalized log probabilities for next chars # probabilities for next chars

    return a_next, p_t

Shouldn’t only the argument of the softmax function be called unnormalized log probabilities?
Softmax is defined as follows (for simplicity, I omitted the max(z) subtraction that is included in the original definition in utils.py):

p_{j}=\frac{e^{z_{j}}}{\sum_{i} e^{z_{i}}}
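
For reference, here is a minimal sketch of a softmax with the max(z) subtraction. It is my assumption of what such a definition looks like, not a copy of the one in utils.py, and the values of z are made up for illustration. It also shows that adding a constant to every z_j leaves the output unchanged, which is one way to see why the inputs are only defined up to normalization.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) does not change the result but avoids overflow in exp.
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

z = np.array([2.0, 1.0, 0.1])   # hypothetical scores
p = softmax(z)                  # probabilities that sum to 1
p_shifted = softmax(z + 100.0)  # same scores shifted by a constant

print(p, np.sum(p))
print(np.allclose(p, p_shifted))
```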

If we take -\log(p_{j}), the result is linear in z_{j} (up to an additive constant that is the same for every j) but not normalized. So it now makes sense to me to call z_{j} an unnormalized log probability. The expression for -\log(p_{j}) is:

-\log p_{j}=-\log \frac{e^{z_{j}}}{\sum_{i} e^{z_{i}}}=-\left(\log e^{z_{j}}-\log \sum_{i} e^{z_{i}}\right)=-\left(z_{j}-\log \sum_{i} e^{z_{i}}\right),

and, after subtracting the constant \log \sum_{i} e^{z_{i}}, the proportionality coefficient is equal to -1:

k=\frac{-\log p_{j}-\log \sum_{i} e^{z_{i}}}{z_{j}}=-1

Here is an example in Python:

z = np.random.randn(5)
-np.log(softmax(z))

[1.77946694 1.5168839 2.20200317 0.82859276 2.73903499]
These values are clearly not normalized, but each one differs from -z_j only by the same constant.

k = -((np.log(softmax(z)) + np.log(np.sum(np.exp(z))))/z)

[-1. -1. -1. -1. -1.]

Is my reasoning correct?

I agree, since the activation function there is tanh(), “log” does not seem appropriate.

Dear Tom,

Why does the activation function matter?
I think, in principle, we could use any activation function to compute a_next, and softmax would still be applied to a_next multiplied by the Wya weights plus by, wouldn’t it? So softmax would operate on np.dot(Wya, a_next) + by.


Yes, it would.
But there’s no “log” involved unless you’re using something that has a log function in the cost equation.
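
As a rough illustration of where a log does show up, here is a minimal sketch assuming a cross-entropy cost, -log of the probability assigned to the correct next character. The scores z and the index target are made-up values, and the softmax helper is my own stand-in rather than the one from utils.py.

```python
import numpy as np

def softmax(z):
    # Stand-in softmax with the usual max(z) subtraction for stability.
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

z = np.array([2.0, 1.0, 0.1])  # hypothetical values of np.dot(Wya, a_next) + by
target = 0                     # hypothetical index of the correct next character

# Cross-entropy computed from the softmax output.
loss_from_probs = -np.log(softmax(z)[target])

# The same cost written directly in terms of the scores:
# -log p_j = log(sum_i exp(z_i)) - z_j
loss_from_scores = np.log(np.sum(np.exp(z))) - z[target]

print(loss_from_probs, loss_from_scores)
```

The two numbers agree, so the log only enters through the cost, not through the forward step itself.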

I think I get it now. Thanks!