Greetings!!
The formula for computing the probability of the target word given the context word is given as follows in the second week of the Sequence Models course. However, I am unable to understand the intuition behind this formula. Could anyone kindly explain?
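The exponential is used because the loss function uses a log, so when you compute the partial derivatives of the loss (to get the gradients), the log and exp cancel in a similar way to the logistic regression gradients. The exponential also keeps every term positive, so dividing by the sum gives values that behave like probabilities.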
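To make the cancellation concrete, here is a short derivation sketch. It assumes the formula in question is the softmax from the lecture, with V standing for the vocabulary size and y_j for the one-hot label (1 for the target word, 0 otherwise); those symbol names are my own choice, not taken from the post above.

$$
p(t \mid c) = \frac{\exp(\theta_t^{\top} e_c)}{\sum_{j=1}^{V} \exp(\theta_j^{\top} e_c)}
$$

$$
\mathcal{L} = -\log p(t \mid c) = -\theta_t^{\top} e_c + \log \sum_{j=1}^{V} \exp(\theta_j^{\top} e_c)
$$

$$
\frac{\partial \mathcal{L}}{\partial \theta_j} = \bigl(p(j \mid c) - y_j\bigr)\, e_c
$$

The log removes the exp in the numerator, and what remains is "prediction minus label, times the input", which has the same shape as the logistic regression gradient.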
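“theta” is an older notation that Andrew uses to indicate the trained weights in some of his other courses. The transpose in “theta_transpose” is there so that the dimensions of theta and e_c are compatible: theta_t transposed times e_c is then a dot product, which gives a single score for each candidate target word.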
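If it helps to see the dimensions, here is a minimal NumPy sketch. It assumes the formula is a softmax over the scores theta_t^T e_c; the function name `softmax_word_probs`, the tiny sizes, and the random values are all made up for illustration and are not from the course code.

```python
import numpy as np

def softmax_word_probs(theta, e_c):
    """Return p(t | c) for every word t, assuming a softmax over theta_t^T e_c."""
    # theta has shape (vocab_size, embed_dim); each row is one theta_t.
    # theta @ e_c computes the dot product theta_t^T e_c for every row at once.
    scores = theta @ e_c                 # shape (vocab_size,)
    scores = scores - scores.max()       # subtract the max for numerical stability
    exp_scores = np.exp(scores)          # exp makes every score positive
    return exp_scores / exp_scores.sum() # normalize so the values sum to 1

# Hypothetical toy sizes: 5 words in the vocabulary, 3-dimensional embeddings.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
theta = rng.normal(size=(vocab_size, embed_dim))  # stand-in for trained weights
e_c = rng.normal(size=embed_dim)                  # stand-in for the context embedding

probs = softmax_word_probs(theta, e_c)
print(probs.shape)  # one probability per word in the vocabulary
print(probs.sum())  # approximately 1
```

Each theta_t and e_c must have the same length for the dot product to be defined, which is what the transpose expresses in the written formula.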