In the lectures on Word2vec, Andrew says that softmax is expensive because we have to sum the denominator over every word. Why can't we just calculate it once and then reuse it? Also, how does hierarchical softmax work? I could not properly understand it from the video.
The denominator can't be reused because it isn't a fixed number: it depends on the current context word's embedding and on the model parameters, and those parameters are updated after every gradient step. So on every training iteration you'd have to compute the softmax() again.
If you're making thousands of iterations, and the vocabulary is very large, that's a lot of calculations. A minimal sketch of this is below.
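Here is a minimal numpy sketch of the skip-gram softmax, just to show where the cost comes from. The parameter names `E` and `theta` and the sizes are hypothetical, not from the lecture:

```python
import numpy as np

V, D = 10000, 300                       # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D)) * 0.01      # word embeddings (updated every training step)
theta = rng.normal(size=(V, D)) * 0.01  # output-layer weights (updated every training step)

def softmax_probs(context_word_id):
    e_c = E[context_word_id]            # embedding of the current context word
    logits = theta @ e_c                # V dot-products: one logit per vocabulary word
    logits -= logits.max()              # subtract max for numerical stability
    exp_logits = np.exp(logits)
    # The denominator sums over ALL V words. It depends on e_c and on theta,
    # both of which change after each gradient update, so it cannot be
    # computed once and cached across training steps or context words.
    return exp_logits / exp_logits.sum()

probs = softmax_probs(42)               # probability of every target word given word 42
```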
Hello @lakshay1612,
I searched for “expensive” in the transcript of the video and found only this:
(12:02) But the key problem with this algorithm with the skip-gram model as presented so far is that the softmax step is very expensive to calculate because needing to sum over your entire vocabulary size into the denominator of the softmax.
So the key reason it is expensive is that the sum in the denominator runs over the entire vocabulary.
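For reference, this is the softmax from that part of the lecture, writing $e_c$ for the context-word embedding, $\theta_t$ for the output weights of target word $t$, and $V$ for the vocabulary size (a sketch of the lecture's notation):

$$p(t \mid c) = \frac{e^{\theta_t^\top e_c}}{\sum_{j=1}^{V} e^{\theta_j^\top e_c}}$$

The sum in the denominator runs over all $V$ words, and it changes whenever the context word $c$ or the parameters change, which is why it has to be recomputed rather than reused.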
Cheers,
Raymond