I understand that we are trying to compare the directionality with cosine similairity instead of the magnitude of two vectors, but still practically speaking why not use coeff of correlation ? can someone give me a practical explanation? thx,

There are a number of different ways you could compute a similarity metric between two vectors. There are several forms of correlation coefficients, but those are fairly compute intensive. For the most commonly used definition, you need to compute the covariance and the standard deviation of both inputs. Or you could use the Euclidean distance between the two vectors, but that gives you a number between 0 and 2. I have not done any research to see if ML/DL people mention why they chose cosine similarity, but note that it’s very cheap to compute because of this mathematical relationship:

v \cdot w = ||v|| * ||w|| * cos(\theta)

Where \theta is the angle between the two vectors. So the cosine similarity can be computed as:

cos(\theta) = \displaystyle \frac {v \cdot w} {||v|| * ||w||}

But we normalize embedding vectors to have length one, which makes that computation extremely cheap: just a single dot product and GPUs are pretty good at performing those.