'g' in neutralize() week 2 assignment


In the first programming assignment for week 2, there is an array ‘g’ passed into the function neutralize(), which corresponds to the axis we want to neutralize. Its dimension is (50,). How is this ‘g’ determined?

From what I could understand, word embeddings with 50 ‘features’ would have dimensions (50, vocab_size). And the embedding corresponding to a given ‘feature’ (say gender) would have shape (1, vocab_size). But clearly this is not the case for ‘g’ and I didn’t quite understand why. I would expect (50,) to be the shape for a word in the vocabulary, say, ‘receptionist’…

I could use some help to understand how ‘g’ is computed.

Thank you!

In this assignment, g is calculated in the previous cell as follows.

g = word_to_vec_map['woman'] - word_to_vec_map['man']

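Each word vector consists of 50 features, as you wrote, and this is a simple element-wise subtraction. So g has the same dimension as any word vector, i.e., (50,). In other words, g is not one row of the (50, vocab_size) embedding matrix; it is a direction in the same 50-dimensional space that every word vector lives in.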
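To make the shapes concrete, here is a minimal sketch. It assumes word_to_vec_map is a plain dict from word to a NumPy vector of length 50; the three-word vocabulary and the random vectors are stand-ins for illustration only, not the assignment's real embeddings.

```python
import numpy as np

# Toy stand-in for the assignment's word_to_vec_map (random values, hypothetical vocabulary).
rng = np.random.default_rng(0)
vocab = ["man", "woman", "receptionist"]
word_to_vec_map = {w: rng.standard_normal(50) for w in vocab}

# Same computation as in the assignment: a difference of two word vectors.
g = word_to_vec_map["woman"] - word_to_vec_map["man"]

# Stacking the word vectors as columns gives the (50, vocab_size) matrix view.
E = np.stack([word_to_vec_map[w] for w in vocab], axis=1)

g_shape = g.shape            # shape of the gender direction
E_shape = E.shape            # shape of the full embedding matrix
row_shape = E[0:1, :].shape  # shape of one "feature" row across the vocabulary

print(g_shape, E_shape, row_shape)
```

Here g has the shape of a single column of E (one word), not of a row (one feature across all words).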

This is based on assumptions like these:

  • There is gender bias in the word vectors.
  • Let’s pick “man” and “woman” for this exercise. Essentially, the difference between the vectors for “man” and “woman” is assumed to come from gender; the features other than “gender” should be similar for both words.
  • So, if we subtract the vector for “man” from the vector for “woman”, what remains mostly reflects gender. That’s g for this exercise.

So, we assume that g can be used to find which words carry gender bias, by measuring the cosine similarity between each word vector and g. “neutralize” then uses this g to remove the “gender bias” from a word vector.
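As a rough sketch of both steps, the code below measures a vector's cosine similarity with g and then removes its component along g by orthogonal projection. This is an assumption about what neutralization amounts to, not the assignment's actual code: the helper name neutralize_vector and the random vectors are hypothetical, and the notebook's own neutralize() may take different arguments.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def neutralize_vector(e, g):
    """Remove from e its component along g (hypothetical helper)."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g  # projection of e onto g
    return e - e_bias

# Random stand-ins for the bias direction and a word vector.
rng = np.random.default_rng(0)
g = rng.standard_normal(50)
e = rng.standard_normal(50)

before = cosine_similarity(e, g)
e_debiased = neutralize_vector(e, g)
after = cosine_similarity(e_debiased, g)

print("similarity with g before:", before)
print("similarity with g after: ", after)
```

After the projection is subtracted, the vector is orthogonal to g, so its cosine similarity with g is zero up to floating-point error, while its shape stays (50,).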

You may already understand the whole picture above and have just missed the one cell that calculates g, but that is the whole story. Hope this helps.

Yes, that clarifies it indeed. I somehow missed the cell computing ‘g’, but even so, your explanation is more helpful than just seeing its computed value. Thanks!