Reducing bias in word embeddings vs performance

Hi all,

I really hope this doesn’t come across as an “I’m not XX, but…” post; I’d just like to understand more about the problem of reducing gender/ethnicity/other bias in word embeddings.

If we create an NLP system trained on large datasets that will most likely include social/cultural bias, such as around gender or ethnicity, will we be removing “useful” learning by removing that bias? Example: in my country, choice of profession is typically gender-divided between data science and psychology (male and female respectively). If we cut out the, unfortunately real, link between “male”:“data scientist” or “female”:“psychologist”, will we make our system perform worse?
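For concreteness, the “cutting out” I have in mind is something like the neutralize step from Bolukbasi et al.’s hard-debiasing approach: estimate a gender direction from a definitional word pair, then project each profession vector onto the subspace orthogonal to it. A minimal sketch with made-up toy vectors (the words are from my example above, but all the numbers and the 4-dimensional embeddings are purely illustrative):

```python
import numpy as np

# Toy 4-dimensional embeddings; real embeddings (e.g. GloVe) would be
# 50-300 dimensional. All vector values here are invented for illustration.
emb = {
    "he":             np.array([ 1.0, 0.2, 0.1, 0.0]),
    "she":            np.array([-1.0, 0.2, 0.1, 0.0]),
    "data_scientist": np.array([ 0.6, 0.8, 0.3, 0.1]),
    "psychologist":   np.array([-0.5, 0.7, 0.4, 0.1]),
}

def gender_direction(emb):
    """Estimate the bias direction from a definitional pair (he - she)."""
    d = emb["he"] - emb["she"]
    return d / np.linalg.norm(d)

def neutralize(v, direction):
    """Remove the component of v along the bias direction
    (the 'neutralize' step of hard debiasing)."""
    return v - np.dot(v, direction) * direction

g = gender_direction(emb)
for word in ("data_scientist", "psychologist"):
    before = np.dot(emb[word], g)
    after = np.dot(neutralize(emb[word], g), g)
    print(f"{word}: projection on gender axis {before:+.2f} -> {after:+.2f}")
```

After neutralizing, both profession vectors have (numerically) zero projection on the gender axis, which is exactly the “removed link” my question is about: whatever predictive signal lived along that axis is gone too.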

I understand that my example is quite benign compared to other examples of bias in word embeddings, and I wholeheartedly support trying to remove bias so as not to reinforce negative aspects of our society and culture. If the answer is yes, that we will reduce the accuracy of our systems, I hope that will only be true in the short run and that in the future the need to reduce such toxic bias will disappear :crossed_fingers:


I have asked myself the same question, so I can only give my opinion, which is: I think it may depend on your task. Often the question is: do you want to mimic reality (which unfortunately is often sexist, racist, etc.), or do you deliberately not want those biases?

Furthermore, I’m curious about bias that is not really bias once you include domain knowledge. For example, most medical diseases can affect people of all genders, but some are (much) more likely to affect males, and some females (–> my conclusion: keep the ‘bias’, as it’s a medical fact and not a language bias). On the other hand, medical studies are often dominated by male subjects, so training on medical papers may skew everything towards males (–> unsure whether that bias should be removed).
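One way to inspect whether an association reflects domain knowledge before deciding to remove it is a WEAT-style association score: compare a word’s mean cosine similarity to male versus female attribute words. A sketch with hypothetical toy vectors (the two example diseases really do have skewed prevalence, but every number below is invented):

```python
import numpy as np

# Hypothetical toy embeddings; in practice you would load pretrained vectors.
emb = {
    "he":           np.array([ 1.0, 0.1, 0.0]),
    "man":          np.array([ 0.9, 0.2, 0.1]),
    "she":          np.array([-1.0, 0.1, 0.0]),
    "woman":        np.array([-0.9, 0.2, 0.1]),
    "hemophilia":   np.array([ 0.7, 0.5, 0.2]),  # X-linked, mostly affects males
    "osteoporosis": np.array([-0.6, 0.6, 0.1]),  # more common in females
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(word, male=("he", "man"), female=("she", "woman")):
    """Mean similarity to male attribute words minus mean similarity to
    female attribute words. Positive => male-leaning association."""
    m = np.mean([cos(emb[word], emb[a]) for a in male])
    f = np.mean([cos(emb[word], emb[a]) for a in female])
    return m - f

for disease in ("hemophilia", "osteoporosis"):
    print(disease, round(association(disease), 2))
```

A score like this only tells you that an association exists and in which direction; deciding whether it is a medical fact worth keeping or an artefact of skewed training data is still a human judgment call.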

Actually, that’s a hot topic and an important question, and not an easy one to answer, I think…


@Eslgr, yes, that is a good way of looking at it! Thank you for your input. I definitely think there will be cases where the bias should be kept. But as you point out, there might be underlying reasons why such bias exists in the first place (skewed representation of gender/race/socioeconomic status, etc.).

In any case, it will be important to consider ethics and take a holistic view of the application and methodology, to avoid both undesirable bias and the removal of useful information from the input. Being a newbie to the field, my current focus is on the basic understanding and implementation of different models; how ML works before how ML fits into society (typical engineer/scientist, I guess :sweat_smile:). Looking forward to applying what I’ve learnt in this specialization and getting some real-life experience :slightly_smiling_face: