Considerations on Stemming and Pre-trained Word Embeddings

Hi,

Assuming I have an encoder-decoder seq2seq model (for abstractive text summarization) and I have some embedding layers before the encoder and the decoder.

If I decide to stem during the preprocessing phase, does that mean I should train the embedding layers myself and never use a pre-trained embedding layer such as GloVe, whose keys, as far as I can see, are not stemmed?

Should I perform some pre-processing of the keys from GloVe? I can foresee lots of problems if someone did that, since, given a root, which vector would you apply?

How should I think about this? What are the most common scenarios when it comes to deciding whether to stem and whether to use pre-trained embedding layers?

Thank you.

Hi @newboadki

I’m not sure if you realize it, but these are two separate questions/topics (stemming and pre-trained embeddings).

So, whether to stem or not is a different question from whether to use pre-trained embeddings or not. These two questions have nothing in common. An analogy to illustrate the point: should I use stemming, or should I use a GPU?

Stemming is an old technique that has its uses for certain applications, usually simple ones. Modern pre-processing pipelines make use of Byte Pair Encoding and other algorithms to create tokens (usually subwords).
So, if you are playing around with a toy problem, then the decision to stem or not to stem would come from simple trials (are your results better or not?). But if you are trying to solve some complex NLP problem, stemming is most probably not the solution.
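For a concrete feel of the difference, here is a minimal sketch. It assumes `nltk` and `transformers` are installed; the example words are my own and the exact outputs are indicative rather than guaranteed:

```python
# Stemming with NLTK's Porter stemmer: inflected forms collapse onto a root
# (which is sometimes not even a real word).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["banking", "banks", "running", "studies"]])
# roughly: ['bank', 'bank', 'run', 'studi']

# Subword tokenization with a pretrained WordPiece/BPE-style tokenizer:
# rare words are split into smaller pieces instead of being reduced to a root.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("unbelievably"))  # a list of subword pieces rather than one stem
```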

Another question you have is whether to use pre-trained embeddings or not. The answer again depends on your application. But one important thing to understand is that embeddings are an integral part of the model - in other words, you cannot take embeddings from one model and use them in another with different goals and a different architecture from those they were designed for.
So, again, if you’re just playing around with GloVe embeddings to find some correlations or whatever, then you can use them with caution (do not go looking for patterns after the analysis; on the contrary, first have a hypothesis and only then check whether it is true). And if you are building a more sophisticated application, you could use something like BERT. Note that its tokenizer is already fixed, but you can fine-tune the model for your application (an explanation on BERT embeddings).

Cheers

Thank you for your reply and the links you provided.

Some follow up questions.

I get that stemming and using pre-trained embeddings (or embeddings in general) are two separate topics.

And I get from your answer that using both together might not be common. But my original question was more about understanding the potential scenario of wanting to stem and also use a pre-trained word embedding like GloVe. I remember seeing some code samples on Kaggle (which I can’t find anymore) where an embedding layer was created from the GloVe embeddings. The way this was done was by placing the word vector coefficients at specific indexes in a matrix, where the indexes were the tokens assigned by the Tokenizer from TensorFlow, roughly as sketched below.
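A minimal reconstruction of that pattern from memory (the corpus, the GloVe file name, and the dimensions are assumptions on my part):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical corpus and GloVe file path -- both are stand-ins for illustration.
texts = ["the river bank was flooded", "she opened a bank account"]
glove_path = "glove.6B.100d.txt"
embedding_dim = 100

# The Keras Tokenizer assigns an integer index to each word (0 is reserved for padding).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1

# Load the GloVe vectors into a dict: word -> vector.
glove = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Place each word's GloVe vector at the row given by its tokenizer index.
# Note the lookup uses the raw (unstemmed) word: a stemmed root would often miss.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec

# The matrix becomes the initial (here frozen) weights of the embedding layer.
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
```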

It was this that made me link stemming and the construction of that embedding layer, since the word keys in GloVe are not stemmed.

It is worth mentioning that this example did not use stemming, hence my question of what should be done if the two were used together, or whether they are simply never used together.

Regarding your comment:

> you cannot take embeddings from one model and use them in another with different goals and a different architecture from those they were designed for.

I was referring to using a pre-trained embedding layer that would not be further modified during the training of my model. That should be OK, right?

Cheers

The saying goes: never say never :slight_smile: but in this case it is close to never. GloVe embeddings are not context dependent, which means that what the words can represent is already very constrained. On top of that, stemming would constrain the representations even further, losing information while saving only a small amount of space (a slightly smaller vocabulary).
To better understand what I’m saying, it helps to make things concrete with an example: in GloVe, the embedding of the word “bank” is the same for “river bank” and “bank account”, which means the embedding is already lacking some information about the word. Now think of what happens when you stem the word “banking”: it becomes the word “bank”, which further reduces the information the model can have, since it can no longer distinguish between “banking” and “bank”.
In other words, you would lose much more than you would gain. The model would start out poorly after stemming and would eventually converge to an inferior (in performance) model.
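To make the collapse concrete, here is a toy illustration (the vectors below are made up; in the actual GloVe vocabulary “bank” and “banking” each have their own entry):

```python
from nltk.stem import PorterStemmer

# Stand-in for the real GloVe dictionary (made-up numbers).
glove = {
    "bank":    [0.10, 0.32, -0.51],
    "banking": [0.07, 0.45, -0.48],
}

stemmer = PorterStemmer()
for word in ["bank", "banking"]:
    key = stemmer.stem(word)        # both stem to "bank"
    print(word, "->", key, glove[key])

# After stemming, the model sees the same vector for "bank" and "banking",
# even though GloVe could have told them apart.
```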

I’m not sure I understand your question.
There are different fine-tuning techniques: you can fine-tune only some layers (including the embedding layer, or even only the embedding layer), you can fine-tune all layers (including the embedding layer), or you can add and train new layers while leaving the embedding layer and the other original layers frozen, among other options. So it depends on the case.
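As a small hypothetical Keras sketch of the “frozen pre-trained embedding layer” option you described (the surrounding model, layer sizes, and the random matrix standing in for GloVe weights are all illustrative):

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10_000, 100
pretrained_matrix = np.random.rand(vocab_size, embedding_dim)  # stand-in for real GloVe weights

# Frozen embedding layer: the pre-trained vectors are used but never updated.
embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=False,
)

model = tf.keras.Sequential([
    embedding,
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# To fine-tune the embeddings later, unfreeze and recompile (typically with a lower learning rate):
# embedding.trainable = True
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
```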

Cheers

Thank you for your answer @arvyzukai !! I hadn’t noticed the new reply.