Considerations on Stemming and Pre-trained Word Embeddings

Hi,

Assuming I have an encoder-decoder seq2seq model (for abstractive text summarization) and I have some embedding layers before the encoder and the decoder.

If I decide to stem during the preprocessing phase, does that mean I should train the embedding layers myself and never use a pre-trained embedding layer such as GloVe, whose keys, as far as I can see, are not stemmed?

Should I perform some pre-processing of the keys from GloVe? I can foresee lots of problems if someone did that, since, given a root, which vector would you apply?

How should I think about this? What are the most common scenarios when it comes to deciding whether to stem and whether to use pre-trained embedding layers?

Thank you.

Hi @newboadki

I’m not sure if you realize it, but these are two separate questions/topics (stemming and pre-trained embeddings).

So, whether to stem or not is a different question from whether to use pre-trained embeddings or not. These two questions have nothing in common. An analogy to illustrate the point: should I use stemming, or should I use a GPU?

Stemming is an old technique that has its uses for certain applications, usually simple ones. Modern pre-processing pipelines make use of Byte Pair Encoding and other algorithms to create tokens (usually subwords).
So, if you are playing around with a toy problem, then the decision to stem or not to stem would come from simple trials (are your results better or not?). But if you are trying to solve some complex NLP problem, stemming is most probably not the solution.
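For a concrete feel of the difference, here is a minimal sketch. It assumes `nltk` and `transformers` are installed; the example words are my own and the exact outputs are indicative rather than guaranteed:

```python
# Stemming with NLTK's Porter stemmer: inflected forms collapse onto a root
# (which is sometimes not even a real word).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["banking", "banks", "running", "studies"]])
# roughly: ['bank', 'bank', 'run', 'studi']

# Subword tokenization with a pretrained WordPiece/BPE-style tokenizer:
# rare words are split into smaller pieces instead of being reduced to a root.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("unbelievably"))  # a list of subword pieces rather than one stem
```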

Another question you have is whether to use pre-trained embeddings or not. The answer again depends on your application. But one important thing to understand is that embeddings are an integral part of the model - in other words, you cannot take embeddings from one model and use them in another with different goals and a different architecture from those they were designed for.
So, again, if you’re just playing around with GloVe embeddings to find some correlations or whatever, then you can use them with caution (do not go looking for patterns after the analysis; on the contrary, first have a hypothesis and only then check whether it is true). And if you are building a more sophisticated application, you could use something like BERT. Note that its tokenizer is already fixed, but you can fine-tune the model for your application (an explanation on BERT embeddings).

Cheers

Thank you for your reply and the links you provided.

Some follow up questions.

I get that stemming and using pre-trained embeddings (or embeddings in general) are two separate topics.

And I get from your answer that using both together might not be common. But my original question was more about understanding the potential scenario of wanting to stem and also use a pre-trained word embedding like GloVe. I remember seeing some code samples on Kaggle (which I can’t find anymore) where an embedding layer was created from the GloVe embeddings. The way this was done was by placing the word vector coefficients at specific indexes in a matrix, where the indexes were the tokens assigned by the Tokenizer from TensorFlow, roughly as sketched below.
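A minimal reconstruction of that pattern from memory (the corpus, the GloVe file name, and the dimensions are assumptions on my part):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical corpus and GloVe file path -- both are stand-ins for illustration.
texts = ["the river bank was flooded", "she opened a bank account"]
glove_path = "glove.6B.100d.txt"
embedding_dim = 100

# The Keras Tokenizer assigns an integer index to each word (0 is reserved for padding).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1

# Load the GloVe vectors into a dict: word -> vector.
glove = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Place each word's GloVe vector at the row given by its tokenizer index.
# Note the lookup uses the raw (unstemmed) word: a stemmed root would often miss.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec

# The matrix becomes the initial (here frozen) weights of the embedding layer.
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
```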

It was this that made me link stemming and the construction of that embedding layer, since the word keys in GloVe are not stemmed.

It is worth mentioning that this example did not use stemming, hence my question of what should be done if the two were used together, or whether they are simply never used together.

Regarding your comment:

> you cannot take embeddings from one model and use them in another with different goals and a different architecture from those they were designed for.

I was referring to using a pre-trained embedding layer that would not be further modified during the training of my model. That should be OK, right?

Cheers

The saying goes: never say never :slight_smile: but in this case it is close to never. GloVe embeddings are not context dependent, which means that what the words can represent is already very constrained. On top of that, stemming would constrain the representations even further, losing information while saving only a small amount of space (a slightly smaller vocabulary).
To better understand what I’m saying, it helps to make things concrete with an example: in GloVe, the embedding of the word “bank” is the same for “river bank” and “bank account”, which means the embedding is already lacking some information about the word. Now think of what happens when you stem the word “banking”: it becomes the word “bank”, which further reduces the information the model can have, since it can no longer distinguish between “banking” and “bank”.
In other words, you would lose much more than you would gain. The model would start out poorly after stemming and would eventually converge to an inferior (in performance) model.
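To make the collapse concrete, here is a toy illustration (the vectors below are made up; in the actual GloVe vocabulary “bank” and “banking” each have their own entry):

```python
from nltk.stem import PorterStemmer

# Stand-in for the real GloVe dictionary (made-up numbers).
glove = {
    "bank":    [0.10, 0.32, -0.51],
    "banking": [0.07, 0.45, -0.48],
}

stemmer = PorterStemmer()
for word in ["bank", "banking"]:
    key = stemmer.stem(word)        # both stem to "bank"
    print(word, "->", key, glove[key])

# After stemming, the model sees the same vector for "bank" and "banking",
# even though GloVe could have told them apart.
```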

I’m not sure I understand your question.
There are different fine-tuning techniques: you can fine-tune only some layers (including the embedding layer, or even only the embedding layer), you can fine-tune all layers (including the embedding layer), or you can add and train new layers while leaving the embedding layer and the other original layers frozen, among other options. So it depends on the case.
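As a small hypothetical Keras sketch of the “frozen pre-trained embedding layer” option you described (the surrounding model, layer sizes, and the random matrix standing in for GloVe weights are all illustrative):

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 10_000, 100
pretrained_matrix = np.random.rand(vocab_size, embedding_dim)  # stand-in for real GloVe weights

# Frozen embedding layer: the pre-trained vectors are used but never updated.
embedding = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=False,
)

model = tf.keras.Sequential([
    embedding,
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# To fine-tune the embeddings later, unfreeze and recompile (typically with a lower learning rate):
# embedding.trainable = True
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
```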

Cheers

Thank you for your answer @arvyzukai !! I hadn’t noticed the new reply.