Examples that lead to the stem sunni in the first lab, which could also be wrong

here2infinity · August 30, 2024, 6:25pm

In the first lab, the stemming part seems to be incomplete in explanation.

They say:

However, in some cases, the stemming process produces words that are not correct spellings of the root word. For example, happi and sunni.

They give examples for happier and happiness but nothing for the stem sunni. Could you give examples? The only one I could think of was sunniness, which I was able to confirm in Merriam-Webster.

In what cases would sunni be the wrong stem? They say that happ conflicts with happen and that makes sense but I have no idea what the other conflicts are or how to find them.

TMosh · August 30, 2024, 6:27pm

It doesn’t say that sunni is the wrong stem.
It says that sunni is not the correct spelling for the related root word “sunny”.

here2infinity · August 30, 2024, 7:21pm

Is sunniness the only word that could lead to that stem?

Nevermnd · August 30, 2024, 7:31pm

@here2infinity I see your point here.

It makes sense when they give us ‘happi’, but I can’t think of any other words besides what you suggested that would start with ‘sunni’ either and agree that example is a little confusing…

But as they also mention, the gist here is that it is looking for the most ‘common’ stems. Thus for the variants of the word ‘happy’, ‘happi’ comes up more often.

By incorrect spellings of the root word, they mean: Well, happier, happiness, etc. The real ‘root word’ here is ‘happy’, but that’s not what the stemming algorithm comes up with, so in that sense it ‘misspells’ the root.

I believe the stemming algorithm is also taking into account the underlying word vector as well.

Thus their point, ‘happ’ might be even a better stem, but it is also too close (in spelling) to ‘happen’. Recall the word vectors (with regards to cosine distance) exhibit some shared relationship between similar words-- But when we stem, it is not ‘just’ a matter of spelling.

This, at least is my understanding.

Edit: I’ll leave what I wrote above for reference, but I was wrong about word vectors being involved at all. I clicked the link they provide in the lab for the Porter Stemming algorithm, and started going through the Python translation:
https://tartarus.org/martin/PorterStemmer/python.txt

It is basically removing common suffixes, and it also counts vowel-consonant sequences in what remains to decide whether a given suffix may be removed.

Even though I have taken NLP, and for quite a while was an ESL teacher, unfortunately one thing I am not is a linguist… So off the top of my head I couldn’t tell you exactly how that measurement is used.
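
For readers wondering how that count is used: in Porter's published description, a stem is reduced to the pattern [C](VC)^m[V], and m (the "measure") gates whether a suffix rule is allowed to fire. The sketch below is an illustrative reimplementation under that description, not the code from the linked python.txt; the helper names is_consonant and measure are made up for this example, and lowercase input is assumed.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only
    when it follows a consonant. Assumes lowercase input."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or right after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-run/consonant-run (VC) pairs in stem."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Collapse runs of the same class, e.g. "CVCCV" becomes "CVCV"
    collapsed = ""
    for ch in pattern:
        if not collapsed or collapsed[-1] != ch:
            collapsed += ch
    return collapsed.count("VC")


print(measure("tree"))      # 0
print(measure("troubles"))  # 2
print(measure("happi"))     # 1
print(measure("sunni"))     # 1
```

Under the published rules, the rule written as (m>0) NESS removes -ness when what remains has m > 0, so sunniness would become sunni. The -er rule is written as (m>1) ER, and since happi and sunni both have m = 1, that rule on its own would leave happier and sunnier untouched.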

here2infinity · August 30, 2024, 7:41pm

Nevermnd:

It makes sense when they give us ‘happi’, but I can’t think of any other words besides what you suggested that would start with ‘sunni’ either and agree that example is a little confusing…

But as they also mention, the gist here is that it is looking for the most ‘common’ stems. Thus for the variants of the word ‘happy’, ‘happi’ comes up more often.

By incorrect spellings of the root word, they mean: Well, happier, happiness, etc. The real ‘root word’ here is ‘happy’, but that’s not what the stemming algorithm comes up with, so in that sense it ‘misspells’ the root.

I believe the stemming algorithm is also taking into account the underlying word vector as well.

Thus their point, ‘happ’ might be even a better stem, but it is also too close (in spelling) to ‘happen’. Recall the word vectors (with regards to cosine distance) exhibit some shared relationship between similar words-- But when we stem, it is not ‘just’ a matter of spelling

Thanks. I haven’t gotten to word vectors or cosine distance yet.

TMosh · August 30, 2024, 7:43pm

I think the main issue here is that “sunni” is a really bad example of a stemmed word.

Nevermnd · August 30, 2024, 7:43pm

@here2infinity please see my updated post above-- I was wrong about the word vector part.

Deepti_Prasad · August 30, 2024, 8:01pm

What about sunnier?
If you compare with happier and happiness, the counterparts would be sunnier and sunniness.

But note that the lab is saying the stems are incorrect spellings of the root words, as the other mentor mentioned too.

The correctly spelled root words would be happy and sunny.

Stemming merely removes common suffixes from the end of word tokens. So they are mainly trying to explain what stemming does in NLP text processing: for the words happier and happiness, and sunniness, stemming leads to happi and sunni, which are incorrect spellings.

Regards
DP
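
One more source of sunni, assuming the lab uses a Porter-style stemmer: step 1c of the published algorithm rewrites a final y to i when the rest of the word contains a vowel (the algorithm description gives happy -> happi and sky -> sky as examples). By that rule, sunny itself would give sunni. Below is a minimal sketch of just that one rule, not a full stemmer; step_1c and is_consonant are illustrative names, lowercase input is assumed, and library implementations such as NLTK's may apply slightly different conditions.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only
    when it follows a consonant. Assumes lowercase input."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step_1c(word):
    """Apply only Porter's step 1c: (*v*) Y -> I."""
    stem = word[:-1]
    # (*v*) means the part before the final letter contains a vowel
    has_vowel = any(not is_consonant(stem, i) for i in range(len(stem)))
    if word.endswith("y") and has_vowel:
        return stem + "i"
    return word


for w in ["happy", "sunny", "sky"]:
    print(w, "->", step_1c(w))
# happy -> happi
# sunny -> sunni
# sky -> sky
```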
