Examples of words that lead to the stem sunni in the first lab, and cases where that stem could be wrong

In the first lab, the stemming part seems to be incomplete in explanation.

They say:

However, in some cases, the stemming process produces words that are not correct spellings of the root word. For example, happi and sunni.

They give examples for happier and happiness but nothing for the stem sunni. Could you give examples? The only one I could think of was sunniness, which I was able to confirm on Merriam-Webster.

In what cases would sunni be the wrong stem? They say that happ conflicts with happen, and that makes sense, but I have no idea what the other conflicts are or how to find them.

It doesn’t say that sunni is the wrong stem.
It says that sunni is not the correct spelling for the related root word “sunny”.

Is sunniness the only word that could lead to that stem?

@here2infinity I see your point here.

It makes sense when they give us ‘happi’, but I can’t think of any other words besides what you suggested that would start with ‘sunni’ either and agree that example is a little confusing…

But as they also mention, the gist here is that it is looking for the most ‘common’ stems. Thus for the variants of the word ‘happy’, ‘happi’ comes up more often.

By incorrect spellings of the root word, they mean this: take happier, happiness, etc. The real ‘root word’ here is ‘happy’, but that’s not what the stemming algorithm comes up with, so in that sense it ‘misspells’ the root.

I believe the stemming algorithm is also taking the underlying word vector into account.

Thus their point: ‘happ’ might be an even better stem, but it is also too close (in spelling) to ‘happen’. Recall that word vectors (with regard to cosine distance) exhibit some shared relationship between similar words. But when we stem, it is not ‘just’ a matter of spelling.

This, at least, is my understanding.

*I’ll leave what I have just stated for further edification, but I was wrong about word vectors being involved at all. I clicked the link they provide in the lab for the Porter Stemming algorithm, and started going through the Python translation:
https://tartarus.org/martin/PorterStemmer/python.txt

It basically removes common suffixes and keeps a count of vowel-consonant sequences to determine where to stem.

Even though I have taken NLP, and for quite a while was an ESL teacher, unfortunately one thing I am not is a linguist… So off the top of my head I couldn’t tell you how that measurement is used.
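For anyone curious, the count mentioned above looks like what Porter's description calls the "measure" m: a word is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels, and suffix rules only fire when m is large enough. Here is a minimal sketch of that count. The function names are my own and are not taken from the linked Python translation, so treat it as an illustration rather than the lab's code.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter-style test: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or right after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-run -> consonant-run transitions, i.e. m in [C](VC)^m[V]."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    return sum(
        1 for prev, cur in zip(pattern, pattern[1:]) if prev == "V" and cur == "C"
    )

for stem in ["tree", "sunni", "happ", "troubles"]:
    print(stem, measure(stem))
```

Both sunni and happ come out with m = 1 here. A rule written as "(m>0) NESS -> (nothing)" would therefore be allowed to strip -ness from sunniness, because the part left over (sunni) has m greater than 0.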


Thanks. I haven’t gotten to word vectors or cosine distance just yet.

I think the main issue here is that “sunni” is a really bad example of a stemmed word.


@here2infinity please see my updated post above. I was wrong about the word vector part.


What about sunnier? If you compare with happier and happiness, it can be sunnier and sunniness.

But note that the lab says the stems are incorrect spellings of the root words, as the other mentor mentioned too.

The correct spellings of the root words would be happy and sunny.

Stemming merely removes common suffixes from the end of word tokens. So basically they are trying to explain what stemming does in text processing for NLP: for the words happier and happiness, and for sunniness, stemming leads to happi and sunni, which are incorrectly spelled.
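To make the suffix removal concrete, here is a toy sketch with only two Porter-style rules: a final y becomes i when the rest of the word contains a vowel, and -ness is dropped when the remaining stem has measure m > 0. It is not the full Porter algorithm, which has many more rules and steps. The name `toy_stem` is made up for this post, and as a simplification the helper treats y as a consonant everywhere.

```python
import re

def has_vowel(stem):
    # Simplification (assumption): only a, e, i, o, u count as vowels here.
    return any(ch in "aeiou" for ch in stem)

def measure(stem):
    # Number of vowel-run + consonant-run sequences (the "m" in Porter's rules).
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def toy_stem(word):
    word = word.lower()
    # Rule in the spirit of Porter step 1c: (*v*) Y -> I
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    # Rule in the spirit of Porter step 3: (m>0) NESS -> (nothing)
    if word.endswith("ness") and measure(word[:-4]) > 0:
        word = word[:-4]
    return word

for w in ["happy", "happiness", "sunny", "sunniness"]:
    print(w, "->", toy_stem(w))
# happy -> happi
# happiness -> happi
# sunny -> sunni
# sunniness -> sunni
```

So under these two rules the plain word sunny also ends up as sunni, which would make sunniness not the only source of that stem.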

Regards
DP
