Stop words and negation in C1_W1_lecture_nb_01_preprocessing

I ain’t too happy about the stop-word library removing the concept of negation. Please process my sentence above and remove all stop words: am I happy, or ain’t I happy? I would like to know.

Hey @ajeancharles,
Why exactly are you asking us to do this, when the lecture notebook has given you the code for this? Just add your example in the notebook, and run the code, and you will get the answer to your question.


We can think, ponder, and ask questions. This is a thinking and reasoning business. The concept of stopwords is flawed. If I say, “I am not happy,” when you remove “I,” “am,” and “not,” you are left with a statement whose sentiment polarity is positive, while the original sentence had a negative sentiment polarity.
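To make the problem concrete, here is a minimal sketch in Python. The stop-word set below is a hand-picked excerpt, hard-coded so the snippet runs without downloading any corpus; NLTK's actual English stop-word list likewise contains “i”, “am”, and “not”.

```python
# A hand-picked excerpt of common English stop words; note that "not"
# is included, just as it is in NLTK's full English list.
stop_words = {"i", "am", "is", "are", "the", "a", "an", "not", "no", "nor"}

sentence = "I am not happy"
tokens = [w for w in sentence.lower().split() if w not in stop_words]

print(tokens)  # ['happy'] -- the negation is gone and the polarity flips
```

The filtered output keeps only “happy”, so a downstream sentiment model sees a positive statement where the writer expressed a negative one.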

There is an ethical responsibility in teaching and coding. Ng likes to make that point.

Here is another discussion of the stopword concept and its limitations.

Thank you. Maybe we should add a warning and disclaimer in the lecture and exercise in question.

The other general point to make here is about the structure of the NLP specialization. In the first course, they give us an overview of early techniques in the field, which have their limitations. They make this point at multiple points in the lectures. E.g., in Week 2 on Naive Bayes, Younes points out its serious limitations: it literally does not take the ordering of the words into account. He gives examples of how you can completely change the meaning of a sentence by simply reordering the same words. So this is an intrinsic limitation, and the instructors do not paper over that fact. As we proceed through the specialization, we will learn about Sequence Models and eventually Attention Models, which are the SOTA at this point for dealing with language. Of course they are much more sophisticated and incredibly powerful compared to Logistic Regression or Naive Bayes, but also much more expensive to create and train, so there may be some value in at least being aware of some of the “classic” methods.
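To illustrate the word-order point with a quick sketch (a toy example of my own, not one from the lecture): Naive Bayes sees only bag-of-words features, so two sentences built from the same words in different orders, with opposite meanings, produce identical feature vectors.

```python
from collections import Counter

# Two sentences with exactly the same words, opposite meanings.
s1 = "the movie was good not bad"
s2 = "the movie was bad not good"

# Bag-of-words counts are all a Naive Bayes classifier looks at.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # True -- identical features, so both sentences get the same score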

Linguistics and computational linguistics predate this approach. I am reading the classics too, and I went back all the way to De Saussure. Despite the quick success of the “modern approach” (and lord knows, I am successful at work because of it), it does not feel aesthetically pleasing.

All of linguistics is thrown away; somehow, this cannot be smart. No wonder the models are so big and require trillions of parameters. Inductive bias argues for using linguistics somewhere. We create language models as if we came from another planet, with no leveraging of the structure of human languages.

At some point, linguistics will have to be brought back. By the way, language is not sequential. Phrases are little trees with the verb as the root. Once one understands that, there is no need for position embeddings; you need to embed the roles {subject, direct object, etc.}. Semantic structure is a hierarchical tower. If you raise children, it is very obvious. We even have an expression in English: “break it down.” We should be looking for representations of universal (across all cultures) semantic primes and build everything hierarchically on top of them.

Here is an elaboration of my full opinion,

This whole area is incredibly interesting and has been a huge area of debate and ongoing evolution over the years. But on the larger idea you point to here: just throwing tons of data at the model and letting it train itself, without any attempt at “feature engineering” or providing it with prior knowledge or assumptions about structure or semantics, is the method that seems to work best. Compare the evolution of game-playing strategies from Deep Blue to AlphaGo to AlphaZero. The “just throw all the data at the model” strategy has proven vastly more successful in that space. There was a really interesting discussion of that evolution on Ezra Klein’s podcast last week (the July 11th episode), where he interviewed Demis Hassabis of DeepMind. Worth a listen.

Of course we can’t predict the future: maybe you’re right and the next generation of NLP solutions will start to incorporate preknowledge of linguistics and will be even more successful than the LLMs we have today.

If you accept the fact that we are biological machines: do you have children?
Have you ever observed a child discover language and learn new words and new concepts?
We have been doing this for a few million years now. No matter what the language is or the degree of sophistication of our societies, we all learn a language.

I have concluded for some time now that my son’s brain (or mind) is not doing gradient descent, and yet he can learn new things. On the other hand, once you finish training a model, it cannot learn anymore. And it is not going to cost me millions of dollars to send him to school (hopefully not).

How do we do it? (I am optimistic that we will figure it out.) And can we make a machine that learns as we learn? We may have to raise it and send it to school, perhaps special schools.

Of course I accept that we are biological machines. I have children and now have a grandchild who is almost 2 years old and it’s fascinating to watch him learn language.

But the question is how do you design a piece of software that works the same way that the human brain does? I don’t doubt that we will eventually figure out a closer approximation to that than we have today, but I am not a researcher and have no idea what the solution will look like or when it will arrive. I would agree with the general statement that intelligence should be “substrate independent”, meaning that you should be able to achieve it on a computer that is not constructed of meat. But how to do that and how long it will take to figure out, I have no idea.

But neuroscience is not a solved problem either, right? Can you explain Consciousness? If you can, then you’d better be writing up the explanation and then waiting for the call from the Nobel Committee.

And if you want to go down that road, a human child does not have any “built-in” preconceived knowledge of linguistics either, right? They learn language in a way that I would argue is exactly analogous to the “just throw all the data at the model” method you were expressing doubts about earlier. Your mother points and says, “Oh, look at the kitty.” How is that different?

I guess you could argue that perhaps there are evolved prebuilt structures in the brain that somehow encode some form of linguistics, but my guess is that no neuroscientist can actually point to such structures at the present state of the art. If you have more knowledge in this area, please give us some links to read.

Sounds like reinforcement learning - not NLP at all.

Actually, the critical language areas in the brain are known: 1) Broca’s area is associated with speech production and articulation; 2) Wernicke’s area is associated with speech understanding. Neuroscience learned this by studying patients who develop speech disorders after a stroke or a heart attack. These afflictions are called aphasias: one involves losing the ability to understand speech; the other, the ability to produce coherent, meaningful speech.

“Pointing,” as you point out (no pun intended), is very important to humans. Children in our species quickly develop a “theory of mind,” which can attribute a “mind” faculty to other members of the species. Most likely, the first form of communication was movement-based. The brain seems to be primarily wired to decide what to do next, and evolution seems to have hijacked those pathways to overlay auditory communication. Try to talk without using your hands or making facial expressions, even if you are alone in a room on the phone.

Sure, there are parts of the brain that have been identified as being involved in speech and language, but the question is whether a newborn child has any “pre-wired” notion of linguistics or the structure of language. Or can they learn everything they need in order to communicate using language just by “training” their LLM on the sample data that they hear and see? But perhaps that is not even an answerable question.

Our brain is not fully formed at birth. There are a lot of reasons for that; one limitation is the birth canal. While there is a basic architecture determined by genetics, a lot of sculpting is done by experience: experience and knowledge sculpt our brains. As far as primary sensory perception is concerned, there are “critical periods”: you have to be exposed to specific experiences at a critical time in your development, lest all the wiring supporting that perception die out. If your eyes are not exposed to visual stimulation very early, you will never develop vision. If you are not exposed to speech very early, you may never learn how to speak.

You literally evolve your brain by experience, or the lack of it, and the process is very Darwinian: if a network is mapped to an experience (internal or external), it is reinforced; otherwise it dies out. (This might partially explain why repetition is so important in human learning.)

Hey @ajeancharles and @paulinpaloalto Sir,
This has been a really intriguing thread to read, delving quite deep into the territories of linguistics and how a child learns a language.

Just wanted to add a small thing to this, with regards to the current NLP models that we are learning in the course.

Please note that the stop words are not a fixed set. This is why, in most tutorials, people remove “not” and other negation words from the stop-word list before filtering. So, in your example, we would only eliminate “I” and “am”, and the remaining sentence, “not happy”, has the same polarity as the original. I hope this helps.
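A minimal sketch of that fix (the stop-word set is hard-coded here as an assumption so it runs standalone; with NLTK you would start from `nltk.corpus.stopwords.words('english')` instead):

```python
# Excerpt of an English stop-word list that, like NLTK's, contains "not".
stop_words = {"i", "am", "is", "are", "the", "a", "not", "no", "nor"}

# Keep negation words by removing them from the stop-word set first.
negations = {"not", "no", "nor", "never"}
stop_words -= negations

sentence = "I am not happy"
tokens = [w for w in sentence.lower().split() if w not in stop_words]

print(tokens)  # ['not', 'happy'] -- the negative polarity survives
```

Because “not” is no longer in the stop-word set, the filtered sentence keeps its negation and its original sentiment.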
