Why logistic regression for NLP when DL methods are superior?

After the Deep Learning Specialization, I thought that for most problems, whether structured data, images, or NLP, deep learning methods were superior to other conventional methods. The method suggested in the first week of the NLP course does not seem efficient. Is this method used at all in industry? For example, counting frequencies of positive- and negative-sentiment words looks very naive: the relationships between words are important, and word embeddings are meant for exactly that. A deep learning model using word embeddings would be a far superior method for sentiment analysis for these reasons. Why are logistic regression methods, which require preprocessing such as building a frequency dictionary, taught at all? I doubt they are used in any production setting now that deep learning methods have arrived.
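For context, the week-1 method being questioned can be sketched in a few lines. This is a hedged, minimal illustration (the toy tweets and labels are invented, not from the course): each text is reduced to two features, the summed corpus frequency of its words among positive examples and among negative examples, which are then fed to logistic regression.

```python
# Minimal sketch of frequency-dictionary features + logistic regression.
# Toy data invented for illustration only.
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

train = [("great movie loved it", 1), ("loved the acting great fun", 1),
         ("terrible plot hated it", 0), ("hated the boring terrible film", 0)]

# Build the frequency dictionaries: word counts in positive / negative texts.
pos_freq, neg_freq = Counter(), Counter()
for text, label in train:
    (pos_freq if label == 1 else neg_freq).update(text.split())

def features(text):
    words = text.split()
    return [sum(pos_freq[w] for w in words),   # total positive frequency
            sum(neg_freq[w] for w in words)]   # total negative frequency

X = np.array([features(t) for t, _ in train])
y = np.array([label for _, label in train])

clf = LogisticRegression().fit(X, y)
print(clf.predict(np.array([features("loved it great")])))  # [1] -> positive
```

The point of the trick is that the classifier only ever sees two numbers per text, so it trains and predicts almost instantly, at the cost of ignoring word order entirely.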

Hi @nmurugesh,

The idea is to teach the basics and then move on to advanced concepts. It is the same reason you cannot teach a child multiplication without first teaching them addition.

And sometimes, as Andrew would argue, the basic models perform much better on certain tasks than the advanced ones, so it is important to know them all.



Right! It’s also giving you the history of how the field developed. That way you’ll be better placed to appreciate how and why the newer methods are better. It’ll be exactly the same with Naive Bayes in Week 2. Younes even explains in the lectures why that method is not really used anymore: Naive Bayes does not take into account the order of the words, which is why the Sequence Models that we’ll learn about later are so much more powerful. Of course Naive Bayes is vastly cheaper to implement and it can be effective in some cases, so it is good to know about it. And it is interesting to see the methods that have been used over the history of the field.


Ok then it’s fine! I was worried that I might not be aware of the existence of some techniques that are widely used in production.

Thanks for the reply. I was not aware that logistic regression methods could also be used. I did a Udemy course on the TensorFlow Developer Certificate (though somewhat superficially!) and there was a case study (I guess on creating abstracts from journal papers, or something like that) in which a Naive Bayes method was used as a baseline to compare various DL models. And in the end, the Naive Bayes baseline built with scikit-learn had beaten all the DL models!! I started wondering about the effectiveness of DL methods, but thankfully persisted and later took Andrew Ng’s Deep Learning Specialization, where I learned a lot more about the potential of DL models.

(I guess the explanation was that the dataset was not properly labelled, or something like that… I did not delve deeper into the reasons…)

I think we should all know the basics first. Learn to walk before you can run, right?

Well, I thought that this course would further build on the NLP concepts already taught in the Deep Learning Specialization!

Also, I was not aware that many of these methods exist. Irrespective of that, I also thought that DL models had already superseded many of the age-old NLP methods. I am still not convinced that all of the methods taught here are relevant for understanding state-of-the-art NLP.

More particularly, I was studying the Viterbi algorithm today. Initially, I could not understand where or why this algorithm is used at all!! :slight_smile: Later, reading some articles, I found that it is used for POS tagging. Another article on Medium said that deep learning models can very well be used for POS tagging, as I had initially suspected, but that while a GRU/LSTM model can match the accuracy of the Viterbi algorithm, its computational cost is higher. Still, I am not convinced: do we not yet have a DL model that makes the Viterbi algorithm obsolete?!
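To ground the discussion, here is a minimal Viterbi decoder for POS tagging. The two tags, two-word vocabulary, and all the probabilities are invented for illustration, not taken from the course: given an HMM’s start, transition, and emission probabilities, Viterbi recovers the most likely tag sequence in O(length × n_tags²) time, which is exactly the cheapness the Medium article was pointing at.

```python
# Toy Viterbi decoding for a 2-tag, 2-word HMM (all numbers made up).
import numpy as np

tags = ["NOUN", "VERB"]
vocab = {"dogs": 0, "run": 1}

start = np.log(np.array([0.7, 0.3]))     # P(first tag)
trans = np.log(np.array([[0.3, 0.7],     # P(next tag | NOUN)
                         [0.8, 0.2]]))   # P(next tag | VERB)
emit  = np.log(np.array([[0.9, 0.1],     # P(word | NOUN)
                         [0.2, 0.8]]))   # P(word | VERB)

def viterbi(words):
    obs = [vocab[w] for w in words]
    n, T = len(tags), len(obs)
    score = np.zeros((T, n))              # best log-prob ending in each tag
    back = np.zeros((T, n), dtype=int)    # back-pointers
    score[0] = start + emit[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + trans + emit[:, obs[t]]  # n x n
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Follow back-pointers from the best final tag.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(["dogs", "run"]))  # ['NOUN', 'VERB']
```

Note that everything here is a couple of small matrix operations per word, which is why it runs comfortably on hardware where an LSTM tagger would be overkill.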

Have you ever tried to kill a fly with a bazooka? If all you know is a bazooka, then everything you see looks like a target.

Decision Trees, Naive Bayes, Logistic Regression etc. are all widely used techniques today. They are cheap to train, you can train them fast, you can use them on almost any device, and they can achieve pretty good accuracy depending on the task.

Do you have millions of dollars to train something like GPT-3? Even if you have, how will you make a profit? How will you be able to use it (it won’t run on your mobile phone in the near future)? How much time do you have for prediction? And there are many other drawbacks that come with big models.

For example, suppose you want to code up your own email sorter that files your emails into certain directories. Do you need a huge model for that? Most certainly not, unless this operation alone is business critical and makes you a lot of money. If having a couple of redundant copies of the same email is ok, then the “age-old NLP methods” will do just fine :slight_smile:
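Such an email sorter can be sketched in a handful of lines with one of those “age-old” methods. This is a hedged illustration, not a production recipe: the four emails and two folder labels are invented, and a real sorter would train on your own labelled mail.

```python
# Tiny email sorter: bag-of-words features + Multinomial Naive Bayes.
# Training data invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["meeting rescheduled to friday",
          "project deadline moved up",
          "50% off sale ends tonight",
          "exclusive discount just for you"]
folders = ["work", "work", "promotions", "promotions"]

# CountVectorizer builds the vocabulary; MultinomialNB learns word-given-folder
# probabilities; the pipeline chains them into one fit/predict object.
sorter = make_pipeline(CountVectorizer(), MultinomialNB())
sorter.fit(emails, folders)

print(sorter.predict(["deadline for the friday meeting"]))  # ['work']
```

Training this takes milliseconds and the model fits in a few kilobytes, which is the whole point of the fly-versus-bazooka argument above.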


In a course on Recommender Systems on Udemy, the author reiterated that it is a stated policy at Google to use only deep learning methods organization-wide for all kinds of machine learning problems (though companies like Netflix use every possible method, different ones for different recommender problems). So I thought that all problems, small and big, as long as they are machine learning problems, can be solved with DL models (and where huge data is required, there is transfer learning). Moreover, after understanding the Viterbi algorithm, it seemed to me like a fitting problem for a neural network.

Most scikit-learn problems can be solved with small datasets, even on the order of 100 samples or so (as it has always been with SAS or SPSS before). My conclusion was that “big data” problems are the ones better suited to DL methods.

In fact, I even thought I should study DL methods for unsupervised learning, like Deep Embedded Clustering or other such methods… :slight_smile:

Hi, I have come across TF-IDF vectorization in the scikit-learn library; I guess it is not covered in our NLP Specialization. What is the equivalent of this in DL models? Is it outdated even in shallow NLP machine learning? Can you throw some light on this?

As you might know, there are a number of “age-old” text vectorization techniques like one-hot, bag-of-words, n-grams, TF-IDF, PMI and others. They have their own strengths and weaknesses (depending on your goal, training data and many other factors).

To mention some, the strengths are that they are usually fast and cheap, some are easily interpretable (like BoW), and they sometimes even work better on small datasets.
The weaknesses are that they do not capture word/token positions (absolute or relative), and they can become too sparse (with big vocabularies) and computationally expensive. In other words, they do not capture “semantics” and are usually worse for big datasets and complex problems.

Loosely speaking, the Embedding layer (plus positional encoding) is the DL equivalent of “traditional” text vectorization. As you may have learned, there are a number of word embedding techniques like Word2Vec, GloVe, FastText, ELMo, BERT and many variants of these. They are usually adapted to specific problems and models (different approaches might work better for different applications, like text translation vs. text summarization etc.).

Their advantage is that they capture syntactic and semantic relationships, so they are more suitable for complex problems.
Their disadvantages are that they are slower, costlier and data-hungry, among other drawbacks.
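Mechanically, an Embedding layer is nothing mysterious; it can be sketched as a trainable lookup table mapping token ids to dense vectors. In this minimal illustration the vectors are random stand-ins; in a real model they are learned during training or initialized from something like Word2Vec or GloVe.

```python
# An Embedding layer as a plain lookup table (random stand-in values).
import numpy as np

vocab = {"cat": 0, "dog": 1, "mat": 2}
embed_dim = 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embed_dim))  # trainable in a real model

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    return embedding_matrix[ids]       # shape: (len(tokens), embed_dim)

vectors = embed(["cat", "mat"])
print(vectors.shape)  # (2, 4)
```

Contrast this with the sparse vectorizers above: every token maps to a short dense vector instead of a huge mostly-zero row, and similar words can end up with similar vectors once the table is trained.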

I would not call them outdated as I wouldn’t call Python lists outdated compared to JAX arrays, or for loops outdated compared to vectorized computations. They simply have their own place in the world :slight_smile:

In my view, your comments could be valid if we didn’t have pre-trained models. We have established SOTA pre-trained models available that we can load without fine-tuning and get decent answers, or we can fine-tune them on our dataset to get models more accurate than traditional machine learning methods.

My view is that we are learning these techniques because they are the established methods, and there are possibilities to discover more from basic methods. Moreover, DL methods, although accurate, seem like black boxes that don’t allow us to decode the techniques behind them.

We can also refine our traditional methods more and more to approach the highly accurate DL methods. This enhances our understanding of traditional ML and challenges us to add creative thought in tuning traditional ML toward the accuracy ranges reported for SOTA DL methods.