Not clear how LDA is useful or the best tool

When I look at the words in different generated topics I find it hard to see what the theme of each really is. And I worry that LDA is too crude since it ignores word order, handles negation and synonyms poorly, and generally ignores semantics.
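To make the word-order and negation points concrete, here is a minimal sketch of the bag-of-words view that LDA starts from. The helper name `bag_of_words` and the example sentences are made up for illustration. Real pipelines tokenize more carefully, and some stop-word lists would drop "not" altogether.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings produce identical bags.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
same_bag = (a == b)

# Negation survives only as one extra token count, detached from what it negates.
c = bag_of_words("the service was good")
d = bag_of_words("the service was not good")
extra_tokens = d - c  # tokens that appear in d but not in c
```

A model that sees only these counts cannot distinguish the first pair at all, and sees the second pair as nearly the same document.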

My impression is that there are much better tools today that generate embeddings for entire sentences or posts. These can be clustered and the measure of how close some new text is to one of these clusters is better than what LDA provides. An old but freely available example is Google’s Universal Sentence Encoder (available on TensorFlow Hub). No doubt newer models are even better.
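The "how close is a new text to a cluster" idea can be sketched roughly as follows. Everything here is an assumption for illustration: the 3-dimensional vectors are hand-written stand-ins for real sentence embeddings (which typically have hundreds of dimensions), the cluster names are hypothetical, and the cluster memberships are assumed to come from some earlier clustering step such as k-means.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_cluster(vec, centroids):
    """Return (cluster name, similarity) for the centroid most similar to vec."""
    scores = {name: cosine(vec, c) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy embeddings, already grouped by an assumed clustering step.
clusters = {
    "cluster_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "cluster_b": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {name: centroid(vecs) for name, vecs in clusters.items()}

# Stand-in for the embedding of a new post.
new_post = [0.7, 0.2, 0.1]
label, score = nearest_cluster(new_post, centroids)
```

With a real encoder, the only change would be where the vectors come from; the centroid and similarity steps stay the same.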

My impression is that all this about removing stop words, lemmatization, etc. was state of the art many years ago, but is it still competitive with newer techniques?

Thanks for your comment @toontalk! You’re certainly correct that LDA is not a cutting-edge technique. I would love to see different approaches to topic modeling for the data in this lab, so please do share if you have explored other alternatives. In this case, as in many of the other labs in this program, we went with simpler and hopefully more interpretable methods. We did this in the hope of giving less technical folks a means of understanding what’s going on in the labs, but also in the spirit of what Robert talks about: favoring well-understood, more interpretable techniques in crisis situations over SOTA algorithms (“Land Cruiser” solutions). But that’s not to say there’s no room for improvement in the model results, so keep exploring!


Thanks. That all makes sense. I guess there is a trade-off in having the course teach technical details that are useful in the short term but are likely to be superseded in the medium and long term.

Also, I think that at a minimum the Land Cruiser metaphor should have been discussed in the context of LDA: at least a few sentences noting that it is simpler and better understood than more modern techniques, which have many advantages such as capturing the semantics of the texts much better.