Why use LSTMs for NER?

Hello everyone, I have finished week 3 of course 3 and have just learnt about named entity recognition (NER). I understand that an algorithm such as an LSTM can be trained to predict named entities, but I do not understand fundamentally why LSTMs are of great help for that task.

  1. Isn’t it sufficient to store all words corresponding to named entities in a Python dictionary, where each word is mapped to its corresponding label? Why use an LSTM for that task when that simple approach seems to do the trick? I am sorry if that question seems a bit naive, but I am relatively new to natural language processing (NLP). In the case of part-of-speech (POS) tagging (in course 2, about probabilistic models for NLP), using dictionaries was a baseline approach that needed improvement, because the same word could correspond to different POS tags.
  2. Would hidden Markov models (cf. course 2) perform the task well? In course 2, HMMs were used for POS tagging, but it seems that they could be applicable to NER as well. Is there any information in the research literature or on the internet concerning the relative performance of these two models?

Hi @green_sunset

  1. A counter-question to illustrate the point: how would you map “Washington”? Is it a person's name? Is it a place? The surrounding sequence helps you decide, for example “President Washington …” versus “In Washington …”. An LSTM reads that context, which a plain dictionary lookup cannot, so LSTMs should be superior in this setting.

  2. This is a good question :slight_smile: Short answer - no.
    Long answer - a hidden Markov model conditions only on the previous state (the Markov assumption), so the result depends heavily on how you model that state (does it include some form of history? usually not). In some sense HMMs would be similar to RNNs if the state were “(the hidden state, the current input)” and the model acted on these as inputs. Usually HMMs are not used this way and the state is something static, for example a chess board configuration: it doesn’t matter how and when the pieces got to their squares (with the exception of whether the king has already moved, which matters for castling and can easily be incorporated into the state); what matters is the current configuration of the board.
    Language (and many other domains), on the other hand, depends on a complex sequence history, which is very important. For example, imagine we have a news article about the George Washington monument in Washington D.C., written by a reporter named Washington, and let’s say word number 76 in that article is “Washington”. How do you determine which Washington this word refers to? A vanilla HMM would definitely do badly, and even the biggest RNNs would struggle with this problem (the Transformer architecture is a step up but would also have its problems). The core of the problem is the highly complex sequence of word meanings, which cannot easily be modeled by statistical probabilities: it requires understanding and human logical rules (and also multimodal information that is not present in the text, but this is out of scope for NLP).
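
To make both points concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and made up for illustration: the one-entry gazetteer, the tag set, and all of the HMM probabilities. It is not the course's implementation. It tags “President Washington” and “In Washington” first with a dictionary lookup and then with a toy first-order HMM decoded by Viterbi.

```python
from typing import Dict, List

# Hypothetical gazetteer: one fixed label per word (the dictionary baseline).
GAZETTEER: Dict[str, str] = {"Washington": "LOC"}

def dictionary_tag(tokens: List[str]) -> List[str]:
    """Label each token by lookup alone; unknown words get 'O'."""
    return [GAZETTEER.get(tok, "O") for tok in tokens]

# Toy first-order HMM. All probabilities are invented for illustration.
TAGS = ["O", "PER", "LOC"]
START = {"O": 0.8, "PER": 0.1, "LOC": 0.1}
TRANS = {
    "O": {"O": 0.6, "PER": 0.2, "LOC": 0.2},
    "PER": {"O": 0.7, "PER": 0.2, "LOC": 0.1},
    "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2},
}
EMIT = {
    "O": {"President": 0.5, "In": 0.5},
    "PER": {"Washington": 0.4},
    "LOC": {"Washington": 0.6},
}

def viterbi(tokens: List[str]) -> List[str]:
    """Return the most probable tag sequence under the toy HMM."""
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{tag: (START[tag] * EMIT[tag].get(tokens[0], 0.0), None) for tag in TAGS}]
    for tok in tokens[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p][tag])
            score = best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(tok, 0.0)
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

person_context = ["President", "Washington"]
place_context = ["In", "Washington"]

print(dictionary_tag(person_context), dictionary_tag(place_context))
print(viterbi(person_context), viterbi(place_context))
```

With these numbers, both methods give “Washington” the same tag in both sentences. The lookup ignores context entirely, and the HMM only sees that the previous tag is `O` in each case, so the words “President” and “In” themselves never influence the decision. An LSTM, by contrast, carries a learned representation of the preceding words forward and can use it when tagging “Washington”.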