How is LSTM connected with image captioning?

Is it something like an OCR system? If so, I still don’t understand how that is connected with LSTMs :sweat_smile:


Hello @someone555777 ,

An LSTM is used as the decoder in an image captioning system.
As the decoder, the LSTM takes the features extracted from an image by a convolutional neural network (CNN) as input and generates a sequence of words that describes the image.
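
To make the encoder/decoder idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny vocabulary, the sizes, the random untrained weights, the `fake_cnn_features` vector standing in for real CNN output, and the choice to initialise the hidden state with the image features (one common design among several). Because the weights are untrained, the resulting caption is meaningless; only the structure of the loop matters.

```python
import math
import random

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
VOCAB = ["<start>", "<end>", "a", "dog", "cat", "on", "grass", "sofa"]
HIDDEN = 4  # size of the LSTM state and of the toy image feature vector
EMBED = 3   # size of each word embedding

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random matrix as a list of rows (stands in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Untrained parameters; a real system learns these from image/caption pairs.
EMBEDDINGS = rand_matrix(len(VOCAB), EMBED)
GATE_WEIGHTS = {gate: rand_matrix(HIDDEN, EMBED + HIDDEN) for gate in "ifog"}
OUTPUT_WEIGHTS = rand_matrix(len(VOCAB), HIDDEN)


def lstm_step(x, h, c):
    """One LSTM time step (biases omitted for brevity)."""
    z = x + h  # concatenate the current input with the previous hidden state
    i = [sigmoid(v) for v in matvec(GATE_WEIGHTS["i"], z)]    # input gate
    f = [sigmoid(v) for v in matvec(GATE_WEIGHTS["f"], z)]    # forget gate
    o = [sigmoid(v) for v in matvec(GATE_WEIGHTS["o"], z)]    # output gate
    g = [math.tanh(v) for v in matvec(GATE_WEIGHTS["g"], z)]  # candidate values
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def generate_caption(image_features, max_len=10):
    """Greedy decoding: emit one word per step until <end> or max_len."""
    h = list(image_features)  # image features initialise the hidden state
    c = [0.0] * HIDDEN
    token = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        h, c = lstm_step(EMBEDDINGS[token], h, c)
        scores = matvec(OUTPUT_WEIGHTS, h)
        token = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return words


# Stand-in for the feature vector a CNN would extract from an image.
fake_cnn_features = [0.2, -0.5, 0.9, 0.1]
caption = generate_caption(fake_cnn_features)
print(caption)
```

Note that the decoder feeds each predicted word back in as the next input, so the caption is produced word by word rather than read off the image as in OCR.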

With regards,
Nilosree Sengupta

So, OK, as I thought, it is something like OCR. But why an LSTM specifically? As I understand it, any model can be used after the CNN, can’t it?

Hello @someone555777 ,

There are several reasons for preferring LSTMs:

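  1. LSTMs capture long-term dependencies, which is important for captioning because each word in a caption can depend on words generated much earlier, not only on the previous one.

  2. By capturing both short- and long-term dependencies, LSTMs can keep the description consistent with the image features and with the words already generated.

  3. They can handle sequences of arbitrary length, so captions do not need a fixed number of words.

  4. They tend to generate natural, fluent descriptions with largely correct grammar.

  5. In practice they usually perform better on this task than plain RNNs, which struggle with vanishing gradients on longer sequences.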
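
For reference, these are the standard LSTM update equations (textbook notation, not specific to any one captioning system). The cell state $c_t$ is what carries information across many time steps, which is where the long-term memory in point 1 comes from:

$$
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $x_t$ is the input at step $t$ (the previous word's embedding in a captioning decoder), $h_t$ is the hidden state, and $\odot$ is element-wise multiplication. When the forget gate $f_t$ stays close to 1, $c_{t-1}$ passes through almost unchanged, so earlier context is preserved.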

Hope this helps.

With regards,
Nilosree Sengupta

So, is it about cases where there is a lot of text in the image?