Converting a line to a tensor by characters instead of words

What is the rationale for converting characters to tensors instead of words to tensors, as we did in the previous assignment?

Hi @Reggie_Cushing

In NLP there are various techniques to pre-process/tokenize text (or natural language in general). Character-based models are one of these approaches, and the course creators wanted you to be familiar with it.
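For intuition, here is a minimal sketch of character-level conversion in Python, assuming a simple `ord()`-based mapping with an end-of-sequence integer (the exact vocabulary and padding scheme in the assignment may differ):

```python
def line_to_tensor(line, EOS_int=1):
    """Map each character of a line to an integer (its Unicode code point)."""
    tensor = [ord(c) for c in line]  # one integer per character
    tensor.append(EOS_int)           # mark the end of the sequence
    return tensor

print(line_to_tensor("Hi!"))  # [72, 105, 33, 1]
```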

Having said that, word/subword-to-tensor language models are more common and usually achieve higher accuracy at lower computational cost. Character-based RNNs, for example, require much bigger hidden layers to capture long-term dependencies, which drives up the cost: a paragraph that is 32 tokens long when tokenized by words could be around 300 tokens long when tokenized by characters. On the other hand, think about what happens when a word is misspelled :slight_smile: It also depends on the language: many Lithuanian words, for example, share the same stem but have different endings, and treating every surface form as a separate token is not very efficient (subword tokenization is usually the best approach there).
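To make the length difference concrete, here is a quick comparison using naive tokenization (real tokenizers are more sophisticated, but the ratio is similar):

```python
sentence = "Character based RNNs require much bigger hidden layers."

word_tokens = sentence.split()  # naive word-level tokenization
char_tokens = list(sentence)    # character-level tokenization

print(len(word_tokens))  # 8 tokens
print(len(char_tokens))  # 55 tokens
```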

In other words, it's a tool you should know as an NLP specialist. In my experience, I've trained a model on entity matching (is "BMW X3 xDrive20d Steptronic (08/19 - 01/20)" the same car model as "BMW X4 xDrive30i xLine Steptronic (07/19 - 06/20)" or "BMW X3 xDrive30d Advantage Steptronic (04/18 - 06/18)"?) where there were a lot of misspelled words, and certain characters mattered a lot while others didn't. Character-based LSTMs worked pretty well at the time.
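A hypothetical sketch of that kind of setup, in PyTorch: encode each string character by character with an LSTM, then compare the fixed-size encodings. The class and function names here (`CharLSTMEncoder`, `encode`) are my own illustration, not the actual model from that project:

```python
import torch
import torch.nn as nn

class CharLSTMEncoder(nn.Module):
    """Encodes a string of character ids into a fixed-size vector."""
    def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character codes
        emb = self.embed(char_ids)
        _, (h_n, _) = self.lstm(emb)
        return h_n[-1]  # (batch, hidden_dim) final hidden state as the encoding

def encode(text, encoder, max_code=127):
    # clamp code points so they fit the small illustrative vocabulary
    ids = torch.tensor([[min(ord(c), max_code) for c in text]])
    return encoder(ids)

enc = CharLSTMEncoder()
a = encode("BMW X3 xDrive20d Steptronic", enc)
b = encode("BMW X4 xDrive30i xLine Steptronic", enc)
# cosine similarity between encodings; meaningless until the model is trained
print(torch.cosine_similarity(a, b))
```

Trained with pairs of matching/non-matching strings, an encoder like this can stay robust to typos, since a single wrong character only perturbs a few time steps rather than producing an entirely unknown word token.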

All in all, it usually depends on the task at hand. If you know the technique, you can decide whether to use it; if you don't, you don't have that luxury :slight_smile:

Cheers
