How would you classify an unseen token that contains both a digit and punctuation, for example?
In practice, you have to decide for yourself how to handle a case like this.
In the Lab notebook, the token you described would be classified as "--unk_digit--", because the checks are applied in this priority order and the first one that matches wins:
- if any character is a digit (this is the case you described)
- if any character is a punctuation character
- if any character is an upper case character
- if word ends with any noun suffix
- if word ends with any verb suffix
- if word ends with any adjective suffix
- if word ends with any adverb suffix
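
The priority order above can be sketched roughly as follows. This is only a minimal illustration, not the Lab's actual code: the function name `assign_unk`, the suffix lists, and every tag string other than the digit one are assumptions made for the example, and the tags are written with plain ASCII hyphens.

```python
import string

# Hypothetical suffix lists, for illustration only (the Lab may use different ones).
NOUN_SUFFIXES = ("ness", "ment", "tion", "ship", "ism", "ist")
VERB_SUFFIXES = ("ate", "ify", "ise", "ize")
ADJ_SUFFIXES = ("able", "ous", "ful", "ish", "ive")
ADV_SUFFIXES = ("ly", "ward", "wards", "wise")


def assign_unk(token: str) -> str:
    """Return an unknown-word tag for a token; the first matching rule wins."""
    # 1. Any digit anywhere in the token.
    if any(ch.isdigit() for ch in token):
        return "--unk_digit--"
    # 2. Any ASCII punctuation character.
    if any(ch in string.punctuation for ch in token):
        return "--unk_punct--"
    # 3. Any upper case character.
    if any(ch.isupper() for ch in token):
        return "--unk_upper--"
    # 4-7. Suffix checks, in the same order as the list above.
    if token.endswith(NOUN_SUFFIXES):
        return "--unk_noun--"
    if token.endswith(VERB_SUFFIXES):
        return "--unk_verb--"
    if token.endswith(ADJ_SUFFIXES):
        return "--unk_adj--"
    if token.endswith(ADV_SUFFIXES):
        return "--unk_adv--"
    # Nothing matched: fall back to a generic unknown tag (assumed name).
    return "--unk--"


# A token with both a digit and punctuation hits the digit rule first.
print(assign_unk("3.14"))
```

Because the digit check comes before the punctuation check, swapping those two `if` blocks is all it takes if you decide you want the opposite behaviour.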