I’m trying to build a sentiment analysis classifier on a tweets dataset, with 13 classes for the problem. One of the classes in my dataset is “enthusiasm”, and some words in that class are capitalized. Do you guys think that lowercasing the words during preprocessing could affect the performance of my model because of this?
I think you’re really asking how much it can affect the performance (because it surely has some effect). If I had to guess blindly, I’d say not much, but that is hard to know beforehand. It depends on how you will use the model: how different the “real world” inputs you will feed it are from this dataset.
What results do you get on the validation set when you train on lowercased text versus the original capitalization? Is the signal from capitalized words strong enough to show up there, both for “enthusiasm” and in overall performance? If yes, and your future inputs will look similar, then it is probably worth not lowercasing the words.
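Before training anything, a quick way to see whether capitalization carries any signal at all is to measure how often each class uses fully uppercase words. Here is a minimal sketch using only the standard library. It assumes “capitalized” means words written entirely in capitals, and the function name `uppercase_rate_by_label` and the toy tweets are made up for illustration, not taken from your dataset.

```python
from collections import defaultdict


def uppercase_rate_by_label(texts, labels, min_len=2):
    """Return {label: share of alphabetic words written fully in capitals}.

    Hypothetical helper: words shorter than min_len (e.g. "I") and tokens
    containing non-letters are ignored. Labels with no counted words are
    left out of the result.
    """
    upper = defaultdict(int)
    total = defaultdict(int)
    for text, label in zip(texts, labels):
        for token in text.split():
            # Drop punctuation attached to either end of the token.
            word = token.strip(".,!?;:\"'()")
            if len(word) < min_len or not word.isalpha():
                continue
            total[label] += 1
            if word.isupper():
                upper[label] += 1
    return {label: upper[label] / total[label] for label in total}


# Toy data, invented for this example only.
texts = [
    "I LOVE this SO much!",
    "so ready for the WEEKEND",
    "this is fine I guess",
    "so tired of waiting",
]
labels = ["enthusiasm", "enthusiasm", "neutral", "sadness"]

rates = uppercase_rate_by_label(texts, labels)
for label, rate in sorted(rates.items()):
    print(f"{label}: {rate:.2f}")
```

If the rate for “enthusiasm” is clearly higher than for the other classes on your real data, that is a hint that lowercasing throws away useful information. The actual decision should still come from comparing validation scores for the two preprocessing variants.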