Fine-tuned BERT model, how to deal with abbreviations and English as non-first language?

:smile: Hello,

I am a software developer dabbling in ML without much experience at all. I am using the Hugging Face library to fine-tune a BERT model on some industry-related data to provide sentiment analysis for short pieces of text. I work for a healthcare company, and we are trying to get a sentiment score for each shift note written by a carer, but there are a few problems. I understand that it's hard to grasp where I'm coming from without seeing the training dataset, but on real-world data the analysis is correct around 87% of the time and is good at picking up context. However, the notes it gets wrong usually contain one of the problems below, so I'm wondering if there is a way to get around them or compensate for these errors:

  1. Abbreviations are used a lot for things like locations, businesses, and just general words/phrases. Do I need to add the abbreviations to the dataset? For example, a note might say "OT today", meaning they went to the Occupational Therapist today, which is a positive note. Would I need to train on the abbreviation as well as the expanded word?
    Also, when it is an abbreviation I wouldn't know (which may be specific to their company/location/industry), how could I help the model? For example, the note "Friendlies for a LMW fasting test." came back negative with 0.99 confidence, and there is nothing in it that should read as negative based on the training data.
  2. A lot of workers do not have English as their first language, so notes contain misspelled words, as well as the wrong word (which is a correctly spelt word) for the context of what they are trying to say, e.g. "I didnt no what that is" ("no" is supposed to be "know"). Does this have much of an impact on the performance of the model?
  3. Do numbers play a role in the analysis? There are notes like "Physio at 10" that come back negative, even though there is nothing in the dataset that should make this negative.
  4. Does punctuation play a big role in determining context as well? I get notes like this that come back negative with 0.85 confidence (names changed for privacy): "8:30 TL Kieran start shift Ben came and cut trees shapes with jigsaw Planing with John 11:00 TL Kieran finished shift". The person didn't input any punctuation, and it's hard to work out the context.
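For points 1 and 2, one low-effort mitigation is to normalize each note before it reaches the tokenizer: expand the abbreviations you do know and fix the most common misspellings with a lookup table. A minimal sketch in Python (the mappings below are hypothetical examples; you would build yours from the abbreviations and misspellings that actually appear in your shift notes):

```python
import re

# Hypothetical mappings -- build these from your own notes.
ABBREVIATIONS = {
    "OT": "occupational therapist",
    "TL": "team leader",
}
MISSPELLINGS = {
    "didnt": "didn't",
}

def normalize(note: str) -> str:
    """Expand known abbreviations and fix known misspellings before tokenization."""
    for abbr, full in ABBREVIATIONS.items():
        # \b word boundaries keep us from rewriting substrings inside other words.
        note = re.sub(rf"\b{re.escape(abbr)}\b", full, note)
    for wrong, right in MISSPELLINGS.items():
        note = re.sub(rf"\b{re.escape(wrong)}\b", right, note, flags=re.IGNORECASE)
    return note

print(normalize("OT today"))  # occupational therapist today
```

Note that a lookup table cannot fix context-dependent errors like "no" vs "know", since "no" is itself a valid word; that class of error really needs either a spell-corrector that uses context or more noisy examples in the training data.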

Happy for any sort of advice or direction to look in, as I am very new to the area and keen to explore more.
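For the unpunctuated notes in point 4, one cheap heuristic worth trying (an assumption on my part, not a standard technique) is to split the note on the timestamps carers already type, so the model scores shorter, more coherent chunks that you can then aggregate:

```python
import re

def split_on_timestamps(note: str) -> list[str]:
    """Split an unpunctuated shift note into chunks at times like '8:30' or '11:00'."""
    # The lookahead keeps each timestamp attached to the chunk it starts.
    parts = re.split(r"\s+(?=\d{1,2}:\d{2}\b)", note.strip())
    return [p for p in parts if p]

note = ("8:30 TL Kieran start shift Ben came and cut trees shapes with jigsaw "
        "Planing with John 11:00 TL Kieran finished shift")
for chunk in split_on_timestamps(note):
    print(chunk)
```

You could then run sentiment on each chunk and combine the scores (e.g. average, or flag if any chunk is strongly negative), rather than asking the model to make sense of one long unpunctuated string.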

This is a very difficult task if you do not have much experience.

Additional study is recommended.

Hello, thanks for the response.
What sort of materials/direction would you recommend to start looking into as I am keen to do additional study? :slight_smile:

To get the basics for understanding large language models, I recommend you start with the Machine Learning Specialization (an introductory course), then the Natural Language Processing Specialization (an intermediate course).