Guidance on Optimizing Text Similarity and Reporting with Transformers and Advanced NLP Techniques

Hello,
I’m Volodymyr,
I’m working on a project where I need to compute similarity scores between text fields from two types of data: articles and company profiles. The goal is to identify which companies are most relevant to each article based on various text fields (e.g., name, description, intro, industries). My pipeline involves calculating similarity scores, aggregating them, and storing results for in-depth analysis.
Are there updated models that might perform better for text similarity in a business context?
Any guidance, resources, or examples would be much appreciated! I’m especially interested in hearing about recent advancements in NLP models or efficient similarity search methods that could enhance this setup. Thank you!
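
For reference, the aggregation step of the pipeline described above (one similarity score per text field, combined into one score per company) could look roughly like the sketch below. The field weights and the helper names `aggregate_field_scores` and `rank_companies` are hypothetical and are not taken from the actual project. The weights would need tuning against article/company pairs that are known to match.

```python
from typing import Dict, List, Tuple

# Hypothetical weights per text field (assumption: not specified in this thread).
FIELD_WEIGHTS: Dict[str, float] = {
    "name": 0.4,
    "description": 0.3,
    "intro": 0.2,
    "industries": 0.1,
}


def aggregate_field_scores(field_scores: Dict[str, float],
                           weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted mean of per-field similarity scores.

    Fields missing from `field_scores` (for example a profile without an
    intro) are skipped and the remaining weights are renormalized.
    """
    used = {field: w for field, w in weights.items() if field in field_scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        return 0.0
    return sum(field_scores[f] * w for f, w in used.items()) / total_weight


def rank_companies(scores_by_company: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (company, aggregated score) pairs, best match first."""
    ranked = [(company, aggregate_field_scores(scores))
              for company, scores in scores_by_company.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Renormalizing over the fields that are present keeps a sparse profile from being penalized just because one field is empty. Whether that is desirable depends on the data.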

hi @Dachuk

On cosine similarity, I went through the short course “Embedding Models: From Architecture to Implementation”. I felt it needed more detail, but if your NLP text analysis is only about similarity, it should help you understand things better.

When it comes to text analysis more broadly, the CrewAI course or tool is another good one to try. It has one agent that first provides a response, a next agent that cross-checks or summarizes the previous agent’s output, and a final agent that provides a concise response.

These two short courses are really worth trying.

Regards
DP

@Deepti_Prasad Thanks for the reply and for sharing the courses.
But my request is more about an internal solution, so that we do not share or publish information to external APIs.
We tried different pre-trained models (sentence_transformers) and have similarity results, but I’m not sure about the weighting and correctness of the results. Our results are also far from what we expected.
We’re trying to find the root issue (the model we chose or the way we use it).

Is it a question-answering type of model?

Can you briefly describe what kind of model you created, what results you got, and what results you were expecting?

I guess temperature was included in your model!!

We’re using sentence_transformers models (for example, “sentence-transformers/paraphrase-MiniLM-L6-v2”). As input we have a vector for the article text and a vector for each company description (for all companies), and then we use cosine_similarity to compute the cosine similarity between samples.
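
A minimal sketch of that setup, for anyone following along. It assumes the sentence-transformers package is installed. The model weights are downloaded once and inference then runs locally. The helper names `embed`, `cosine` and `top_companies` are made up for illustration, and the cosine computation is written in plain Python so the ranking logic is easy to inspect.

```python
import math
from typing import Dict, List, Sequence, Tuple


def embed(texts: List[str],
          model_name: str = "sentence-transformers/paraphrase-MiniLM-L6-v2") -> List[List[float]]:
    """Embed texts locally with sentence-transformers.

    The import lives inside the function so the rest of this sketch can be
    used without the library installed.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_companies(article_vec: Sequence[float],
                  company_vecs: Dict[str, Sequence[float]],
                  k: int = 5) -> List[Tuple[str, float]]:
    """Return the k companies whose vectors are closest to the article vector."""
    scored = [(name, cosine(article_vec, vec)) for name, vec in company_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Comparing the ranking, rather than the raw scores, against a handful of articles with known correct companies is usually a quick way to tell whether the problem is the model or the way the scores are combined.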

Did you try sentiment analysis?

No, we didn’t try sentiment analysis, because we didn’t think we needed it for now.
Why do you think it could be useful for similarity score calculation?

Although I don’t know your company profiles, from what you described your cosine similarity is computed between an article and a company profile. An article can contain details that give a descriptive analysis as well as an industry focus, so beyond cosine similarity you would also get to understand why an article matched a company profile.

For example, an article about fashion in Paris is narrowly focused, and there may be multiple company profiles that match it on cosine similarity. Sentiment analysis would try to find which company profile matches the specifics of fashion in Paris more closely, such as designer pattern, choice, description, or intro.

Then, using both cosine similarity and sentiment analysis, you build a probability matrix that gives you output closer to what you expect when matching articles to company profiles. This is hypothetical, as I still don’t understand what kind of data you are working with, and the NLP analysis also depends on that.

Regards
DP

Thank you for the idea!

All our texts and company profiles are documents and descriptions of law, banking, and government firms. We also try to analyze and rely on the keywords provided in our company profiles (Location, Organization, Person, etc.). Because of that, we didn’t think about sentiment analysis for our kind of input. We thought that sentiment is only about negative/positive emotional ‘coloring’.
But I suppose it’s a good option to try!
Regards,
Volodymyr

In that case, do incorporate NER (Named Entity Recognition).

We’ve used a few NER models, like dslim/bert-base-NER and other similar ones, but for different companies they return quite close results for the same text input, even if the text is definitely about only one organization.
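
One way to make the NER output discriminate between companies is to check which known company names actually appear among the extracted organization entities, instead of comparing entity sets as a whole. The sketch below assumes the Hugging Face transformers package and its token-classification pipeline with `aggregation_strategy="simple"`, which returns dicts containing `entity_group` and `word`. The helper names and the case-insensitive exact-match rule are illustrative assumptions, not what was actually run in this thread.

```python
from collections import Counter
from typing import Dict, Iterable, List


def extract_entities(text: str, model_name: str = "dslim/bert-base-NER") -> List[dict]:
    """Run a local NER pipeline and return grouped entities.

    The import is inside the function so the matching logic below can be
    used and tested without transformers installed.
    """
    from transformers import pipeline
    ner = pipeline("ner", model=model_name, aggregation_strategy="simple")
    return ner(text)


def org_mention_counts(entities: Iterable[dict],
                       company_names: Iterable[str]) -> Dict[str, int]:
    """Count ORG entities that match a known company name.

    Matching is a case-insensitive exact match (an assumption); real data
    would likely need alias lists or fuzzy matching.
    """
    known = {name.lower(): name for name in company_names}
    counts: Counter = Counter()
    for ent in entities:
        if ent.get("entity_group") != "ORG":
            continue
        key = ent["word"].strip().lower()
        if key in known:
            counts[known[key]] += 1
    return dict(counts)
```

A company that is never mentioned by name gets no entry at all, which separates it clearly from the one organization the text is actually about.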

Then probably create your own NER model on top of the pre-trained BERT base model, and make sure you are very specific about the algorithm you select when creating the name, description, and organisation matches.

@Dachuk, honestly

I would like to see at least your experimental results. It’s not that I am interested in your code or data, but to give a better suggestion one needs to know what you have actually worked on in practice.

Try the Microsoft E5 embedding models with fine-tuning; they usually offer good performance. A plain BERT model without fine-tuning will generally be less useful.
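
One practical detail when trying E5: the published E5 checkpoints expect each input to start with a role prefix, `query: ` or `passage: `, and similarity scores tend to degrade without it. The sketch below assumes the `intfloat/e5-base-v2` checkpoint loaded through sentence-transformers. The helper names are made up, and fine-tuning is not covered here.

```python
from typing import List


def with_e5_prefix(texts: List[str], role: str) -> List[str]:
    """Prepend the E5 role prefix ("query" or "passage") to each text."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return [f"{role}: {text}" for text in texts]


def embed_e5(texts: List[str], role: str,
             model_name: str = "intfloat/e5-base-v2"):
    """Embed prefixed texts locally; vectors are L2-normalized.

    The import is inside the function so the prefix helper can be used
    without sentence-transformers installed.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(with_e5_prefix(texts, role), normalize_embeddings=True)
```

For article-to-company matching, one option is to treat articles as `query` and company descriptions as `passage`. For purely symmetric similarity, using `query` on both sides is the other common choice. Which works better here is something to check on labeled pairs.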