Guidance on Optimizing Text Similarity and Reporting with Transformers and Advanced NLP Techniques

Dachuk · November 7, 2024, 2:50pm

Hello,
I’m Volodymyr,
I’m working on a project where I need to compute similarity scores between text fields from two types of data: articles and company profiles. The goal is to identify which companies are most relevant to each article based on various text fields (e.g., name, description, intro, industries). My pipeline involves calculating similarity scores, aggregating them, and storing results for in-depth analysis.
Are there updated models that might perform better for text similarity in a business context?
Any guidance, resources, or examples would be much appreciated! I’m especially interested in hearing about recent advancements in NLP models or efficient similarity search methods that could enhance this setup. Thank you!

Deepti_Prasad · November 7, 2024, 3:23pm

hi @Dachuk

On cosine similarity: I went through the short course "Embedding Models: From Architecture to Implementation". I felt it could have been more detailed, but if your NLP text analysis is only about similarity, it should help you understand the topic better.

For text analysis more broadly, the crewAI course (or the tool itself) is another good one to try: one agent produces a first response, the next agent cross-checks or summarizes it, and a final agent returns a concise answer.

Both short courses are well worth trying.

Regards
DP

Dachuk · November 7, 2024, 4:23pm

@Deepti_Prasad Thanks for the reply and for sharing the courses.
But my request is more about an internal solution, so that we don’t share or publish information to external APIs.
We tried different pre-trained models (sentence_transformers) and we do get similarity results, but I’m not sure about the weights and the correctness of the results. And our results are far from what we expected :)
We’re trying to find the root issue (the model we chose, or the way we use it).
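
One way to make the question about weights concrete is to keep one similarity score per text field and combine them with explicit weights. The sketch below is a minimal illustration and not the pipeline used in this thread: the field names come from the first post, while the weight values and the function name `aggregate_score` are hypothetical placeholders that would need tuning against labelled article/company pairs.

```python
def aggregate_score(field_scores, weights):
    """Weighted mean of per-field similarity scores.

    Fields whose score is None (for example, an empty profile field)
    are skipped, and the remaining weights are renormalized.
    """
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        score = field_scores.get(field)
        if score is None:
            continue
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical weights, for illustration only.
FIELD_WEIGHTS = {"name": 0.4, "description": 0.3, "intro": 0.2, "industries": 0.1}

# Per-field cosine similarities for one article/company pair (made-up values).
scores = {"name": 0.9, "description": 0.5, "intro": None, "industries": 0.2}
combined = aggregate_score(scores, FIELD_WEIGHTS)
```

Keeping the per-field scores next to the combined value also makes it easier to see which field drives an unexpected match.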

Deepti_Prasad · November 7, 2024, 4:42pm

Is it a question-answering type of model?

Can you briefly describe what kind of model you created, what results you got, and what results you were expecting?

I guess temperature was included in your model!

Dachuk · November 7, 2024, 5:20pm

We’re using sentence_transformers models (for example, “sentence-transformers/paraphrase-MiniLM-L6-v2”). As input we have the vector of the article text and the vectors of the company descriptions (for all companies), and then we use cosine_similarity to compute the cosine similarity between them.
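
The ranking step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the toy 3-dimensional vectors stand in for real sentence embeddings, and the company ids and the helper name `rank_companies` are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_companies(article_vec, company_vecs, top_k=3):
    """Return (company_id, score) pairs sorted by descending similarity."""
    scored = [
        (company_id, cosine_similarity(article_vec, vec))
        for company_id, vec in company_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical data).
article = [1.0, 0.0, 1.0]
companies = {
    "bank_a": [1.0, 0.0, 0.9],
    "law_b": [0.0, 1.0, 0.0],
    "gov_c": [0.5, 0.5, 0.5],
}
ranking = rank_companies(article, companies)
```

Looking at the full ranked list, not only the top score, helps show whether the scores for different companies are bunched closely together.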

Deepti_Prasad · November 7, 2024, 6:06pm

Did you try sentiment analysis?

Dachuk · November 8, 2024, 8:32am

No, we didn’t try sentiment analysis, because we thought we didn’t need it for now.
Why do you think it could be useful for similarity score calculation?

Deepti_Prasad · November 8, 2024, 10:10am

I don’t know your company profiles, but from what you described, your cosine similarity is computed between an article and a company profile. An article can contain all kinds of details, both descriptive content and an industry focus, so an additional analysis would also help you understand why an article matched a company profile, rather than relying on the cosine similarity score alone.

For example, say an article focuses on fashion in Paris, and several company profiles match it based on cosine similarity. The sentiment analysis would then try to find which company profile matches the specifics of fashion in Paris more closely, such as designer patterns, choices, description, or intro.

Then, using both cosine similarity and sentiment analysis, you build a probability matrix that gives you output closer to what you expect when matching articles to company profiles. This is hypothetical, as I still don’t know what kind of data you are working with, and NLP analysis also depends on the kind of data.

Regards
DP

Dachuk · November 8, 2024, 10:58am

Thank you for the idea!

All our texts and company profiles are documents and descriptions from Law/Banking/Government firms. We also try to base the analysis on keywords provided in our company profiles (Location, Organization, Person, etc.). Because of that, we didn’t think about sentiment analysis for our kind of input. We thought that sentiment is only about negative/positive emotional ‘coloring’.
But I suppose it’s a good option to try!
Regards,
Volodymyr

Deepti_Prasad · November 8, 2024, 12:13pm

In that case, do incorporate NER (Named Entity Recognition).

Dachuk · November 8, 2024, 2:36pm

We’ve used a few NER models, such as dslim/bert-base-NER and similar ones, but they return quite close results for different companies on the same text input, even if the text is definitely about only one organization.
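
A possible way to get a signal that separates companies more sharply is to match the organization mentions found by NER against each company's known names, instead of comparing embeddings. The sketch below assumes the entities have already been extracted as (text, label) pairs with an "ORG" label; the alias table, the suffix list, and the function names are hypothetical and only illustrate the idea.

```python
import re

# Illustrative list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"llp", "ltd", "inc", "plc", "llc"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def org_mention_counts(entities, company_aliases):
    """Count ORG mentions per company by exact match on normalized aliases."""
    alias_to_company = {
        normalize(alias): company_id
        for company_id, aliases in company_aliases.items()
        for alias in aliases
    }
    counts = {company_id: 0 for company_id in company_aliases}
    for text, label in entities:
        if label != "ORG":
            continue
        company_id = alias_to_company.get(normalize(text))
        if company_id is not None:
            counts[company_id] += 1
    return counts


# Hypothetical NER output for one article and a hypothetical alias table.
entities = [
    ("Acme Bank Ltd.", "ORG"),
    ("Kyiv", "LOC"),
    ("ACME Bank", "ORG"),
    ("Smith & Partners LLP", "ORG"),
]
aliases = {
    "acme": ["Acme Bank"],
    "smith": ["Smith & Partners"],
    "other": ["Other Corp"],
}
counts = org_mention_counts(entities, aliases)
```

Counts like these could then be used as one more field score next to the embedding similarity, since a text about a single organization yields a non-zero count for that company only.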

Deepti_Prasad · November 8, 2024, 2:43pm

Then probably create your own NER model based on the pre-trained BERT base model, and be very specific about the algorithm you select for matching names, descriptions, and organisations.

@Dachuk, honestly,

I would like to see at least your experimental results. I am not interested in your code or data, but to give a better suggestion one needs to know what you have actually tried in practice.

JPRMohnish · November 9, 2024, 2:13pm

Try the MSFT E5 embedding model with fine-tuning; it usually offers good performance. A plain BERT model without fine-tuning will always be less useful.
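
One practical detail when trying E5: the published E5 checkpoints expect each input to start with a role prefix ("query: " or "passage: "), and for symmetric similarity between two texts the "query: " prefix is normally used on both sides. The sketch below shows this with the sentence-transformers package. The checkpoint name `intfloat/e5-base-v2` is an assumption (the thread does not name a specific one), and the helper names are hypothetical. The model is loaded only when `embed` is called, so encoding then happens locally once the weights are available.

```python
E5_MODEL_NAME = "intfloat/e5-base-v2"  # assumed checkpoint, not named in the thread


def with_e5_prefix(texts, role="query"):
    """Prepend the E5 role prefix ("query: " or "passage: ") to each text."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return [f"{role}: {text.strip()}" for text in texts]


def embed(texts, role="query", model_name=E5_MODEL_NAME):
    """Encode texts with an E5 model; needs sentence-transformers installed."""
    # Imported lazily so the prefix helper works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Unit-length vectors make the dot product equal to cosine similarity.
    return model.encode(with_e5_prefix(texts, role), normalize_embeddings=True)


prefixed = with_e5_prefix(["  Court ruling on bank merger "], role="query")
```

Leaving the prefixes out tends to degrade results with these models, so it is worth checking before concluding that a model does not fit the data.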
