Advice on Information Retrieval Implementation with Naive Bayes

Hi!
NLP Course 1 Week 2 mentions information retrieval as one of the useful applications of Naive Bayes.

Could you suggest resources to implement this?

I have a list of key words from a text that I want to use to search and rank tweets by relevance. What would be the best / easiest way to achieve this?
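For a rough sense of the simplest possible baseline, here is a keyword-overlap ranking sketch (the tweets are made up for illustration; a real system would score with TF-IDF, BM25, or Naive Bayes class probabilities instead of raw overlap):

```python
# Minimal sketch: rank tweets by how many of the query keywords they
# contain. This is a crude relevance score, just to fix ideas.
keywords = {"nlp", "bayes", "classifier"}

tweets = [
    "just trained a naive bayes classifier for my nlp homework",
    "beautiful sunset at the beach today",
    "bayes rule never gets old",
]

def score(tweet, keywords):
    """Count how many query keywords appear in the tweet."""
    words = set(tweet.lower().split())
    return len(words & keywords)

# Sort tweets from most to least keyword overlap.
ranked = sorted(tweets, key=lambda t: score(t, keywords), reverse=True)
print(ranked[0])  # the tweet containing all three keywords
```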

Any advice or resources welcome!

Hey @ksenia-5,
Welcome, and we are glad that you could become a part of our community :partying_face:

I guess we all have access to the Internet for this. A quick search led me to the following resources. You can check them out, and find other resources on the web accordingly.

Let us know if these help.

Cheers,
Elemento

Hi @ksenia-5

To complement @Elemento’s answer:

If you’re just having fun and experimenting, there are datasets specifically designed for that purpose. You can easily find one that suits your idea and dive right in.

I understand that my response might not directly answer your question, but if you’re looking for a more serious approach, Naive Bayes may not be the best choice for information retrieval nowadays.
If you want a straightforward solution, you should consider prompt engineering.
However, if you’re after a more advanced and sophisticated method, exploring vector databases would be a worthwhile option.
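To give a flavour of the idea behind vector databases: documents and queries are turned into vectors and ranked by similarity. Here is a toy sketch using bag-of-words counts in place of learned embeddings (real systems use embedding models and approximate nearest-neighbour search, not word counts):

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector as a dict: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "naive bayes for text classification",
    "vector databases store embeddings for similarity search",
    "prompt engineering with large language models",
]

query = "similarity search with embeddings"
qv = vectorize(query)
# Rank all documents by cosine similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
print(ranked[0])
```

A vector database does essentially this, but over millions of dense embedding vectors with an index that avoids comparing against every document.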

Cheers

Hey @arvyzukai,
Thanks for sharing about vector databases, I didn't know that something like this existed. By the way, could you please elaborate a bit more on how we can use prompt engineering for information retrieval?

Say I have 1000 documents and 5-10 keywords, based on which I need to extract the 10 most relevant documents along with their rankings. Can we use something like ChatGPT to achieve this task (using either their website or API), or do we need some special ChatGPT plug-in for this? I assume we can't use a prompt that embeds the content of all 1000 documents, since the prompt would become excessively large in that case?

Cheers,
Elemento

Hey @Elemento

To be honest, I have no real experience using ChatGPT for information retrieval (or a real use case, for that matter), so these are just my thoughts :slight_smile:

I don't know if you have explored the “Building Systems with the ChatGPT API” course, but it contains some ideas that could be used for information retrieval (in particular, “Classification” and “Chain of Thought Reasoning”).

But to answer your question, it really depends on what you're trying to achieve with the system: how important accuracy is, how important speed is, how you provide value (make money, etc.), and how big the documents are (50 pages, or just 1 tweet, etc.).
But let's imagine your example with 1000 news articles (around 1 page each) and 5-10 keywords representing news categories or similar.

In this case, here’s a possible approach:

  1. Generating Keywords and Indexing:

    • We can use ChatGPT to come up with 7-10 keywords for each news article.
    • These keywords can be stored in a smart way, maybe organized hierarchically, to create an index (index creation might involve many iterative improvements).
    • We’ll also keep track of which keywords correspond to which news articles.
  2. Querying and Ranking:

    • To find relevant documents, we can use a simple SQL query to filter based on the desired news categories.
    • From this subset, we’ll present the documents to ChatGPT and ask which ones match the keywords the best.
    • Based on ChatGPT’s response, we’ll rank and prioritize the documents to show the most relevant ones first.
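The two steps above can be sketched end to end. In this sketch, `extract_keywords` is only a trivial stand-in for the ChatGPT API call (not a real API), and SQLite plays the role of the keyword index:

```python
import sqlite3

# Placeholder for a real ChatGPT API call; a trivial keyword
# extractor stands in so the pipeline runs end to end.
def extract_keywords(article):
    return sorted(set(article.lower().split()))[:7]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (doc_id INTEGER, keyword TEXT)")

articles = {
    1: "central bank raises interest rates",
    2: "local team wins championship final",
    3: "new interest in renewable energy markets",
}

# Step 1: generate keywords per article and index them,
# keeping track of which keywords belong to which article.
for doc_id, text in articles.items():
    for kw in extract_keywords(text):
        conn.execute("INSERT INTO keywords VALUES (?, ?)", (doc_id, kw))

# Step 2: filter candidate documents by a user keyword with plain SQL;
# a second ChatGPT call would then rank this candidate subset.
rows = conn.execute(
    "SELECT DISTINCT doc_id FROM keywords WHERE keyword = ?", ("interest",)
).fetchall()
candidates = sorted(r[0] for r in rows)
print(candidates)  # docs 1 and 3 both mention "interest"
```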

Again, just to reiterate, I have no real experience with this, and maybe it won't work at all :slight_smile: But I think I would try something like this, and iterative improvements might lead to a working solution… I don't know… :slight_smile: What do you think?

P.S. To make things more concrete, I could imagine the prompt for creating the index:
"""Create 7-10 keywords for this article for indexing. The keywords will later be used for queries, so make each keyword 1 or 2 words long and as concise as possible. Output in JSON format."""
When querying:
"""I have these unique keywords in my database that were created from articles:
<kw1>
<kw2>
…
Match the following user keywords to the ones that are likely to best match the articles:
<usr_kw1>
<usr_kw2>
…
"""
Then you retrieve the documents with regular SQL.
Then you again ask ChatGPT to rank them by relevance.

Hey @arvyzukai,
You say you don’t have any experience, but your answer seems to be an amazing one. You just laid out all the steps in which we can perform information retrieval, using the ChatGPT API. As for the course, I haven’t gone over it yet, but will definitely check it out.

Initially, after reading your answer, I thought that what you proposed is only suitable for a one-off process, and not for building an information retrieval application, but after giving it another thought, your approach seems highly scalable as well.

For instance, say our application has 1 million one-page medical documents. Since we only need to extract the keywords for each document once, we can easily do this via the ChatGPT API and store them in a hierarchical database, as you mentioned. Then, for each user query, we just need to find the stored keywords most similar to the user's, run a SQL query to find the top 10 documents, and make just one more ChatGPT call to rank those 10 documents, and voila, we are done :mage:
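That query-time flow can be sketched with stand-ins: `difflib` plays the part of the ChatGPT keyword-matching step, and a plain overlap count plays the part of the final ranking call (the index below is entirely made up for illustration):

```python
from collections import Counter
import difflib

# Hypothetical index: document id -> keywords generated for it once.
index = {
    1: ["cardiology", "blood pressure", "hypertension"],
    2: ["oncology", "chemotherapy", "clinical trial"],
    3: ["hypertension", "diet", "blood pressure", "exercise"],
}

unique_keywords = sorted({kw for kws in index.values() for kw in kws})

def match_user_keywords(user_keywords, vocabulary):
    """Map each user keyword to its closest keyword in the index
    (difflib stands in for the ChatGPT matching step, and also
    tolerates typos like 'hypertention')."""
    matched = []
    for kw in user_keywords:
        close = difflib.get_close_matches(kw.lower(), vocabulary, n=1)
        if close:
            matched.append(close[0])
    return matched

def rank_documents(matched_keywords, index, top_n=10):
    """Rank documents by how many matched keywords they contain
    (a second ChatGPT call would do this step in the real pipeline)."""
    scores = Counter()
    for doc_id, kws in index.items():
        scores[doc_id] = len(set(kws) & set(matched_keywords))
    return [doc for doc, score in scores.most_common(top_n) if score > 0]

matched = match_user_keywords(["hypertention", "blood pressure"], unique_keywords)
print(rank_documents(matched, index))  # docs 1 and 3, not 2
```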

Once again, an amazing answer indeed :star:

Cheers,
Elemento


Thank you @Elemento for your kind words! I truly appreciate your praise.

In reality, this might or might not work :slight_smile: and I would be very cautious/reluctant to embark on anything involving “medical documents” :wink:

Cheers

P.S. By the way, if you have time, LangChain for LLM Application Development might also offer additional ideas on this topic.

Hey @arvyzukai,

Indeed, that is true. I only used medical documents as a toy example, but I believe that for most domain-specific applications, specialized models might be able to outshine ChatGPT in terms of performance. This gap could widen if the samples in our dataset are not exposed to public use, in which case there is a good chance ChatGPT wasn't trained on any related documents at all.

Indeed, that has been on my to-do list since the day it was released :joy:

Cheers,
Elemento