Advice on Information Retrieval Implementation with Naive Bayes

Hi!
NLP Course 1 Week 2 mentions information retrieval as one of the useful applications of Naive Bayes.

Could you suggest resources to implement this?

I have a list of key words from a text that I want to use to search and rank tweets by relevance. What would be the best / easiest way to achieve this?
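For a rough sense of the simplest possible baseline, here is a keyword-overlap ranking sketch (the tweets are made up for illustration; a real system would score with TF-IDF, BM25, or Naive Bayes class probabilities instead of raw overlap):

```python
# Minimal sketch: rank tweets by how many of the query keywords they
# contain. This is a crude relevance score, just to fix ideas.
keywords = {"nlp", "bayes", "classifier"}

tweets = [
    "just trained a naive bayes classifier for my nlp homework",
    "beautiful sunset at the beach today",
    "bayes rule never gets old",
]

def score(tweet, keywords):
    """Count how many query keywords appear in the tweet."""
    words = set(tweet.lower().split())
    return len(words & keywords)

# Sort tweets from most to least keyword overlap.
ranked = sorted(tweets, key=lambda t: score(t, keywords), reverse=True)
print(ranked[0])  # the tweet containing all three keywords
```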

Any advice or resources welcome!

Hey @ksenia-5,
Welcome, and we are glad that you could become a part of our community :partying_face:

I guess we all have access to the Internet for this. A quick search led me to the following resources. You can check them out, and find other resources on the web accordingly.

Let us know if these help.

Cheers,
Elemento

Hi @ksenia-5

To complement @Elemento’s answer:

If you’re just having fun and experimenting, there are datasets specifically designed for that purpose. You can easily find one that suits your idea and dive right in.

I understand that my response might not directly answer your question, but if you’re looking for a more serious approach, Naive Bayes may not be the best choice for information retrieval nowadays.
If you want a straightforward solution, you should consider prompt engineering.
However, if you’re after a more advanced and sophisticated method, exploring vector databases would be a worthwhile option.
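To give a flavour of the idea behind vector databases: documents and queries are turned into vectors and ranked by similarity. Here is a toy sketch using bag-of-words counts in place of learned embeddings (real systems use embedding models and approximate nearest-neighbour search, not word counts):

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector as a dict: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "naive bayes for text classification",
    "vector databases store embeddings for similarity search",
    "prompt engineering with large language models",
]

query = "similarity search with embeddings"
qv = vectorize(query)
# Rank all documents by cosine similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
print(ranked[0])
```

A vector database does essentially this, but over millions of dense embedding vectors with an index that avoids comparing against every document.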

Cheers

Hey @arvyzukai,
Thanks for sharing about vector databases, I didn't know that something like this existed. By the way, could you please elaborate a bit more on how we can use prompt engineering for information retrieval?

Say I have 1000 documents and 5-10 keywords, based on which I need to extract the 10 most relevant documents along with their rankings. Can we use something like ChatGPT to achieve this task (using either their website or API), or do we need some special ChatGPT plug-in for this? I assume we can't use a prompt that embeds the content of all 1000 documents, since the prompt would become excessively large in that case?

Cheers,
Elemento

Hey @Elemento

To be honest, I have no real experience using ChatGPT for information retrieval (or a real use case, for that matter), so these are just my thoughts :slight_smile:

I don't know if you have explored the “Building Systems with the ChatGPT API” course, but it contains some ideas that could be used for information retrieval (in particular, “Classification” and “Chain of Thought Reasoning”).

But to answer your question, it really depends on what you're trying to achieve with the system: how important accuracy is, how important speed is, how you provide value (make money, etc.), and how big the documents are (50 pages, or just 1 tweet, etc.).
But let's imagine your example with 1000 news articles (around 1 page each) and 5-10 keywords representing news categories or similar.

In this case, here’s a possible approach:

  1. Generating Keywords and Indexing:

    • We can use ChatGPT to come up with 7-10 keywords for each news article.
    • These keywords can be stored in a smart way, maybe organized hierarchically, to create an index (index creation might involve many iterative improvements).
    • We’ll also keep track of which keywords correspond to which news articles.
  2. Querying and Ranking:

    • To find relevant documents, we can use a simple SQL query to filter based on the desired news categories.
    • From this subset, we’ll present the documents to ChatGPT and ask which ones match the keywords the best.
    • Based on ChatGPT’s response, we’ll rank and prioritize the documents to show the most relevant ones first.
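The two steps above can be sketched end to end. In this sketch, `extract_keywords` is only a trivial stand-in for the ChatGPT API call (not a real API), and SQLite plays the role of the keyword index:

```python
import sqlite3

# Placeholder for a real ChatGPT API call; a trivial keyword
# extractor stands in so the pipeline runs end to end.
def extract_keywords(article):
    return sorted(set(article.lower().split()))[:7]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (doc_id INTEGER, keyword TEXT)")

articles = {
    1: "central bank raises interest rates",
    2: "local team wins championship final",
    3: "new interest in renewable energy markets",
}

# Step 1: generate keywords per article and index them,
# keeping track of which keywords belong to which article.
for doc_id, text in articles.items():
    for kw in extract_keywords(text):
        conn.execute("INSERT INTO keywords VALUES (?, ?)", (doc_id, kw))

# Step 2: filter candidate documents by a user keyword with plain SQL;
# a second ChatGPT call would then rank this candidate subset.
rows = conn.execute(
    "SELECT DISTINCT doc_id FROM keywords WHERE keyword = ?", ("interest",)
).fetchall()
candidates = sorted(r[0] for r in rows)
print(candidates)  # docs 1 and 3 both mention "interest"
```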

Again, just to reiterate, I have no real experience with this, and maybe it won't work at all :slight_smile: But I think I would try something like this, and iterative improvements might lead to a working solution… I don't know… :slight_smile: What do you think?

P.S. To make things more concrete, I could imagine the prompt for creating the index:
"""Create 7-10 keywords for this article for indexing. The keywords will later be used for queries, so make each keyword 1 or 2 words long and as concise as possible. Output in JSON format."""
When querying:
"""I have these unique keywords in my database that were created from articles:
<kw1>
<kw2>
…
Match the following user keywords to the ones that are likely to best match the articles:
<usr_kw1>
<usr_kw2>
…
"""
Then you retrieve the documents with regular SQL.
Then you again ask ChatGPT to rank them by relevance.

Hey @arvyzukai,
You say you don’t have any experience, but your answer seems to be an amazing one. You just laid out all the steps in which we can perform information retrieval, using the ChatGPT API. As for the course, I haven’t gone over it yet, but will definitely check it out.

Initially, after reading your answer, I thought that what you proposed is only suitable for a one-off process, and not for building an information retrieval application, but after giving it another thought, your approach seems highly scalable as well.

For instance, say our application has 1 million one-page medical documents. Since we only need to extract the keywords for each document once, we can easily do this via the ChatGPT API and store them in a hierarchical database, as you mentioned. Then, for each user query, we just need to find the stored keywords most similar to the user's, run a SQL query to find the top 10 documents, and make just one more ChatGPT call to rank those 10 documents, and voila, we are done :mage:
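That query-time flow can be sketched with stand-ins: `difflib` plays the part of the ChatGPT keyword-matching step, and a plain overlap count plays the part of the final ranking call (the index below is entirely made up for illustration):

```python
from collections import Counter
import difflib

# Hypothetical index: document id -> keywords generated for it once.
index = {
    1: ["cardiology", "blood pressure", "hypertension"],
    2: ["oncology", "chemotherapy", "clinical trial"],
    3: ["hypertension", "diet", "blood pressure", "exercise"],
}

unique_keywords = sorted({kw for kws in index.values() for kw in kws})

def match_user_keywords(user_keywords, vocabulary):
    """Map each user keyword to its closest keyword in the index
    (difflib stands in for the ChatGPT matching step, and also
    tolerates typos like 'hypertention')."""
    matched = []
    for kw in user_keywords:
        close = difflib.get_close_matches(kw.lower(), vocabulary, n=1)
        if close:
            matched.append(close[0])
    return matched

def rank_documents(matched_keywords, index, top_n=10):
    """Rank documents by how many matched keywords they contain
    (a second ChatGPT call would do this step in the real pipeline)."""
    scores = Counter()
    for doc_id, kws in index.items():
        scores[doc_id] = len(set(kws) & set(matched_keywords))
    return [doc for doc, score in scores.most_common(top_n) if score > 0]

matched = match_user_keywords(["hypertention", "blood pressure"], unique_keywords)
print(rank_documents(matched, index))  # docs 1 and 3, not 2
```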

Once again, an amazing answer indeed :star:

Cheers,
Elemento


Thank you @Elemento for your kind words! I truly appreciate your praise.

In reality, this might or might not work :slight_smile: and I would be very cautious/reluctant to embark on anything involving “medical documents” :wink:

Cheers

P.S. By the way, if you have time, LangChain for LLM Application Development might also offer additional ideas on this topic.

Hey @arvyzukai,

Indeed, that is true. I only used medical documents as a toy example, but I believe that for most domain-specific applications, specialized models might be able to outshine ChatGPT in terms of performance. This gap could widen if the samples in our dataset are not exposed to public use, in which case there is a good chance ChatGPT wasn't trained on any related documents at all.

Indeed, that has been on my to-do list since the day it was released :joy:

Cheers,
Elemento