In the second video, titled “Introduction to embedding models”, at about the 3:00 mark, the discussion is on “algorithms for optimal retrieval”, where the speaker talks about using a Cross-Encoder to determine relevance.
One of the drawbacks is stated as: “requires you to run the classification operation for every text chunk in your dataset, so this doesn’t scale”.
This Cross-Encoder approach is then juxtaposed to “sentence embedding models” and an ingestion flow.
The speaker goes on to contrast the “sentence embedding approach” with the “cross-encoding approach”, as though these are competing approaches with efficiency / accuracy tradeoffs.
My question, and what confuses me, is how the cross-encoding approach can be done in isolation, as a holistic solution. What would this look like without a vector database to retrieve from?
Is the speaker suggesting that an implementation could:
(1) iterate over the source documents / every “text chunk” in the dataset, concatenating Question, Separator, Answer for each chunk and computing relevance, and
(2) take the top-relevance chunks to send to the LLM for inference?
What confuses me is that cross-encoding is presented here as an independent approach to retrieval, which I’ve never heard of before. I’ve only heard of cross-encoders being used for “re-ranking” (which implies the results have already been ranked, presumably by something like “sentence embedding models”).
Are there any sources anyone can link where a cross-encoder only is used for retrieval?
While the encoder-decoder architecture can handle sequential data effectively, it struggles with long-range dependencies and may fail to capture relevant information from distant parts of the input sequence. This is where the attention mechanism comes into play.
I don’t know if you have taken the Natural Language Processing Specialisation, but the cross-encoder approach is best viewed from the perspective of the attention mechanism, which is a crucial component that allows the decoder to selectively focus on different parts of the input sequence when generating each output word.
It computes a context vector, which is a weighted sum of the encoder’s hidden state vectors, where the weights are dynamically calculated based on the relevance of each input word to the current decoding step.
This context vector, along with the decoder’s hidden state and the previously generated word, is used to predict the next word in the target sequence.
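To make the context-vector idea concrete, here is a tiny sketch of dot-product attention (my own illustration, not taken from the course or the attached file; the shapes and names are just assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Encoder hidden states: one vector per input word (here 4 words, dim 8)
encoder_states = np.random.randn(4, 8)

# Current decoder hidden state (dim 8)
decoder_state = np.random.randn(8)

# Relevance of each input word to the current decoding step (dot-product scores)
scores = encoder_states @ decoder_state      # shape (4,)
weights = softmax(scores)                    # attention weights, sum to 1

# Context vector: weighted sum of the encoder hidden states
context = weights @ encoder_states           # shape (8,)

# The context vector is then combined with the decoder's hidden state and the
# previously generated word to predict the next word in the target sequence.
print(weights, context.shape)
```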
I am attaching a file which could help one understand how this works.
Hi @0Ddumas0
Yes, that is exactly what I meant. You could (hypothetically) store each chunk of text (it wouldn’t be in a vector database per se, just in a plain text database). Then during retrieval, you would run the cross-encoder on each question/chunk pair, rank the chunks that way, and select the top chunks to send to the LLM for generation.
You are right to say “I’ve never heard of this before”, because it’s not really practical. I only mention it to explain why we use reranking as a second step rather than as a replacement for embeddings (on its own it’s too slow to be practical).
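For illustration only, here is a minimal sketch of what that hypothetical cross-encoder-only retrieval could look like, assuming the sentence-transformers CrossEncoder class (the checkpoint name, chunk store, and variable names are just illustrative choices, not something the course prescribes):

```python
from sentence_transformers import CrossEncoder

# Hypothetical plain-text store: every chunk in the dataset, no vector index
chunks = [
    "Chunk 1 text ...",
    "Chunk 2 text ...",
    "Chunk 3 text ...",
]

question = "Your question here"

# The cross-encoder scores every (question, chunk) pair jointly -- this is the
# part that doesn't scale, since the model runs once per chunk for every query.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(question, chunk) for chunk in chunks])

# Take the top-k chunks by relevance and send them to the LLM for generation
top_k = 2
ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
context_for_llm = [chunk for _, chunk in ranked[:top_k]]
print(context_for_llm)
```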
I was just going through the course, and I have a query, or more like a discussion point: don’t you feel the cross-encoder is still dependent on the answers fed in by the coder in order to detect the right embedding?
For example, what if the answers fed in are for a place whose name was changed?
Question: What is the capital of Karnataka?
Answers fed:
Bengaluru is the capital of Karnataka
Mysuru is the capital of Karnataka
Bangalore is the capital of Karnataka
So when I ran this code, this was the output: [0.9996363 0.999736 0.9997397]
The most relevant passage is: Bengaluru is the capital of Karnataka.
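(For context, pair scoring like this is typically done along the following lines with the sentence-transformers CrossEncoder class; the checkpoint name below is only my illustrative guess and may not match the one used in the course.)

```python
from sentence_transformers import CrossEncoder

question = "What is the capital of Karnataka?"
passages = [
    "Bengaluru is the capital of Karnataka",
    "Mysuru is the capital of Karnataka",
    "Bangalore is the capital of Karnataka",
]

# Score each (question, passage) pair jointly with the cross-encoder
model = CrossEncoder("cross-encoder/stsb-roberta-base")
scores = model.predict([(question, p) for p in passages])
print(scores)

# Pick the passage with the highest score as the "most relevant"
best = passages[int(scores.argmax())]
print("The most relevant passage is:", best)
```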
Although this is totally correct with respect to the current facts, why couldn’t it also take into account the context that Bengaluru was renamed from Bangalore, to be more precise, given that the model also uses cosine similarity and GloVe embeddings for contextual relevance?
I am sorry, I should have explained in more detail why I chose this question.
Before independence, the capital of Karnataka used to be Mysore (Mysuru).
Currently the capital of Karnataka is Bengaluru (which was previously called Bangalore).
So the output scores again seem to be higher for Mysuru and Bangalore, and yet Bengaluru is chosen as the best answer. It looks like the RAG setup is not able to understand the difference between is and was, which the output scores clearly show, since the score for Mysuru is also high.
But given all the details above, the cross-encoder doesn’t seem to hold up, as I still don’t get the reasoning behind the choice of best answer. If you look at the output scores, Bangalore seems to be the right answer!
I am just trying to make the model work as well as possible.
An embedding model should be able to correlate the answers and give the best response.