Found a very insightful post from Guido Appenzeller:
Super interesting research paper from Percy Liang and crew at Stanford University. In essence, long context windows for transformers don’t work (yet). In practice, this means vector databases are here to stay.
LLMs like Llama or GPT-4 have a limited context size (i.e. context window). The paper finds that LLMs with long context windows perform poorly unless the important information is at the beginning or the end of the input. Typical LLM input today is at most 15-60 pages of text (8k-32k tokens), so the relevant information needs to sit on roughly the first or last page. This makes long context windows much less useful.
Vector databases avoid this issue by retrieving only the relevant chunks of text via search and feeding a smaller amount of data into the LLM. This is already the dominant architecture today for cost reasons. The result is that this architecture is here to stay.
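To make the pattern concrete, here is a minimal sketch of that retrieval-augmented flow. The `embed`, `vector_db.search`, and `llm.complete` calls are hypothetical stand-ins for whatever embedding model, vector database client, and LLM client you actually use.

```python
def answer_with_retrieval(question: str, vector_db, llm, embed, top_k: int = 4) -> str:
    # Embed the question and pull only the most relevant chunks,
    # instead of stuffing whole documents into the context window.
    query_vector = embed(question)
    chunks = vector_db.search(query_vector, top_k=top_k)

    # The prompt stays small, so the relevant text sits near the start or end
    # of the input rather than getting lost in the middle.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Use the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)
```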
Even when using a vector database, I imagine this finding is worth keeping in mind when you structure your prompt.
For instance, to get the most out of an LLM's performance, I'd set the prompt up like this (a code sketch follows the list):
{system} (What does the LLM identify itself as? Could be something like “A helpful assistant called Chat”.)
{vectordb} (Whatever the vector database likes to add as context)
{context} (Previous chat information. Put it below the vector database information so that, I imagine, the whole thing reads more like a full chat log to the LLM.)
{new message} (The message the user just sent)
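Here is the sketch mentioned above, assembling the prompt in that order. The section names ({system}, {vectordb}, {context}, {new message}) follow the list; the function name, variable names, and formatting are just illustrative assumptions.

```python
def build_prompt(system: str, vectordb_context: str, chat_history: list[str], new_message: str) -> str:
    parts = [
        system,                                       # identity at the very start of the input
        "Relevant documents:\n" + vectordb_context,   # chunks returned by the vector database
        "Chat so far:\n" + "\n".join(chat_history),   # previous messages, reading like a chat log
        "User: " + new_message,                       # the actual question at the very end
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    system="You are a helpful assistant called Chat.",
    vectordb_context="(chunks returned by the vector database)",
    chat_history=["User: Hi", "Chat: Hello! How can I help?"],
    new_message="What does the paper say about long context windows?",
)
```

The idea is simply that the two things the model must not lose track of, its identity and the user's question, end up at the beginning and the end of the input, where the paper says models pay the most attention.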
This way, I imagine, the LLM will be more likely to stick to its identity and to the question the user asked.
Another interesting option might be summarizing the part of the chat that leads up to the new message. That would probably reduce hallucination, since the context is smaller, but it could introduce inaccurate information if the summary is wrong. Plus, how would we go about updating this summary?
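One naive way to handle the updating question is a rolling summary that gets rewritten whenever the raw history grows past a threshold. This is only a sketch, again assuming a hypothetical `llm.complete` client; the threshold and prompt wording are made up.

```python
def update_summary(summary: str, messages: list[str], llm, max_messages: int = 10) -> tuple[str, list[str]]:
    # Nothing to fold in yet: keep the summary and the full recent history.
    if len(messages) <= max_messages:
        return summary, messages

    # Fold the oldest messages into the summary and keep only the tail verbatim.
    to_fold, tail = messages[:-max_messages], messages[-max_messages:]
    prompt = (
        "Current summary of the conversation:\n" + summary + "\n\n"
        "New messages:\n" + "\n".join(to_fold) + "\n\n"
        "Rewrite the summary so it also covers the new messages. Be concise."
    )
    new_summary = llm.complete(prompt)
    return new_summary, tail
```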