Module5 || caching || latency vs response quality

How do you maintain the count of a prompt? Is it by reframing the user’s query into a refined prompt and then saving it in the cache with a count, so that when we get a similar refined prompt (based on a similarity score) we return the response stored in the cache?

Hello Rohan_Devaki.

When it comes to cache management, you don’t need to keep a count of prompts. What gets saved in the cache is whatever we consider necessary; it’s within our control.

In this case, the idea of using the cache is to improve latency.

It’s suggested to maintain a cache of frequently sent prompts along with their responses. When a new prompt is received, its similarity to the cached prompts can be calculated; if a sufficiently similar prompt is found in the cache, the stored response can be returned, which avoids the slower generation process.
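
To make that flow concrete, here is a minimal sketch of a similarity-based cache. Everything in it is an assumption for illustration, not something from the course material: the `SemanticCache` class, the `answer` helper, the 0.8 threshold, and especially `embed`, which is a toy bag-of-words stand-in for whatever real embedding model you would use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): bag-of-words counts, standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: stores (embedding, response) pairs and matches by similarity."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # illustrative value; tune for your data
        self.entries = []

    def get(self, prompt):
        """Return the response of the most similar cached prompt, or None if below threshold."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))


def answer(prompt, cache, generate):
    """Serve from the cache when a similar prompt exists; otherwise generate and store."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached  # cache hit: the slow generation step is skipped
    response = generate(prompt)
    cache.put(prompt, response)
    return response
```

Here `generate` is whatever function calls your model. Note that no per-prompt count is needed for this to work. The threshold is where the latency versus response quality trade-off shows up: a lower value gives more cache hits but a higher chance of returning an answer to a slightly different question.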

In the case you’re suggesting, “restructuring the user’s query into a refined request”, the refined request can be saved in the cache as a new entry, so that when a similar request is made later, the response is obtained directly from the cache instead of being generated again.
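
As a rough sketch of that variant, the cache can be keyed on the refined prompt rather than the raw one. The names here (`refine_prompt`, `RefinedPromptCache`, the `FILLER` word list, the 0.9 threshold) are hypothetical, and the rule-based refinement is only a placeholder for however you actually rewrite the query. Similarity is measured with the standard library’s `difflib.SequenceMatcher` purely to keep the example self-contained.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny list of filler words to drop during refinement.
FILLER = {"please", "kindly", "hi", "hello", "hey", "thanks"}


def refine_prompt(raw):
    """Placeholder refinement: lowercase, strip punctuation, drop filler words."""
    words = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(w for w in words if w not in FILLER)


class RefinedPromptCache:
    """Hypothetical cache keyed on the refined form of each prompt."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # illustrative value
        self.store = {}  # refined prompt -> response

    def lookup(self, raw):
        """Return the first cached response whose refined key is similar enough, else None."""
        refined = refine_prompt(raw)
        for key, response in self.store.items():
            if SequenceMatcher(None, refined, key).ratio() >= self.threshold:
                return response
        return None

    def save(self, raw, response):
        """Store the response under the refined form of the raw prompt."""
        self.store[refine_prompt(raw)] = response
```

With this, “Hello, please explain semantic caching!” and “explain semantic caching” refine to the same key, so the second request is served from the cache.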

The idea of using cache and similarity helps optimize the system’s response.

Regards

Ronweld B.