Module5 || caching || latency vs response quality

Rohan_Devaki · October 1, 2025, 11:41am

How do you maintain the count of a prompt? Is it by reframing the user’s query into a refined prompt and then saving it in the cache with a count, so that when a similar refined prompt comes in (based on the similarity score) we return the response stored in the cache?

ribarola · October 15, 2025, 5:58pm

Hello Rohan_Devaki.

In cache management, it’s not necessary to keep a count of prompts. What gets saved in the cache is whatever we consider necessary; that’s within our control.

In this case, the idea of using the cache is to improve latency.

The suggestion is to maintain a cache of frequently sent prompts along with their responses. When a new prompt is received, its similarity to the cached prompts can be calculated; if a sufficiently similar prompt is found in the cache, the stored response can be returned, which avoids the slower generation process.
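As a rough illustration of that lookup flow, here is a minimal sketch. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and the names `SemanticCache`, `answer`, `generate`, and the `0.8` threshold are all hypothetical choices for this example, not something from the course.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical cutoff; tune it for your data
        self.entries = []           # list of (prompt vector, response)

    def get(self, prompt):
        """Return the response of the most similar cached prompt, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))


def answer(prompt, cache, generate):
    """Serve from the cache when possible; otherwise call the slow generator and store the result."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = generate(prompt)
    cache.put(prompt, response)
    return response
```

The threshold is where latency trades off against response quality: a lower value gives more cache hits (faster), but a higher chance of returning an answer that does not quite match the new prompt.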

In the case you’re suggesting, of “restructuring the user’s query into a refined request”, the refined request can be saved in the cache as a new entry, so that when a similar request is made later, the response is returned directly from the cache instead of being generated again.
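If you do want a count, it can simply be optional metadata on each entry. The sketch below assumes a hypothetical `refine` step (plain text normalization here, where a real system might use an LLM rewrite) and, to stay short, matches refined prompts exactly rather than by similarity score. `CountingCache` and `least_used` are illustrative names only.

```python
import re


def refine(query):
    """Hypothetical refinement step: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", query.lower()))


class CountingCache:
    def __init__(self):
        # refined prompt -> {"response": str, "hits": int}
        self.store = {}

    def get(self, query):
        """Look up the refined form of the query; bump the hit count on a match."""
        entry = self.store.get(refine(query))
        if entry is None:
            return None
        entry["hits"] += 1
        return entry["response"]

    def put(self, query, response):
        """Save the response under the refined prompt, starting with zero hits."""
        self.store[refine(query)] = {"response": response, "hits": 0}

    def least_used(self):
        """Refined prompt with the fewest hits, e.g. a candidate to drop when the cache is full."""
        if not self.store:
            return None
        return min(self.store, key=lambda key: self.store[key]["hits"])
```

The count plays no part in deciding whether a lookup is a hit; it only records how often an entry was reused, which is why it is not required for the cache to work.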

Using a cache together with similarity matching helps reduce the system’s response time.

Regards

Ronweld B.
