Hey everyone—quick question.
I’ve been seeing a pattern lately: teams invest in better models, tweak prompts, add tools… and yet their AI bill doesn’t drop. Sometimes it even creeps up, even when user traffic stays stable.
That made me wonder whether the root cause is less about “model pricing” and more about how often you’re effectively reusing work.
So I’m curious: how are you handling caching and reuse in your AI systems?
When people say “we cache,” I often find they cache the obvious part (like embeddings or final responses), but the expensive part still gets recomputed. In practice, the cost might be leaking through:
- repeated requests that look similar but aren’t token-identical
- tool call results that aren’t cached (or are cached with too-short TTLs)
- agent steps that re-run retrieval / planning even when the inputs haven’t changed
- context/history replay that defeats cache hits
My working theory (and what I’ve tried)
In systems with orchestration (multi-step, tool use, routing), cost is driven by the number of “unique execution paths”, not just the number of users. If caching doesn’t recognize execution equivalence, you end up paying for the same reasoning multiple times.
For example, two requests might have:
- the same user intent
- similar retrieved facts
- the same tool outputs
…but different message ordering, timestamps, or system prompt variants, so the cache key misses.
What I recommend checking first
Some questions:
Do you measure cache hit rate end-to-end? If yes, what are your biggest cost contributors that still don’t get cached?
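To make "end-to-end" concrete, here is a rough sketch of the kind of per-stage accounting I have in mind. The class name `StageCacheStats`, the stage labels, and the dollar amounts are all made up for illustration; it only assumes you can tell, at each step, whether the cache was hit and roughly what a miss cost.

```python
from collections import defaultdict


class StageCacheStats:
    """Tracks hits, misses, and the cost paid on misses, per pipeline stage."""

    def __init__(self) -> None:
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)
        self.miss_cost = defaultdict(float)

    def record(self, stage: str, hit: bool, cost_usd: float = 0.0) -> None:
        # Only misses cost money; hits are counted but add no spend.
        if hit:
            self.hits[stage] += 1
        else:
            self.misses[stage] += 1
            self.miss_cost[stage] += cost_usd

    def hit_rate(self, stage: str) -> float:
        total = self.hits[stage] + self.misses[stage]
        return self.hits[stage] / total if total else 0.0

    def top_uncached_cost(self) -> list[tuple[str, float]]:
        """Stages sorted by money spent on cache misses, largest first."""
        return sorted(self.miss_cost.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative numbers only, not measurements.
stats = StageCacheStats()
stats.record("retrieval", hit=True)
stats.record("retrieval", hit=False, cost_usd=0.002)
stats.record("planning", hit=False, cost_usd=0.03)
stats.record("planning", hit=False, cost_usd=0.03)

print(stats.hit_rate("retrieval"))
print(stats.top_uncached_cost())
```

The point of sorting by miss cost rather than by hit rate is that a stage with a decent hit rate can still dominate the bill if its misses are expensive.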
How do you define cache keys so they don’t miss due to tiny prompt differences?
If you share your approach (even rules of thumb), I’d love to compare notes. I’m especially interested in what actually works in production, not just what sounds good in theory.