
As an aside, I built a tool to manage my own chat interface over the provider APIs. I added caching because the savings are quite significant, and I have a little countdown timer that shows how much time remains before the cache expires.

However, for a basic turn-based conversation the 5-minute cache lifetime is almost always too short. By the time I have read the LLM response, considered my next question, and written it out, I have frequently missed the cache.

I imagine it is much more useful if you have a tool with a common prefix (like a system instruction, tool specs, or a shared set of context across many users).

If you can hit the cache frequently enough, the savings are well worth it.



Give it a skill that runs a timer in the background and every 4.5 minutes says "ping? pong!"


Interesting idea. I suppose one could also use response settings (e.g. max response tokens) to ensure the model doesn't waffle on and run up costs. In a best-case scenario "ping" would be one or two input tokens and the "pong" response would be one or two output tokens, so the cost of each ping would be roughly the preserved context size times the cache read price. One could avoid paying for a cache write, since I believe a cache read resets the platform's cache timer.
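A rough sketch of that arithmetic, in case it helps. The prices (per million tokens) are made-up placeholders, not any provider's actual rates, and `ping_cost` is just a name I picked for illustration. It assumes the ping is billed as a cache read over the whole preserved context plus a couple of fresh input and output tokens.

```python
def ping_cost(context_tokens, cache_read_price, input_price, output_price,
              ping_tokens=2, pong_tokens=2):
    """Estimated cost of one keep-alive ping.

    All prices are per million tokens (hypothetical values; look up the
    real ones for your provider). Returns cost in the same currency unit.
    """
    cached = context_tokens * cache_read_price   # re-reading the cached prefix
    fresh = ping_tokens * input_price            # the "ping" itself
    reply = pong_tokens * output_price           # the "pong" response
    return (cached + fresh + reply) / 1_000_000


if __name__ == "__main__":
    # 100k tokens of preserved context, placeholder prices.
    print(ping_cost(100_000, cache_read_price=0.30,
                    input_price=3.00, output_price=15.00))
```

With any realistic context size the cache read term dominates, so the ping and pong tokens are close to a rounding error.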

It would be interesting to graph the cost/savings of this approach based on context length, percent cached, etc.
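Short of a graph, here is a minimal sketch of the break-even comparison. It makes several assumptions: the prices are placeholders per million tokens, an expired cache means paying the cache write price again on the cached prefix at the next turn, a kept-alive cache means paying the cache read price once per ping plus once for the next real turn, and the few ping/pong tokens are ignored. `cached_tokens` is the cached part of the context only, since the uncached part costs the same either way. The function names are mine.

```python
import math


def pings_needed(gap_s, interval_s=270):
    """Pings sent while the user is away for gap_s seconds (one every interval_s)."""
    return max(0, math.ceil(gap_s / interval_s) - 1)


def keepalive_savings(cached_tokens, gap_s, write_price, read_price, interval_s=270):
    """Cost of letting the cache expire minus cost of keeping it alive.

    Prices are per million tokens and hypothetical. A positive result means
    the ping/pong approach was cheaper for this gap between user turns.
    """
    n = pings_needed(gap_s, interval_s)
    cost_expire = cached_tokens * write_price / 1_000_000
    cost_keepalive = cached_tokens * read_price * (n + 1) / 1_000_000
    return cost_expire - cost_keepalive


if __name__ == "__main__":
    for gap_min in (10, 30, 60):
        for tokens in (10_000, 100_000):
            s = keepalive_savings(tokens, gap_min * 60, write_price=3.75, read_price=0.30)
            print(f"gap={gap_min:>2} min  cached={tokens:>7}  savings={s:+.4f}")
```

Under these assumptions the break-even point depends only on the ratio of write price to read price, not on the context length: pinging stops paying off once the number of pings plus one exceeds that ratio.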

The UI for this is a bit tricky. I could mark conversations as "active" and then do the ping/pong dance only on active conversations, up to some maximum keep-alive duration (e.g. 1 hour).
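The bookkeeping for that might look something like the sketch below. Everything here is hypothetical (the `Conversation` fields, the constants, the `due_for_ping` helper), and it only decides which conversations need a ping; actually sending the request to the provider is left out.

```python
from dataclasses import dataclass

PING_INTERVAL_S = 270    # 4.5 minutes, just under the 5-minute cache window
MAX_KEEPALIVE_S = 3600   # stop pinging after 1 hour without user activity


@dataclass
class Conversation:
    conv_id: str
    last_user_activity: float  # time of the user's last real message
    last_cache_touch: float    # time of the last request (user turn or ping)
    active: bool = True        # the "active" flag toggled in the UI


def due_for_ping(conversations, now):
    """Return the conversations that should be pinged at time `now`.

    Conversations idle longer than MAX_KEEPALIVE_S are marked inactive
    so their cache is allowed to expire.
    """
    due = []
    for conv in conversations:
        if not conv.active:
            continue
        if now - conv.last_user_activity >= MAX_KEEPALIVE_S:
            conv.active = False  # give up on this one
            continue
        if now - conv.last_cache_touch >= PING_INTERVAL_S:
            due.append(conv)
    return due
```

A background timer would call `due_for_ping` every few seconds with the current time, send the ping for each returned conversation, and then update its `last_cache_touch`.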



