This is a surprisingly good read on how LLMs work in general.

Does anyone know whether the cache is segregated by user/API key for the big providers?

I was looking at modifying outgoing requests via a proxy and wondering whether that harms caching. Common coding tools presumably share a prompt across all their installs, so a universal cache would save a lot.

1 month ago by Havoc

It was a real facepalm moment when I realised we were busting the cache on every request by including the date and time near the top of the main prompt.

Even just moving it to the bottom helped move a lot of our usage into cache.

Probably went from something like 30-50% cached tokens to 50-70%.
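
A minimal sketch of why the position matters, assuming the provider can only reuse the longest prefix that two requests share. It compares characters as a rough stand-in for tokens (real caches work on tokens, often in fixed-size blocks), and `STATIC_INSTRUCTIONS`, `prompt_timestamp_first`, `prompt_timestamp_last` and `shared_prefix_len` are made-up names for illustration, not anything from a real SDK.

```python
from datetime import datetime, timezone

# Hypothetical static part of the prompt; identical on every request.
STATIC_INSTRUCTIONS = (
    "You are a helpful assistant. Follow the house style guide. "
    "Answer concisely and cite sources where possible.\n"
)


def prompt_timestamp_first(now: datetime) -> str:
    """Timestamp near the top: the prompts diverge almost immediately."""
    return f"Current time: {now.isoformat()}\n" + STATIC_INSTRUCTIONS


def prompt_timestamp_last(now: datetime) -> str:
    """Timestamp at the bottom: the static part stays a shared prefix."""
    return STATIC_INSTRUCTIONS + f"Current time: {now.isoformat()}\n"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b, in characters."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two requests sent one second apart.
t1 = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2025, 1, 1, 12, 0, 1, tzinfo=timezone.utc)

reusable_first = shared_prefix_len(prompt_timestamp_first(t1), prompt_timestamp_first(t2))
reusable_last = shared_prefix_len(prompt_timestamp_last(t1), prompt_timestamp_last(t2))

print("timestamp first, reusable chars:", reusable_first)
print("timestamp last, reusable chars:", reusable_last)
```

With the timestamp first, the shared prefix ends inside the timestamp itself, so none of the static instructions can be reused; with it last, the whole static block is reusable.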

1 month ago by duggan

A really clear explanation!

So if I were running a provider I would be caching popular prefixes for questions across all users. There must be so many questions that start 'what is' or 'who was' etc?

Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse them somehow, rather than needing to iterate through them token by token? E.g. there must be lots of times that "and then tell me what" appears in the middle of a prompt?

1 month ago by willvarfar

When will Microsoft do this sort of thing?

It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then, after processing a batch of files, running into:

https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...

1 month ago by WillAdams

I gave the table of inputs and outputs to both Gemini 3.0 Flash and GPT 5.2 Instant, and they were stumped.

https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1

1 month ago by holbrad

But why is this posted on ngrok?

1 month ago by dangoodmanUT

What a fantastic article! How did you create the animations?

1 month ago by who-shot-jr