The inference logic of an LLM stays the same either way: recomputing the keys and values for the entire prefix and reading them back from a KV cache produce identical outputs. The only difference is the amount of memory and computation required to get there.
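A minimal sketch of that equivalence, using a toy single-head attention in PyTorch (all names and shapes here are illustrative, not any particular model's implementation):

```python
import torch

torch.manual_seed(0)
d = 16                                   # toy model/head dimension
seq = torch.randn(10, d)                 # hidden states for tokens 1..10
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    scores = (q @ K.T) / d**0.5
    return torch.softmax(scores, dim=-1) @ V

# Path 1: recompute K/V projections for every token in the prefix.
K_full, V_full = seq @ Wk, seq @ Wv
out_recompute = attend(seq[-1] @ Wq, K_full, V_full)

# Path 2: K/V for tokens 1..9 come from a cache filled on the previous
# step; only the newest token's projections are computed now.
K_cache, V_cache = seq[:-1] @ Wk, seq[:-1] @ Wv
K = torch.cat([K_cache, seq[-1:] @ Wk])
V = torch.cat([V_cache, seq[-1:] @ Wv])
out_cached = attend(seq[-1] @ Wq, K, V)

print(torch.allclose(out_recompute, out_cached))  # True (up to float tolerance)
```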
The same can be said about any recurrent network. To predict token n+1 you could recalculate the hidden state up to token n, or reuse the hidden state of token n from the previous forward pass. The only difference is the amount of memory and computation.
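The same check for a recurrent network, again just a sketch with an off-the-shelf GRU and made-up sizes: rolling the whole prefix forward again and reusing the stored hidden state land on the same state.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.GRU(input_size=8, hidden_size=8, batch_first=True)
tokens = torch.randn(1, 11, 8)           # embeddings for tokens 1..n+1

# Recompute: run the RNN over the whole sequence from scratch.
_, h_full = rnn(tokens)

# Reuse: keep the hidden state from the previous pass (tokens 1..n)
# and feed only token n+1.
_, h_prefix = rnn(tokens[:, :-1])
_, h_incremental = rnn(tokens[:, -1:], h_prefix)

print(torch.allclose(h_full, h_incremental))  # True (up to float tolerance)
```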
The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token, without compression, which is what gives it its (theoretical) perfect recall.
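A back-of-the-envelope for that linear growth, with illustrative numbers roughly in the range of a 7B-class model in fp16 (treat the exact figures as assumptions, not a spec):

```python
# KV cache stores keys and values for every layer and every past token.
layers, kv_heads, head_dim, bytes_per = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
print(per_token / 1024, "KiB per token")                  # ~512 KiB
for ctx in (2_048, 32_768):
    print(ctx, "tokens ->", per_token * ctx / 2**30, "GiB")  # ~1 GiB, ~16 GiB
```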