Was able to sign up for the Max plan & start using it via opencode. It does a wa...

NitpickLawyer · 2025-11-08T06:21:31 1762582891

> For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent - my opencode agent's context window is only 50% used at 72k tokens, but Cerebra's tracking online shows that I've used 1M input tokens and 10k output tokens already.

This is how every "chatbot" / "agentic flow" / etc works behind the scenes. That's why I liked that "you should build an agent" post a few days ago. It gets people to really understand what's behind the curtain. It's requests all the way down, sometimes with more context added, sometimes with less (subagents & co).

embedding-shape · 2025-11-09T13:23:04 1762694584

Many API endpoints (and local services for that matter) does caching at this point though, with much cheaper prices for input/outputs that were found in the caching. I know Anthrophic does this, and DeepSeek I think too, at the very least.

zaptrem · 2025-11-08T09:27:53 1762594073

They don't have prefix caching? Claude and Codex have this.

versteegen · 2025-11-08T09:58:55 1762595935

At those speeds, it's probably impossible. It would require enormous amounts of memory (which the chip simply doesn't have, there's no room for it) or rather a lot of bandwidth off-chip to storage, and again they wouldn't want to waste surface area on the wiring. Bit of a drawback of increasing density.